* [ANNOUNCE] Lustre Lite 1.0 beta 1
From: Peter Braam @ 2003-03-12 17:56 UTC
To: lustre-announce, linux-fsdevel, linux-kernel
Subject: Lustre Lite 0.6 released (1.0 beta 1)
Summary
-------
We're pleased to announce that the first Lustre Lite beta (0.6) has
been tagged and released. Seven months have passed since our last
major release, and Lustre Lite is quickly approaching the goal of
being stable, consistent, and fast on clusters up to 1,000 nodes.
Over the last few months we've spent thousands of hours improving and
testing the file system, and now it's ready for a wider audience of
early adopters, in particular users of ia32 and ia64 Linux systems
running 2.4.19 or Red Hat 2.4.18-based kernels. Lustre may work on
other Linux platforms, but it has not been extensively tested on them
and may require some additional porting effort.
We expect that you will find many bugs that we are unable to provoke
in our testing, and we hope that you will take the time to report them
to our bug system (see Reporting Bugs below).
Features
--------
Lustre Lite 0.6:
- has been tested extensively on ia32 and ia64 Linux platforms
- supports TCP/IP and Quadrics Elan3 interconnects
- supports multiple Object Storage Targets (OSTs) for file data storage
- supports multiple Metadata Servers (MDSs) in an active/passive
failover configuration (requires shared storage between MDS nodes).
Automatic failover requires an external failover package such as
Red Hat's clumanager.
- provides a nearly POSIX-compliant filesystem interface (some areas
remain non-compliant; for example, we do not synchronize atimes)
- aims to recover from any single failure without loss of data or
application errors
- scales well; we have tested with as many as 1,100 clients and 128 OSTs
- is Free Software, released under the terms and conditions of the GNU
General Public License
Risks
-----
As with any beta software, and especially with kernel modules, Lustre
Lite carries a real risk of data loss or system crashes. It is very
likely that users will test situations which we have not, and provoke
bugs which crash the system. We must insist that you
BACKUP YOUR DATA
prior to installing Lustre, and that you understand that
we make NO GUARANTEES about Lustre.
Please read the COPYING file included with the distribution for more
information about the licensing of Lustre.
Known Bugs
----------
Although Lustre is for the most part stable, there are some known bugs
with this current version that you should be particularly aware of:
- Some high-load situations involving multiple clients have been known
to provoke a client crash in the lock manager (bug 984)
- Failover support is incomplete; some access patterns will not
recover correctly
- Recovery does not gracefully handle multiple services present on the
same node
- Failures can lead to unrecoverable states, which require the file
  system to be unmounted and remounted (and, in some cases, nodes may
  require a reboot)
- Unmounting a client while an MDS is down may hang the "umount"
  command, which will need to be killed manually (bug 978)
- Metadata recovery will time out and abort if there are clients which
hold uncommitted requests, but which do not detect the death and
failover of the MDS. Running a metadata operation on quiescent
clients will cause them to join recovery. (bug 957)
Getting Started
---------------
<https://projects.clusterfs.com/lustre/LustreHowto> contains
instructions for downloading, building, configuring, and running
Lustre. If you encounter problems, you can seek help from others in
the lustre-discuss mailing list (see below).
Reporting Bugs
--------------
We are eager to hear about new bugs, especially if you can tell us how
to reproduce them. Please visit <http://bugzilla.lustre.org/> to
report problems.
The closer you can come to the ideal described in
<https://projects.clusterfs.com/lustre/BugFiling>, the better.
Mailing Lists
-------------
See <http://www.lustre.org/lists.html> for links to the various Lustre
mailing lists.
Acknowledgement
---------------
The US government has funded much of the Lustre effort through the
National Laboratories. In addition to funding, they have provided
invaluable experience and fantastic help with testing, in terms of
both equipment and people. We thank them all, in particular Mark
Seager, Bill Boas, and Terry Heidelberg's team at LLNL, who went far
beyond the call of duty, as well as Lee Ward (Sandia), Gary Grider
(LANL), and Scott Studham (PNNL). We have also had the benefit of
partnerships with UCSC, HP, Intel, BlueArc, and DDN, and we are
grateful to them.
---
Thank you for your interest in and testing of Lustre. We appreciate
your effort, patience, and bug reports as we work towards Lustre Lite
1.0.
The Cluster File Systems team
Peter J. Braam <braam@clusterfs.com>
Phil Schwan <phil@clusterfs.com>
Andreas Dilger <adilger@clusterfs.com>
Robert Read <rread@clusterfs.com>
Eric Barton <eeb@clusterfs.com>
Radhika Vullikanti <radhika@clusterfs.com>
Mike Shaver <shaver@clusterfs.com>
Eric Mei <ericm@clusterfs.com>
Zach Brown <zab@clusterfs.com>
Chris Cooper <ccooper@clusterfs.com>
* Re: [ANNOUNCE] Lustre Lite 1.0 beta 1
From: David Chow @ 2003-03-16 5:38 UTC
To: Peter Braam; +Cc: linux-fsdevel
Peter Braam wrote:
>Features
>--------
>
>Lustre Lite 0.6:
>
>- has been tested extensively on ia32 and ia64 Linux platforms
>- supports TCP/IP and Quadrics Elan3 interconnects
>- supports multiple Object Storage Targets (OSTs) for file data storage
>- supports multiple Metadata Servers (MDSs) in an active/passive
> failover configuration (requires shared storage between MDS nodes).
> Automatic failover requires an external failover package such as
> Red Hat's clumanager.
>- provides a nearly POSIX-compliant filesystem interface (some areas
> remain non-compliant; for example, we do not synchronize atimes)
>- aims to recover from any single failure without loss of data or
> application errors
>- scales well; we have tested with as many as 1,100 clients and 128 OSTs
>- is Free Software, released under the terms and conditions of the GNU
> General Public License
>
>
>
>
Quite interesting. How is it different from GFS and other cluster
file systems? Is it a SAN/shared-SCSI disk-sharing file system, or a
cluster file system that rides on something like a gnbd/nbd/iSCSI
pool? It doesn't seem sensible to run a cluster file system's data
channels over TCP/IP, as it is too slow; the price of fibre-channel
storage keeps dropping, and sites that deploy cluster file systems in
production can usually afford fibre-channel storage. The name "OST"
seems interesting, but I don't really like iSCSI or a pool of block
devices spread across multiple machines, as it is very hard to manage
and likely to fail. How does Lustre handle these scenarios?
regards,
David Chow
* Re: [ANNOUNCE] Lustre Lite 1.0 beta 1
From: Andreas Dilger @ 2003-03-16 8:38 UTC
To: David Chow; +Cc: Peter Braam, linux-fsdevel
On Mar 16, 2003 13:38 +0800, David Chow wrote:
> Peter Braam wrote:
> >Features
> >--------
> >Lustre Lite 0.6:
> >
> >- has been tested extensively on ia32 and ia64 Linux platforms
> >- supports TCP/IP and Quadrics Elan3 interconnects
> >- supports multiple Object Storage Targets (OSTs) for file data storage
> >- supports multiple Metadata Servers (MDSs) in an active/passive
> > failover configuration (requires shared storage between MDS nodes).
> > Automatic failover requires an external failover package such as
> > Red Hat's clumanager.
> >- provides a nearly POSIX-compliant filesystem interface (some areas
> > remain non-compliant; for example, we do not synchronize atimes)
> >- aims to recover from any single failure without loss of data or
> > application errors
> >- scales well; we have tested with as many as 1,100 clients and 128 OSTs
> >- is Free Software, released under the terms and conditions of the GNU
> > General Public License
>
> Quite interesting. How is it different from GFS and other cluster
> file systems?
One primary difference between Lustre and other cluster file systems is
that Lustre is designed with enormous scalability in mind. It can already
(in the first "Lite" version) handle over 1000 clients, multi GB/s aggregate
IO rates, and many tens or hundreds of TB of storage. Future plans scale
up to tens of thousands of clients, and PB of storage.
> Is it a SAN/shared-SCSI disk-sharing file system, or a cluster file
> system that rides on something like a gnbd/nbd/iSCSI pool?
Not really either of these. Primarily it is a multi-server
distributed file system, with (currently) one server doing all of the
metadata work and a configurable number (from one up to hundreds) of
storage servers (OSTs). The servers are independent (generally
Linux) boxes with attached disks. There is the possibility of a "SAN"
interface from the clients to the storage servers, but this isn't the
common usage scenario.
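To make the usual (non-SAN) data path concrete, here is a rough C
sketch; every name in it is made up for illustration and none of it
comes from the Lustre sources. The client does one metadata call to
the MDS to learn a file's striping layout, then moves data directly
to and from the OSTs, so the MDS never sits in the data path.

/*
 * Rough sketch only (invented names, not Lustre source).  One
 * metadata call to the MDS returns the striping layout; after that
 * the client computes which OST holds a given file offset and talks
 * to that OST directly, so the MDS is not in the data path.
 */
#include <stdio.h>
#include <string.h>

struct layout {
    int  stripe_count;   /* number of OSTs this file is striped over */
    long stripe_size;    /* bytes per stripe chunk                   */
};

/* Simulated MDS lookup: a real client would do one RPC here. */
static struct layout mds_lookup(const char *path)
{
    struct layout lo = { .stripe_count = 4, .stripe_size = 65536 };
    (void)path;          /* a real MDS would consult its namespace */
    return lo;
}

/* Simulated OST write: a real client would send a bulk RPC here. */
static void ost_write(int ost, long obj_off, const char *data)
{
    printf("OST%d: write %zu bytes at object offset %ld\n",
           ost, strlen(data), obj_off);
}

int main(void)
{
    struct layout lo = mds_lookup("/lustre/demo");   /* metadata: MDS, once */
    long file_off = 3 * lo.stripe_size + 100;        /* arbitrary file offset */

    /* data: plain RAID-0 arithmetic picks the OST and object offset */
    long chunk = file_off / lo.stripe_size;
    int  ost   = (int)(chunk % lo.stripe_count);
    long ooff  = (chunk / lo.stripe_count) * lo.stripe_size
               + file_off % lo.stripe_size;

    ost_write(ost, ooff, "hello");                   /* data: OST, directly */
    return 0;
}

The offset arithmetic is just RAID-0 style striping over per-OST
objects; in the real system these are of course network RPCs rather
than local calls.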
> It doesn't seem sensible to run a cluster file system's data channels
> over TCP/IP, as it is too slow; the price of fibre-channel storage
> keeps dropping, and sites that deploy cluster file systems in
> production can usually afford fibre-channel storage.
Well, TCP/IP is one of the network types we support (via Portals
Network Abstraction Layers, NALs), mostly for development/testing,
along with Quadrics Elan (our large clusters use these, and they are
VERY fast) and an in-development Myrinet NAL.
The reasons that FC isn't the primary interconnect are:
1) FC is still more expensive than GigE (although less expensive than Elan)
2) you can't scale FC to thousands of nodes
3) more people have TCP interconnect than anything else (even if it is
TCP-over-FC)
4) none of our customers have actually asked for FC yet
> The name "OST" seems interesting but I don't really like the iSCSI or pool
> of block devices spreading across multiple machines as it is very hard to
> manage and likely to fail easily. How is lustre handling these senarios?
Well, each OST is primarily (from the Lustre sense) the network interface
protocol, and the internal implementation is opaque to the outside world.
Each OST is independent of the others, although the clients end up allocating
files on all of them.
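As a toy illustration of that independence (invented code, not from
the Lustre tree): each OST hands out object identifiers from its own
private namespace, knows nothing about its peers, and a striped file
simply ends up owning one object on each OST it uses.

/*
 * Toy sketch (invented, not Lustre code).  Each OST allocates object
 * ids out of its own private counter and never references any other
 * OST; a striped file just ends up owning one object per OST.
 */
#include <stdio.h>
#include <stdint.h>

#define NUM_OST 3

/* Per-OST state: nothing here knows about the other OSTs. */
static uint64_t next_object_id[NUM_OST] = { 1, 1, 1 };

/* "RPC" to one OST: create a new object in its private namespace. */
static uint64_t ost_create_object(int ost)
{
    return next_object_id[ost]++;
}

int main(void)
{
    for (int file = 0; file < 2; file++) {      /* create two striped files */
        printf("file %d ->", file);
        for (int ost = 0; ost < NUM_OST; ost++)
            printf(" (OST%d, object %llu)", ost,
                   (unsigned long long)ost_create_object(ost));
        printf("\n");
    }
    return 0;
}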
For Linux OSTs we use ext3 on regular block devices (raw disk, MD RAID,
LOV, whatever you want to use) for the actual data storage, and the
filesystem is journaled/managed totally independent from all of the
other OSTs. We have also used reiserfs for OST storage at times (and
conceivably you could use XFS/JFS), and there are also 3rd party vendors
who are building OST boxes with their own non-Linux internals.
Since this is just regular disk attached to regular Linux boxes, it is
also possible to do storage server failover (already being implemented)
without clients even being aware of a problem. Single disk failure is
expected to be handled by RAID of some kind.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
* Re: [ANNOUNCE] Lustre Lite 1.0 beta 1
From: David Chow @ 2003-03-17 17:13 UTC
To: Andreas Dilger; +Cc: Peter Braam, linux-fsdevel
>
>
>Well, each OST is primarily (from the Lustre sense) the network interface
>protocol, and the internal implementation is opaque to the outside world.
>Each OST is independent of the others, although the clients end up allocating
>files on all of them.
>
>For Linux OSTs we use ext3 on regular block devices (raw disk, MD RAID,
>LOV, whatever you want to use) for the actual data storage, and the
>filesystem is journaled/managed totally independent from all of the
>other OSTs. We have also used reiserfs for OST storage at times (and
>conceivably you could use XFS/JFS), and there are also 3rd party vendors
>who are building OST boxes with their own non-Linux internals.
>
>Since this is just regular disk attached to regular Linux boxes, it is
>also possible to do storage server failover (already being implemented)
>without clients even being aware of a problem. Single disk failure is
>expected to be handled by RAID of some kind.
>
>Cheers, Andreas
>
>
Andreas,
Thanks for your lengthy explanation. The design looks like Coda, with
the OSTs providing the actual data storage. In effect, it is a
stacked file cache, or it stores data persistently in files on
existing file systems. However, how can it handle a disconnected
storage server? This is the most difficult problem for any cluster
file system that supports disconnection. Simply disallowing
disconnection is not an option for a system with thousands of nodes,
since the chance of node failure is very high in those cases. And
because file allocation is still allowed across multiple storage
servers, resolving data conflicts transparently after a disconnection
seems impossible! I would really like to hear how Lustre handles
this, as it has already been run with 1,000 nodes. When I sat down to
design a distributed file system, this problem made my head spin.
Thanks for any comments, and perhaps you can give me some pointers,
as I am very interested in this topic.
regards,
David Chow
* Re: [ANNOUNCE] Lustre Lite 1.0 beta 1
From: Andreas Dilger @ 2003-03-17 17:46 UTC
To: David Chow; +Cc: Peter Braam, linux-fsdevel
On Mar 18, 2003 01:13 +0800, David Chow wrote:
> >Well, each OST is primarily (from the Lustre sense) the network interface
> >protocol, and the internal implementation is opaque to the outside world.
> >Each OST is independent of the others, although the clients end up allocating
> >files on all of them.
> >
> >For Linux OSTs we use ext3 on regular block devices (raw disk, MD RAID,
> >LOV, whatever you want to use) for the actual data storage, and the
> >filesystem is journaled/managed totally independent from all of the
> >other OSTs. We have also used reiserfs for OST storage at times (and
> >conceivably you could use XFS/JFS), and there are also 3rd party vendors
> >who are building OST boxes with their own non-Linux internals.
> >
> >Since this is just regular disk attached to regular Linux boxes, it is
> >also possible to do storage server failover (already being implemented)
> >without clients even being aware of a problem. Single disk failure is
> >expected to be handled by RAID of some kind.
>
> Thanks for your lengthy explanation. The design looks like Coda, with
> the OSTs providing the actual data storage. In effect, it is a
> stacked file cache, or it stores data persistently in files on
> existing file systems.
Correct, although there is no file "cache" aspect as in
Coda/InterMezzo. The clients are only clients and do not store any
persistent data locally. They can cache data in memory (page cache), and
there is a distributed lock manager to keep this data coherent across the
cluster.
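A crude single-process simulation of that idea follows (made-up code,
nothing like the real lock manager internals): a client may only
cache while it holds a lock, and a conflicting request triggers a
callback that makes the holder flush and drop its lock before the new
one is granted.

/*
 * Invented sketch of lock-based cache coherence, simulated in one
 * process.  Clients cache only while holding a lock; conflicting
 * requests cause a revoke callback, which flushes the cache.
 */
#include <stdio.h>

enum lock_mode { LCK_NONE, LCK_READ, LCK_WRITE };

struct client {
    const char    *name;
    enum lock_mode held;     /* lock this client currently holds */
    int            cached;   /* does it have pages in memory?    */
};

/* Callback the "server" makes to a holder of a conflicting lock. */
static void revoke_lock(struct client *c)
{
    if (c->cached) {
        printf("%s: flushing/invalidating cached pages\n", c->name);
        c->cached = 0;
    }
    printf("%s: dropping %s lock\n", c->name,
           c->held == LCK_WRITE ? "write" : "read");
    c->held = LCK_NONE;
}

/* "Server": grant a lock, revoking conflicting holders first. */
static void grant_lock(struct client *who, enum lock_mode mode,
                       struct client **all, int n)
{
    for (int i = 0; i < n; i++) {
        struct client *c = all[i];
        if (c == who || c->held == LCK_NONE)
            continue;
        /* read/read is compatible; anything involving a write is not */
        if (mode == LCK_WRITE || c->held == LCK_WRITE)
            revoke_lock(c);
    }
    who->held = mode;
    who->cached = 1;                 /* safe to cache under this lock */
    printf("%s: granted %s lock, caching data\n", who->name,
           mode == LCK_WRITE ? "write" : "read");
}

int main(void)
{
    struct client a = { "clientA", LCK_NONE, 0 };
    struct client b = { "clientB", LCK_NONE, 0 };
    struct client *all[] = { &a, &b };

    grant_lock(&a, LCK_READ, all, 2);    /* A reads and caches          */
    grant_lock(&b, LCK_READ, all, 2);    /* B may cache too (shared)    */
    grant_lock(&a, LCK_WRITE, all, 2);   /* forces B to drop its cache  */
    return 0;
}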
> However, how can it handle a disconnected storage server?
There is no concept of a "disconnected storage server" in the sense
that Coda/InterMezzo support disconnected operation. You can have
storage server failover, so the clients still have access to the same
(shared) disk storage, and we also handle storage server failure:
data on that server is inaccessible (-EIO), but you are still able to
read data from the other servers that a file is striped across, and
new files are not striped over the failed server.
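Schematically, the degraded behaviour looks like this invented sketch
(not real code): stripes that live on the dead OST return -EIO,
stripes on healthy OSTs keep working, and new files are allocated
only on the OSTs that are still up.

/*
 * Invented sketch of degraded operation when one OST dies: reads of
 * chunks on the failed OST get -EIO, other chunks still work, and
 * new files skip the failed OST entirely.
 */
#include <stdio.h>
#include <errno.h>

#define NUM_OST 4
static int ost_failed[NUM_OST];          /* 1 = this OST is down */

/* Which OST serves stripe chunk 'chunk' of a file striped over all OSTs. */
static int chunk_to_ost(long chunk) { return (int)(chunk % NUM_OST); }

/* Reading one chunk: -EIO only if its OST is the failed one. */
static int read_chunk(long chunk)
{
    int ost = chunk_to_ost(chunk);
    if (ost_failed[ost])
        return -EIO;
    printf("chunk %ld served by OST%d\n", chunk, ost);
    return 0;
}

/* New files: build a stripe list that skips failed OSTs. */
static int pick_stripes(int out[], int max)
{
    int n = 0;
    for (int i = 0; i < NUM_OST && n < max; i++)
        if (!ost_failed[i])
            out[n++] = i;
    return n;                            /* stripe count actually used */
}

int main(void)
{
    ost_failed[2] = 1;                   /* pretend OST2 just died */

    for (long c = 0; c < 4; c++)
        if (read_chunk(c) == -EIO)
            printf("chunk %ld lives on failed OST%d: -EIO\n",
                   c, chunk_to_ost(c));

    int stripes[NUM_OST];
    int n = pick_stripes(stripes, NUM_OST);
    printf("new file striped over %d OSTs:", n);
    for (int i = 0; i < n; i++)
        printf(" OST%d", stripes[i]);
    printf("\n");
    return 0;
}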
> This is the most difficult problem for any cluster file system that
> supports disconnection. Simply disallowing disconnection is not an
> option for a system with thousands of nodes, since the chance of node
> failure is very high in those cases.
Since clients don't have any persistent state (this is purely a
client/server architecture, not a peer-to-peer relationship), we don't
care about client failure. The servers will notice that a client has
failed or is not responding in a timely manner, revoke all of the
locks that client is holding, and boot it out of the cluster. If it
tries to connect back, it will do so as a "new" client, and any
existing open files will get -EIO if they are accessed.
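In outline (again a made-up sketch, not the actual server code): the
server remembers when it last heard from each client, evicts clients
that go silent, and bumps a generation number on reconnect, so
requests on handles from before the eviction fail with -EIO.

/*
 * Invented sketch of client eviction.  Clients hold no persistent
 * state; a silent client loses its locks and is evicted, and a later
 * reconnect is a new generation, so old handles return -EIO.
 */
#include <stdio.h>
#include <errno.h>

#define EVICT_TIMEOUT 100           /* "seconds" without a ping */

struct client_export {
    const char *name;
    long last_heard;                /* time of last message from client */
    int  generation;                /* bumped on every (re)connect      */
    int  locks_held;
};

/* Periodic server scan: evict clients that have gone silent. */
static void evict_stale(struct client_export *c, long now)
{
    if (now - c->last_heard <= EVICT_TIMEOUT)
        return;
    printf("%s silent for %lds: revoking %d locks, evicting\n",
           c->name, now - c->last_heard, c->locks_held);
    c->locks_held = 0;
    c->generation++;                /* old handles are now stale */
}

/* A request carries the generation it was opened under. */
static int handle_request(struct client_export *c, int req_generation)
{
    if (req_generation != c->generation)
        return -EIO;                /* open file from before the eviction */
    return 0;
}

int main(void)
{
    struct client_export c = { .name = "client-17", .last_heard = 0,
                               .generation = 1, .locks_held = 3 };
    int my_open_generation = c.generation;   /* client opens a file */

    evict_stale(&c, 500);                    /* client went away */

    /* client reconnects "as new" and then touches the old open file */
    if (handle_request(&c, my_open_generation) == -EIO)
        printf("request on pre-eviction handle: -EIO\n");
    return 0;
}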
> I would really like to hear how Lustre handles this, as it has
> already been run with 1,000 nodes. When I sat down to design a
> distributed file system, this problem made my head spin. Thanks for
> any comments, and perhaps you can give me some pointers, as I am very
> interested in this topic.
I would really suggest that you read the Lustre book, which is available
online at the lustre.org web site, or as part of the CVS repository.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
* Re: [ANNOUNCE] Lustre Lite 1.0 beta 1
From: Peter Braam @ 2003-03-17 17:47 UTC
To: David Chow; +Cc: Andreas Dilger, linux-fsdevel
> Andreas,
>
> Thanks for your lengthy explanation. The design looks like Coda, with
> the OSTs providing the actual data storage. In effect, it is a
> stacked file cache, or it stores data persistently in files on
> existing file systems. However, how can it handle a disconnected
> storage server? This is the most difficult problem for any cluster
> file system that supports disconnection. Simply disallowing
> disconnection is not an option for a system with thousands of nodes,
> since the chance of node failure is very high in those cases. And
> because file allocation is still allowed across multiple storage
> servers, resolving data conflicts transparently after a disconnection
> seems impossible! I would really like to hear how Lustre handles
> this, as it has already been run with 1,000 nodes. When I sat down to
> design a distributed file system, this problem made my head spin.
> Thanks for any comments, and perhaps you can give me some pointers,
> as I am very interested in this topic.
>
> regards,
> David Chow
Hi David,
Lustre recovers from client failures and clients recover from server
failures, but Lustre does not allow disconnected operation.
Disconnected operation refers to the ability to make updates while
the clients are not connected to the servers.
The Lustre architecture does allow for modular extensions that enable
disconnected operation, but no customers have asked for it yet.
We have designed a very simple, automatic algorithm for handling
conflicts arising during disconnected operation for InterMezzo; see
the paper on www.inter-mezzo.org. Again, this could be implemented
for Lustre, but we are waiting for a contract before we do so.
Disconnected operation would involve a client cache, which would be
many times slower than the distributed network infrastructure when
file sizes exceed what can be cached in memory on the client. Such
file sizes are unusual, but quite important for supercomputing and
some industrial applications.
- Peter -