From: Andreas Dilger <adilger@clusterfs.com>
To: David Chow <davidchow@shaolinmicro.com>
Cc: Peter Braam <braam@clusterfs.com>, linux-fsdevel@vger.kernel.org
Subject: Re: [ANNOUNCE] Lustre Lite 1.0 beta 1
Date: Sun, 16 Mar 2003 01:38:58 -0700 [thread overview]
Message-ID: <20030316013858.C12806@schatzie.adilger.int> (raw)
In-Reply-To: <3E740DE1.6010204@shaolinmicro.com>; from davidchow@shaolinmicro.com on Sun, Mar 16, 2003 at 01:38:41PM +0800
On Mar 16, 2003 13:38 +0800, David Chow wrote:
> Peter Braam wrote:
> >Features
> >--------
> >Lustre Lite 0.6:
> >
> >- has been tested extensively on ia32 and ia64 Linux platforms
> >- supports TCP/IP and Quadrics Elan3 interconnects
> >- supports multiple Object Storage Targets (OSTs) for file data storage
> >- supports multiple Metadata Servers (MDSs) in an active/passive
> > failover configuration (requires shared storage between MDS nodes).
> > Automatic failover requires an external failover package such as
> > Red Hat's clumanager.
> >- provides a nearly POSIX-compliant filesystem interface (some areas
> > remain non-compliant; for example, we do not synchronize atimes)
> >- aims to recover from any single failure without loss of data or
> > application errors
> >- scales well; we have tested with as many as 1,100 clients and 128 OSTs
> >- is Free Software, released under the terms and conditions of the GNU
> > General Public License
>
> Quite interesting. How is it different from gfs and other cluster file
> systems?
One primary difference between Lustre and other cluster file systems is
that Lustre is designed with enormous scalability in mind.  Even in this
first "Lite" version it can handle over 1000 clients, multi-GB/s aggregate
IO rates, and many tens or hundreds of TB of storage.  Future plans scale
up to tens of thousands of clients and petabytes of storage.
> Is it a SAN/shared-SCSI disk-sharing file system, or some cluster file
> system which rides on something like a gnbd/nbd/iSCSI pool?
Not really either of those.  Primarily it is a distributed file system
built from multiple servers: (currently) one server doing all of the
metadata work, and a configurable number (from one up to hundreds) of
storage servers (OSTs).  The servers are independent (generally Linux)
boxes with attached disks.  There is the possibility of a "SAN"
interface from the clients to the storage servers, but this isn't the
common usage scenario.
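
To make the split concrete, here is a rough sketch of who a client talks
to for what.  None of these names are real Lustre interfaces; they are
made up purely for illustration:

  /* Illustration only: metadata operations go to the single (failover)
   * MDS, bulk file data goes directly to the OSTs.  Not Lustre code. */
  #include <stdio.h>

  struct mds { const char *name; };   /* one active metadata server   */
  struct ost { const char *name; };   /* many object storage targets  */

  /* namespace work (lookup, create, unlink, ...) -> MDS */
  static void mds_create(struct mds *m, const char *path)
  {
          printf("MDS %s: create %s, choose OSTs for its objects\n",
                 m->name, path);
  }

  /* bulk reads/writes -> OSTs, directly from the client */
  static void ost_write(struct ost *o, long count, long offset)
  {
          printf("OST %s: write %ld bytes at object offset %ld\n",
                 o->name, count, offset);
  }

  int main(void)
  {
          struct mds mds1  = { "mds1" };
          struct ost ost[] = { { "ost1" }, { "ost2" }, { "ost3" } };

          mds_create(&mds1, "/scratch/result.dat"); /* metadata path  */
          ost_write(&ost[0], 1048576, 0);           /* data path,     */
          ost_write(&ost[1], 1048576, 0);           /* striped over   */
          ost_write(&ost[2], 1048576, 0);           /* several OSTs   */
          return 0;
  }
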
> It doesn't seem sensible to run a cluster file system's data channels
> over TCP/IP, as it is too slow.  The price of fibre-channel storage
> keeps dropping, and sites that deploy cluster file systems in a
> production environment can usually afford fibre-channel storage.
Well, TCP/IP is one of the network types we support (via Portals Network
Abstraction Layers, NALs), mostly for development/testing, along with
Quadrics Elan (our large clusters use these, and they are VERY fast) and
an in-development Myrinet NAL.
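
The point of a NAL is that the upper layers call through one small
interface and never care which transport sits underneath.  The sketch
below is NOT the real Portals NAL API, just the general shape of the
idea:

  /* Toy network-abstraction layer: a table of function pointers that
   * the rest of the stack calls through.  Purely illustrative. */
  #include <stdio.h>
  #include <stddef.h>

  struct nal_ops {
          const char *name;
          int (*send)(const void *buf, size_t len);
  };

  static int tcp_send(const void *buf, size_t len)
  {
          (void)buf;
          printf("tcp:  sending %zu bytes over a socket\n", len);
          return 0;
  }

  static int elan_send(const void *buf, size_t len)
  {
          (void)buf;
          printf("elan: sending %zu bytes via RDMA\n", len);
          return 0;
  }

  static const struct nal_ops nals[] = {
          { "tcp",  tcp_send  },
          { "elan", elan_send },
  };

  int main(void)
  {
          const char msg[] = "hello OST";
          /* the transport is chosen at configuration time; everything
           * above this point uses the same calls regardless */
          const struct nal_ops *nal = &nals[0];
          return nal->send(msg, sizeof(msg));
  }
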
The reasons that FC isn't our primary interconnect are:
1) FC is still more expensive than GigE (although less expensive than Elan)
2) you can't scale FC to thousands of nodes
3) more people have a TCP interconnect than anything else (even if it is
   TCP-over-FC)
4) none of our customers has actually asked for FC yet
> The name "OST" seems interesting but I don't really like the iSCSI or pool
> of block devices spreading across multiple machines as it is very hard to
> manage and likely to fail easily. How is lustre handling these senarios?
Well, each OST is primarily (in the Lustre sense) the network interface
protocol, and the internal implementation is opaque to the outside world.
Each OST is independent of the others, although the clients end up
allocating files across all of them.
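
To show what "allocating files across all of them" means in practice,
here is a sketch of a simple round-robin stripe mapping from file offset
to (OST index, object offset).  The stripe size and count are made-up
example values, not Lustre defaults:

  /* Illustration of round-robin file striping across OSTs. */
  #include <stdio.h>

  #define STRIPE_SIZE  (1 << 20)  /* 1 MB per stripe (example value) */
  #define STRIPE_COUNT 4          /* file striped over 4 OSTs        */

  static void map_offset(long long off)
  {
          long long stripe_nr  = off / STRIPE_SIZE;
          int       ost_index  = stripe_nr % STRIPE_COUNT;
          long long obj_offset = (stripe_nr / STRIPE_COUNT) * STRIPE_SIZE
                                 + off % STRIPE_SIZE;

          printf("file offset %lld -> OST %d, object offset %lld\n",
                 off, ost_index, obj_offset);
  }

  int main(void)
  {
          map_offset(0);                        /* OST 0, offset 0    */
          map_offset(3LL * STRIPE_SIZE + 4096); /* OST 3, offset 4096 */
          map_offset(5LL * STRIPE_SIZE);        /* wraps to OST 1     */
          return 0;
  }

With a scheme like this, a large sequential read or write naturally
spreads its bandwidth over every OST in the file's layout.
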
For Linux OSTs we use ext3 on regular block devices (raw disk, MD RAID,
LVM, whatever you want to use) for the actual data storage, and the
filesystem is journaled/managed totally independently of all of the
other OSTs.  We have also used reiserfs for OST storage at times (and
conceivably you could use XFS/JFS), and there are also third-party vendors
who are building OST boxes with their own non-Linux internals.
Since this is just regular disk attached to regular Linux boxes, it is
also possible to do storage-server failover (already being implemented)
without the clients even being aware of a problem.  Single-disk failures
are expected to be handled by RAID of some kind.
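
From the client's point of view that failover looks roughly like the toy
sketch below (purely illustrative; the names and retry logic are made up,
this is not the actual recovery protocol):

  /* Toy client-side view of OST failover: if the primary node for a
   * service is unreachable, resend the same request to its partner. */
  #include <stdio.h>

  struct server { const char *addr; int up; };

  static int send_request(const struct server *s, const char *req)
  {
          if (!s->up)
                  return -1;      /* timeout / connection refused */
          printf("%s handled \"%s\"\n", s->addr, req);
          return 0;
  }

  int main(void)
  {
          struct server primary = { "ost1a", 0 }; /* pretend it died  */
          struct server backup  = { "ost1b", 1 }; /* shares the disk  */

          if (send_request(&primary, "write object 42") != 0)
                  send_request(&backup, "write object 42");
          /* the application above never sees an error */
          return 0;
  }
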
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/