From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andreas Dilger <adilger@clusterfs.com>
Subject: Re: Reiserfs with Samba vs NetApp Filer
Date: Sun, 13 Oct 2002 00:38:09 -0600
Message-ID: <20021013063809.GL3045@clusterfs.com>
References: <Pine.LNX.4.33L2.0210100853240.1670-100000@localhost.localdomain> <200210121052.22603.bofh@coker.com.au> <20021012150028.G14731@vestdata.no> <200210121600.39712.bofh@coker.com.au> <3DA89A37.2070801@namesys.com> <20021012222950.GK3045@clusterfs.com> <3DA8CAF5.7050203@namesys.com>
Mime-Version: 1.0
Return-path: <reiserfs-list-return-11672-reiserfs=m.gmane.org@namesys.com>
list-help: <mailto:reiserfs-list-help@namesys.com>
list-unsubscribe: <mailto:reiserfs-list-unsubscribe@namesys.com>
list-post: <mailto:reiserfs-list@namesys.com>
Errors-To: flx@namesys.com
Content-Disposition: inline
In-Reply-To: <3DA8CAF5.7050203@namesys.com>
List-Id: <reiserfs-devel.vger.kernel.org>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Hans Reiser <reiser@namesys.com>
Cc: Russell Coker <bofh@coker.com.au>, Ragnar ? <reiserfs@ragnark.vestdata.no>, reiserfs-list@namesys.com

On Oct 13, 2002  05:23 +0400, Hans Reiser wrote:
> Andreas Dilger wrote:
> >On Oct 13, 2002  01:55 +0400, Hans Reiser wrote:
> >>Someday not too long from now, it will look like one filesystem even 
> >>though it is in multiple cases.  Whether that is in reiser5 or reiser6 
> >>depends on what sponsors fund first.
> >
> >you should take a look at Lustre - www.lustre.org.  We are basically
> >already developing what you are suggesting - a distributed filesystem
> >which is built atop two or more local filesystems.  The aggregate
> >throughput of N lustre storage servers is basically N times the
> >throughput of a single server (clients communicate directly with the
> >storage targets, so the cross-sectional bandwidth is perfectly
> >scalable on a switched network).
> >
> >Like Intermezzo, Lustre can be stacked on top of journaling local
> >filesystems, so it would be possible to use reiserfs for both the
> >metadata and storage targets.
> >
> >We are deploying on a 1000-node cluster early next year, and expect
> >total throughput around 4GB/s (we have already made a limited test
> >at 1.4GB/s) with 90TB of storage - on a 2.4 kernel.  Because we are
> >using multiple separate filesystems, we are not hampered by the
> >2TB block device limit, and we get all sorts of parallelisms that
> >are not possible with a single large server.
>
> I really don't understand what is the advantage of object based disk 
> storage.  It seems like its main effect is to prevent people from coming 
> up with optimizations the drive manufacturer did not think of.  I don't 
> at all understand these supposed metadata advantages.  We are lucky that 
> we don't have in disk drives the sort of innovation inhibiting 
> separation of compilers and CPUs that our compatriots in the language 
> design business suffer from.  The more smarts that go into the drive, 
> the more our field will ossify, unless they work closely with FS authors.

While it is _possible_ that you have object based storage on a drive,
the reality is that the object storage targets (OSTs henceforth) are
in practice really large storage systems, like an IBM Shark, or a Linux
box with a few TB of disk and RAID and LVM, and most importantly have a
regular filesystem like ext3 or reiserfs on top of all that storage to
do all of the real storage management.

The benefit of an object based network protocol like Lustre is that the
client is free from all of the details of file and block allocation, and
the OST filesystem can do all of this.  Since each OST has an independent
filesystem, it can handle all of the locking/threading for block and
inode allocation locally.  It can also do this in any way it sees fit,
so it actually allows for MORE innovation at the OST filesystem level
than other distributed filesystems.
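
To make that concrete, here is a rough Python sketch of the idea. It is
not the Lustre protocol or API; ObjectTarget, ExtentTarget, BlockTarget
and client_roundtrip are names made up for illustration. The client only
names an object and a byte range, and each target is free to lay the
data out however it likes:

```python
from abc import ABC, abstractmethod


class ObjectTarget(ABC):
    """Hypothetical OST interface: clients name an object and a byte range only."""

    @abstractmethod
    def write(self, obj_id, offset, data):
        ...

    @abstractmethod
    def read(self, obj_id, offset, length):
        ...


class ExtentTarget(ObjectTarget):
    """Stores each object as one contiguous, growable buffer."""

    def __init__(self):
        self._objects = {}  # obj_id -> bytearray

    def write(self, obj_id, offset, data):
        buf = self._objects.setdefault(obj_id, bytearray())
        end = offset + len(data)
        if len(buf) < end:
            buf.extend(b"\0" * (end - len(buf)))  # zero-fill any hole
        buf[offset:end] = data

    def read(self, obj_id, offset, length):
        buf = self._objects.get(obj_id, bytearray())
        return bytes(buf[offset:offset + length])


class BlockTarget(ObjectTarget):
    """Stores each object as small fixed-size blocks allocated on demand."""

    BLOCK = 4  # deliberately tiny so the example spans several blocks

    def __init__(self):
        self._blocks = {}  # (obj_id, block index) -> bytearray

    def write(self, obj_id, offset, data):
        for i, byte in enumerate(data):
            pos = offset + i
            key = (obj_id, pos // self.BLOCK)
            block = self._blocks.setdefault(key, bytearray(self.BLOCK))
            block[pos % self.BLOCK] = byte

    def read(self, obj_id, offset, length):
        out = bytearray()
        for pos in range(offset, offset + length):
            block = self._blocks.get((obj_id, pos // self.BLOCK))
            out.append(block[pos % self.BLOCK] if block else 0)
        return bytes(out)


def client_roundtrip(target):
    """Client code: identical for every target, no allocation knowledge."""
    target.write(obj_id=7, offset=2, data=b"hello")
    return target.read(obj_id=7, offset=0, length=7)
```

Both targets return the same bytes to client_roundtrip() even though
one keeps a single extent and the other allocates blocks, which is the
sense in which the allocation policy stays private to the OST.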

The Lustre network protocol could be considered akin to a network
version of the Linux VFS - the Lustre client (like a Linux process)
is doing I/O on a file, but the Lustre OST (like a Linux filesystem)
is free to implement the details of storing data within that file as
it sees fit.  Similarly, the metadata server (MDS) is free to store
filenames, EA data, etc. however it wants.

Lustre, like the VFS, needs locking to ensure multiple processes do
not do conflicting things.  The Lustre locking code actually is only
doing per-node locking, and trusts the Linux VFS to do the right thing
internally, so we leverage as much of Al Viro's work in this complex
area as we possibly can ;-).
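
As a toy illustration of what "per-node" granularity means (this is
not the Lustre locking code; NodeLockTable and its methods are
hypothetical names, and a single in-process table stands in for
whatever the real servers do), a lock is owned by a client node rather
than by a process on that node:

```python
import threading


class NodeLockTable:
    """Hypothetical lock table: tracks which client node holds each resource."""

    def __init__(self):
        self._owner = {}                # resource name -> node id
        self._guard = threading.Lock()  # protects the table itself

    def acquire(self, node, resource):
        """Grant the lock unless a different node already holds it."""
        with self._guard:
            holder = self._owner.setdefault(resource, node)
            return holder == node

    def release(self, node, resource):
        """Drop the lock only if this node is the current holder."""
        with self._guard:
            if self._owner.get(resource) == node:
                del self._owner[resource]
                return True
            return False
```

A second acquire() from the same node succeeds because, in this sketch,
sorting out the processes within one node is assumed to be the job of
the local VFS; only requests from a different node are refused.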

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/