From mboxrd@z Thu Jan 1 00:00:00 1970
From: Hans Reiser
Subject: Re: Reiserfs with Samba vs NetApp Filer
Date: Sun, 13 Oct 2002 17:48:03 +0400
Message-ID: <3DA97993.4090403@namesys.com>
References: <200210121052.22603.bofh@coker.com.au> <20021012150028.G14731@vestdata.no> <200210121600.39712.bofh@coker.com.au> <3DA89A37.2070801@namesys.com> <20021012222950.GK3045@clusterfs.com> <3DA8CAF5.7050203@namesys.com> <20021013063809.GL3045@clusterfs.com>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Errors-To: flx@namesys.com
Content-Type: text/plain; charset="us-ascii"; format="flowed"
To: Andreas Dilger
Cc: Russell Coker, Ragnar ?, reiserfs-list@namesys.com

Andreas Dilger wrote:

>On Oct 13, 2002 05:23 +0400, Hans Reiser wrote:
>
>>Andreas Dilger wrote:
>>
>>>On Oct 13, 2002 01:55 +0400, Hans Reiser wrote:
>>>
>>>>Someday not too long from now, it will look like one filesystem even
>>>>though it is in multiple cases. Whether that is in reiser5 or reiser6
>>>>depends on what sponsors fund first.
>>>>
>>>You should take a look at Lustre - www.lustre.org. We are basically
>>>already developing what you are suggesting - a distributed filesystem
>>>which is built atop two or more local filesystems. The aggregate
>>>throughput of N lustre storage servers is basically N times the
>>>throughput of a single server (clients communicate directly with the
>>>storage targets, so the cross-sectional bandwidth is perfectly
>>>scalable on a switched network).
>>>
>>>Like Intermezzo, Lustre can be stacked on top of journaling local
>>>filesystems, so it would be possible to use reiserfs for both the
>>>metadata and storage targets.
>>>
>>>We are deploying on a 1000-node cluster early next year, and expect
>>>total throughput around 4GB/s (we have already made a limited test
>>>at 1.4GB/s) with 90TB of storage - on a 2.4 kernel.
>>>Because we are using multiple separate filesystems, we are not
>>>hampered by the 2TB block device limit, and we get all sorts of
>>>parallelisms that are not possible with a single large server.
>>>
>>I really don't understand what is the advantage of object based disk
>>storage. It seems like its main effect is to prevent people from coming
>>up with optimizations the drive manufacturer did not think of. I don't
>>at all understand these supposed metadata advantages. We are lucky that
>>we don't have in disk drives the sort of innovation inhibiting
>>separation of compilers and CPUs that our compatriots in the language
>>design business suffer from. The more smarts that go into the drive,
>>the more our field will ossify, unless they work closely with FS authors.
>>
>
>While it is _possible_ that you have object based storage on a drive,
>the reality is that the object storage targets (OSTs henceforth) are
>in practise really large storage systems, like an IBM Shark, or a Linux
>box with a few TB of disk and RAID and LVM, and most importantly have a
>regular filesystem like ext3 or reiserfs on top of all that storage to
>do all of the real storage management.
>
>The benefit of an object based network protocol like Lustre is that the
>client is free from all of the details of file and block allocation, and
>the OST filesystem can do all of this. Since each OST has an independent
>filesystem, it can handle all of the locking/threading for block and
>inode allocation locally. It can also do this in any way it sees fit,
>so it actually allows for MORE innovation at the OST filesystem level
>than other distributed filesystems.
>
>The Lustre network protocol could be considered akin to a network
>version of the Linux VFS - the Lustre client (like a Linux process)
>is doing I/O on a file, but the Lustre OST (like a Linux filesystem)
>is free to implement the details of storing data within that file as
>it sees fit.
>Similarly, the metadata server (MDS) is free to store
>filenames, EA data, etc however it wants.
>
>Lustre, like the VFS, needs locking to ensure multiple processes do
>not do conflicting things. The Lustre locking code actually is only
>doing per-node locking, and trusts the Linux VFS to do the right thing
>internally, so we leverage as much of Al Viro's work in this complex
>area as we possibly can ;-).
>
>Cheers, Andreas
>--
>Andreas Dilger
>http://www-mddsp.enel.ucalgary.ca/People/adilger/
>http://sourceforge.net/projects/ext2resize/
>

Ok, this makes some sense. How does Seagate feel about that view, I am
curious?

Hans