Re: [RFC] mount flag "direct" (fwd)

* Re: [RFC] mount flag "direct" (fwd)
@ 2002-09-03 15:39 Peter T. Breuer
  2002-09-03 15:44 ` Rik van Riel
  0 siblings, 1 reply; 28+ messages in thread
From: Peter T. Breuer @ 2002-09-03 15:39 UTC (permalink / raw)
  To: riel; +Cc: linux kernel

Hi!

Thanks for the comment!

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> > Rationale:
> > No caching means that each kernel doesn't go off with its own idea of
> > what is on the disk in a file, at least. Dunno about directories and
> > metadata.

> And what if they both allocate the same disk block to another
> file, simultaneously ?

I see - yes, that's a good one.

I assumed that I would need to make several VFS operations atomic
or revertable, or simply forbid things like new file allocations or
extensions (i.e.  the above), depending on what is possible or not.

This is precisely the kind of objection that I want to hear about.

OK - reply:
It appears that in order to allocate away free space, one must first
"grab" that free space using a shared lock. That's perfectly feasible.

Thank you.

Where could I intercept the block allocation in VFS?

> A mount option isn't enough to achieve your goal.
> 
> It looks like you want GFS or OCFS. Info about GFS can be found at:

No, I don't want ANY FS. Thanks, I know about these, but they're not
it. I want support for /any/ FS at all at the VFS level.

>	http://www.opengfs.org/
>	http://www.sistina.com/  (commercial GFS)

> Dunno where Oracle's cluster fs is documented.

I know about that too, but no, I do not want ANY FS, I want /any/ FS.
:-)

Peter


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 15:39 [RFC] mount flag "direct" (fwd) Peter T. Breuer
@ 2002-09-03 15:44 ` Rik van Riel
  2002-09-03 15:50   ` Peter T. Breuer
  0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2002-09-03 15:44 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> I assumed that I would need to make several VFS operations atomic
> or revertable, or simply forbid things like new file allocations or
> extensions (i.e.  the above), depending on what is possible or not.

> No, I don't want ANY FS. Thanks, I know about these, but they're not
> it. I want support for /any/ FS at all at the VFS level.

You can't.  Even if each operation is fully atomic on one node,
you still don't have synchronisation between the different nodes
sharing one disk.

You really need filesystem support.

Rik
-- 
	http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid.  Go buy yourself a real t-shirt"

http://www.surriel.com/		http://distro.conectiva.com/



* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 15:44 ` Rik van Riel
@ 2002-09-03 15:50   ` Peter T. Breuer
  2002-09-03 15:56     ` Chris Wedgwood
                       ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Peter T. Breuer @ 2002-09-03 15:50 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Peter T. Breuer, linux kernel

"A month of sundays ago Rik van Riel wrote:"
> On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> 
> > I assumed that I would need to make several VFS operations atomic
> > or revertable, or simply forbid things like new file allocations or
> > extensions (i.e.  the above), depending on what is possible or not.
> 
> > No, I don't want ANY FS. Thanks, I know about these, but they're not
> > it. I want support for /any/ FS at all at the VFS level.
> 
> You can't.  Even if each operation is fully atomic on one node,
> you still don't have synchronisation between the different nodes
> sharing one disk.

Yes, I do have synchronization - locks are/can be shared between both
kernels using a device driver mechanism that I implemented. That is
to say, I can guarantee that atomic operations by each kernel do not
overlap "on the device", and remain locally ordered at least (and
hopefully globally, if I get the time thing right).

It's not that hard - the locks are held on the remote disk by a
"guardian" driver, to which the drivers on both of the kernels
communicate.  A fake "scsi adapter", if you prefer.

> You really need filesystem support.

I don't think so. I think you're not convinced either! But
I would really like it if you could put your finger on an
overriding objection.

Peter


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 15:50   ` Peter T. Breuer
@ 2002-09-03 15:56     ` Chris Wedgwood
  2002-09-03 15:59       ` Peter T. Breuer
  2002-09-03 16:09     ` Richard B. Johnson
  2002-09-03 16:58     ` Anton Altaparmakov
  2 siblings, 1 reply; 28+ messages in thread
From: Chris Wedgwood @ 2002-09-03 15:56 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: Rik van Riel, linux kernel

On Tue, Sep 03, 2002 at 05:50:42PM +0200, Peter T. Breuer wrote:

    Yes, I do have synchronization - locks are/can be shared between both
    kernels using a device driver mechanism that I implemented.

What happens if one of the kernels/nodes dies?


  --cw


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 15:56     ` Chris Wedgwood
@ 2002-09-03 15:59       ` Peter T. Breuer
  0 siblings, 0 replies; 28+ messages in thread
From: Peter T. Breuer @ 2002-09-03 15:59 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Peter T. Breuer, Rik van Riel, linux kernel

"A month of sundays ago Chris Wedgwood wrote:"
> On Tue, Sep 03, 2002 at 05:50:42PM +0200, Peter T. Breuer wrote:
> 
>     Yes, I do have synchronization - locks are/can be shared between both
>     kernels using a device driver mechanism that I implemented.
> 
> What happens if one of the kernels/nodes dies?

With the lock held, you mean? Depends on policy. There are two
implemented at present:

   a) show all errors
   b) hide all errors

In case b) the lock will continue to be held until the other
node comes back up. In case a) the lock will be abandoned after
timeout, and pending requests will be errored.
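
The two policies can be sketched roughly as follows. This is only an illustrative model, not the actual driver: the class name `GuardianLock`, the policy strings and the 30-second timeout are all assumptions, and the erroring of pending requests under policy a) is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GuardianLock:
    """Hypothetical model of one lock held by the guardian driver."""
    policy: str                    # "show_errors" (a) or "hide_errors" (b)
    timeout: float = 30.0          # seconds; only used by "show_errors"
    holder: Optional[str] = None   # node currently holding the lock
    acquired_at: float = 0.0

    def _expire(self, now: float) -> None:
        # Policy a): abandon a stale lock once the timeout has passed.
        # Policy b): never expire; wait for the holder to come back.
        if (self.policy == "show_errors" and self.holder is not None
                and now - self.acquired_at >= self.timeout):
            self.holder = None

    def acquire(self, node: str, now: float) -> bool:
        self._expire(now)
        if self.holder is None:
            self.holder, self.acquired_at = node, now
            return True
        return self.holder == node

    def release(self, node: str) -> None:
        if self.holder == node:
            self.holder = None


lock_a = GuardianLock(policy="show_errors", timeout=30.0)
lock_a.acquire("node1", now=0.0)           # node1 takes the lock, then dies
print(lock_a.acquire("node2", now=10.0))   # False: still inside the timeout
print(lock_a.acquire("node2", now=31.0))   # True: lock abandoned after timeout

lock_b = GuardianLock(policy="hide_errors")
lock_b.acquire("node1", now=0.0)           # node1 takes the lock, then dies
print(lock_b.acquire("node2", now=1e6))    # False: held until node1 returns
```

Under policy b) the surviving node stays blocked for as long as the dead node is away, which is the cost of hiding the error.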

I'll explore the ramifications later.

Peter


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 15:50   ` Peter T. Breuer
  2002-09-03 15:56     ` Chris Wedgwood
@ 2002-09-03 16:09     ` Richard B. Johnson
  2002-09-03 16:29       ` Peter T. Breuer
  2002-09-03 16:58     ` Anton Altaparmakov
  2 siblings, 1 reply; 28+ messages in thread
From: Richard B. Johnson @ 2002-09-03 16:09 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: Rik van Riel, linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> "A month of sundays ago Rik van Riel wrote:"
> > On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> > 
> > > I assumed that I would need to make several VFS operations atomic
> > > or revertable, or simply forbid things like new file allocations or
> > > extensions (i.e.  the above), depending on what is possible or not.
> > 
> > > No, I don't want ANY FS. Thanks, I know about these, but they're not
> > > it. I want support for /any/ FS at all at the VFS level.
> > 
> > You can't.  Even if each operation is fully atomic on one node,
> > you still don't have synchronisation between the different nodes
> > sharing one disk.
> 
> Yes, I do have synchronization - locks are/can be shared between both
> kernels using a device driver mechanism that I implemented. That is
> to say, I can guarantee that atomic operations by each kernel do not
> overlap "on the device", and remain locally ordered at least (and
> hopefully globally, if I get the time thing right).
> 
> It's not that hard - the locks are held on the remote disk by a
> "guardian" driver, to which the drivers on both of the kernels
> communicate.  A fake "scsi adapter", if you prefer.
> 
> > You really need filesystem support.
> 
> I don't think so. I think you're not convinced either! But
> I would really like it if you could put your finger on an
> overriding objection.
> 
> Peter

Let's say you have a perfect locking mechanism, a fake SCSI layer
as you state. You are now going to create a new file on the
shared block device. You are careful that you use only space
that you "own", etc., so you perfectly create a new file on
your VFS.

How do the other users of this device "know" that there is
a new file so they can update their notion of the block-device state?

You have created perfect isolation so, by definition, the other
isolated users don't know that you have just used space that they
think that they own.

Now, the notion of a complete 'file-system' for support may not be
required. What you need is like a file-system without all the frills.
It needs to act like a "hard disk malloc" or slab allocator. That way,
you can have independence between the systems that are accessing the
device.

So, if you made this, you are still stuck with the problem of duplicate
file-names, but this could be resolved by using a "librarian" layer
so that a new file-name and its meta-data gets known by all the
users of the device.

FYI, the "librarian" layer is the file-system so, I have shown that
you need file-system support.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 16:09     ` Richard B. Johnson
@ 2002-09-03 16:29       ` Peter T. Breuer
  2002-09-03 16:33         ` Rik van Riel
                           ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Peter T. Breuer @ 2002-09-03 16:29 UTC (permalink / raw)
  To: root; +Cc: Peter T. Breuer, Rik van Riel, linux kernel

"Richard B. Johnson wrote:"
> On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> > It's not that hard - the locks are held on the remote disk by a
> > "guardian" driver, to which the drivers on both of the kernels
> > communicate.  A fake "scsi adapter", if you prefer.
> > 
> > > You really need filesystem support.

> Lets say you have a perfect locking mechanism, a fake SCSI layer

OK.

> as you state. You are now going to create a new file on the
> shared block device. You are careful that you use only space
> that you "own", etc., so you perfectly create a new file on
> your VFS.

OK.

> How does the other user's of this device "know" that there is
> a new file so it can update its notion of the block-device state?

The block device itself is stateless at the block level. Every block
access goes "direct to the metal".

The question is how much FS state is cached on either kernel.
If it is too much, then I will ask how I can cause it to be less, perhaps
by use of a flag that parallels how O_DIRECT works.  I thought that new
files were entries in a directory's inode and I agree that inodes are
held in memory!  But I don't know when they are first read or reread.
The directory entry would certainly have to be reread after a write
operation on disk that touched it - or more simply, the directory entry
would have to be reread every time it were needed, i.e. be uncached.

If that presently is not possible, then I would like to think about
making it possible. Isn't there some kind of inode reading that goes on
at mount? Can I cause it to happen (or unhappen) at will?

> You have created perfect isolation so, by definition, the other
> isolated user's don't know that you have just used space that they
> think that they own.

Well, I don't think that's a fair analogy .. if a "reserve_blocks"
call is added to VFS, then I can use it to prelock the "space that
they think they own", and prevent contention. The question is how
each FS does the block reservation, and why it should not go through
a generic method in the VFS layer.

> Now, the notion of a complete 'file-system' for support may not be
> required. What you need is like a file-system without all the frills.

I think that's the wrong tack, though simply _disabling_ some
operations initially (such as making new files!) may be the way to go.
Just enable more ops as generic support is added.

> FYI, the "librarian" layer is the file-system so, I have shown that
> you need file-system support.

Nice try - your argument reduces to saying that the state of the
directory inodes must be shared. I agree and suggest two remedies:

  1) maintain no directory inode state, but reread them every time
     (how?)
  2) force rereading of a particular inode or all inodes when
     signalled to do so.

I would prefer (1). It seems in the spirit of O_DIRECT. I imagine that
(2) is presently easy to do (but of course horrible).
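
The difference between the two remedies can be sketched in a toy model. Nothing here is kernel code: `Disk`, `UncachedNode` and `SignalledNode` are hypothetical names, and the directory is just a list of file names.

```python
class Disk:
    """Stands in for the shared device; counts directory reads."""

    def __init__(self):
        self.directory = []
        self.reads = 0

    def read_directory(self):
        self.reads += 1
        return list(self.directory)


class UncachedNode:
    """Remedy (1): keep no directory state, reread on every lookup."""

    def __init__(self, disk):
        self.disk = disk

    def listdir(self):
        return self.disk.read_directory()


class SignalledNode:
    """Remedy (2): cache the directory, drop the cache when signalled."""

    def __init__(self, disk):
        self.disk = disk
        self.cache = None

    def invalidate(self):
        self.cache = None

    def listdir(self):
        if self.cache is None:
            self.cache = self.disk.read_directory()
        return list(self.cache)


disk = Disk()
disk.directory.append("a")
n1, n2 = UncachedNode(disk), SignalledNode(disk)
n2.listdir()                 # n2 now caches ["a"]
disk.directory.append("b")   # another kernel creates a file
print(n1.listdir())          # ['a', 'b']: always fresh, one read per call
print(n2.listdir())          # ['a']: stale until someone signals
n2.invalidate()
print(n2.listdir())          # ['a', 'b']
```

Remedy (1) pays one disk read per lookup; remedy (2) is cheap but is only correct if every change is reliably signalled to every node.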

Peter


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 16:29       ` Peter T. Breuer
@ 2002-09-03 16:33         ` Rik van Riel
  2002-09-03 17:32         ` Richard B. Johnson
  2002-09-03 18:53         ` Lars Marowsky-Bree
  2 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2002-09-03 16:33 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: root, linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> > How does the other user's of this device "know" that there is
> > a new file so it can update its notion of the block-device state?
>
> The block device itself is stateless at the block level. Every block
> access goes "direct to the metal".
>
> The question is how much FS state is cached on either kernel.
> If it is too much, then I will ask how I can cause to be less, perhaps
> by use of a flag that parallels how O_DIRECT works.  I thought that new
> files were entries in a directories inode and I agree that inodes are
> held in memory!  But I don't know when they are first read or reread.

And neither can you know.  After all, this is filesystem dependent.

You cannot decide whether filesystem-independent clustering is
possible unless you know that all the filesystems play by your
rules.  So much for filesystem-independence.

regards,

Rik
-- 
	http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid.  Go buy yourself a real t-shirt"

http://www.surriel.com/		http://distro.conectiva.com/



* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 15:50   ` Peter T. Breuer
  2002-09-03 15:56     ` Chris Wedgwood
  2002-09-03 16:09     ` Richard B. Johnson
@ 2002-09-03 16:58     ` Anton Altaparmakov
  2002-09-03 17:26       ` Peter T. Breuer
  2 siblings, 1 reply; 28+ messages in thread
From: Anton Altaparmakov @ 2002-09-03 16:58 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: Rik van Riel, linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> "A month of sundays ago Rik van Riel wrote:"
> > On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> > 
> > > I assumed that I would need to make several VFS operations atomic
> > > or revertable, or simply forbid things like new file allocations or
> > > extensions (i.e.  the above), depending on what is possible or not.
> > 
> > > No, I don't want ANY FS. Thanks, I know about these, but they're not
> > > it. I want support for /any/ FS at all at the VFS level.
> > 
> > You can't.  Even if each operation is fully atomic on one node,
> > you still don't have synchronisation between the different nodes
> > sharing one disk.
> 
> Yes, I do have synchronization - locks are/can be shared between both
> kernels using a device driver mechanism that I implemented. That is
> to say, I can guarantee that atomic operations by each kernel do not
> overlap "on the device", and remain locally ordered at least (and
> hopefully globally, if I get the time thing right).
> 
> It's not that hard - the locks are held on the remote disk by a
> "guardian" driver, to which the drivers on both of the kernels
> communicate.  A fake "scsi adapter", if you prefer.

You have synchronisation at block layer level which is completely
insufficient.

> > You really need filesystem support.
> 
> I don't think so. I think you're not convinced either! But
> I would really like it if you could put your finger on an
> overriding objection.

You think wrong... (-;

I will give you a few examples of why you are wrong:

1) Neither the block layer nor the VFS have anything to do with block
allocations and hence you cannot solve this problem at VFS nor block layer
level. The only thing the VFS does is tell the file system driver "write X
number of bytes to the file F at offset Y". Nothing more than that! The
file system then goes off and allocates blocks in its own disk block
bitmap and then writes the data. The only locking used is file system
specific. For example NTFS has a per mounted volume rw_semaphore to
synchronize accesses to the disk block bitmap. But other file systems most
certainly implement this differently...

2) Some file systems cache the metadata. For example in NTFS the
disk block bitmap is stored inside a normal file called $Bitmap. Thus NTFS
uses the page cache to access the block bitmap and this means that when
new blocks are allocated, we take the volume specific rw_semaphore and
then we search the page cache of $Bitmap for zero bits, set the
required number of bits to one, and then we drop the rw_semaphore and
return which blocks were allocated to the calling ntfs function.

Even if you modified the ntfs driver so that the two hosts accessing the
same device would share the same rw_semaphore, it still wouldn't work,
because there is no synchronisation between the copies of the disk block
bitmap on the two hosts. When one has gone through the above procedure and
has dropped the lock, the allocated clusters are held in memory only, thus
the other host doesn't see that some blocks have been allocated and goes
off and allocates the same blocks to a different file as Rik and I
described already.
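
The failure described above can be reproduced in a small simulation. This is a sketch under assumptions, not ntfs code: `SharedDisk` and `Host` are hypothetical names, a Python `threading.Lock` stands in for the shared rw_semaphore, and each host's in-memory bitmap stands in for its page cache copy of $Bitmap.

```python
import threading


class SharedDisk:
    """The on-disk allocation bitmap: 0 = free block, 1 = allocated."""

    def __init__(self, nblocks):
        self.bitmap = [0] * nblocks


class Host:
    def __init__(self, name, disk, shared_lock):
        self.name = name
        self.disk = disk
        self.lock = shared_lock
        # Each host reads the bitmap once and then works on its own copy.
        self.cached_bitmap = list(disk.bitmap)

    def allocate(self):
        with self.lock:                       # locking itself works fine
            block = self.cached_bitmap.index(0)
            self.cached_bitmap[block] = 1     # updated in memory only
            return block


lock = threading.Lock()
disk = SharedDisk(8)
host_a = Host("A", disk, lock)
host_b = Host("B", disk, lock)

block_a = host_a.allocate()
block_b = host_b.allocate()
print(block_a, block_b)   # both hosts pick block 0
```

The lock serialises the two allocations correctly, yet both hosts hand out the same block, because neither ever sees the other's cached update.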

And this is just the tip of the iceberg. The only way you could get
something like this to work is by modifying each and every file system
driver to use some VFS provided mechanism for all (de-)allocations, both
disk block, and inode ones. Further you would need to provide shared
memory, i.e. the two hosts need to share the same page cache / address
space mappings. So basically, it can only work if the two hosts are
virtually the same host, i.e. if the two hosts are part of a Single System
Image Cluster...

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 16:58     ` Anton Altaparmakov
@ 2002-09-03 17:26       ` Peter T. Breuer
  0 siblings, 0 replies; 28+ messages in thread
From: Peter T. Breuer @ 2002-09-03 17:26 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Peter T. Breuer, Rik van Riel, linux kernel

"A month of sundays ago Anton Altaparmakov wrote:"
> > It's not that hard - the locks are held on the remote disk by a
> > "guardian" driver, to which the drivers on both of the kernels
> > communicate.  A fake "scsi adapter", if you prefer.
> 
> You have synchronisation at block layer level which is completely
> insufficient.

No, I have synchronization whenever one cares to ask for it (the level
is purely notional), but I suggest that one adds a "tag" request type
to the block layers in order that one may ask for a lock at VFS level
by issuing a "tag block request", which does nothing except stop
anybody else from processing the named notional resource until the
corresponding "untag block request" is issued.
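
A rough sketch of what such tag/untag requests might look like from the guardian's side. The `Guardian` class, its method names and the resource names are hypothetical; the real mechanism would be a block request type, not a method call.

```python
class Guardian:
    """Hypothetical model of the driver guarding the shared disk."""

    def __init__(self):
        self.tags = {}   # notional resource name -> node holding the tag

    def tag(self, resource, node):
        """'tag block request': claim the named resource if it is free."""
        holder = self.tags.setdefault(resource, node)
        return holder == node

    def untag(self, resource, node):
        """'untag block request': only the holder may release the tag."""
        if self.tags.get(resource) == node:
            del self.tags[resource]
            return True
        return False


g = Guardian()
print(g.tag("block-bitmap", "A"))    # True: A now holds the resource
print(g.tag("block-bitmap", "B"))    # False: B must wait
print(g.tag("inode-table", "B"))     # True: a different notional resource
print(g.untag("block-bitmap", "B"))  # False: B does not hold it
print(g.untag("block-bitmap", "A"))  # True
print(g.tag("block-bitmap", "B"))    # True: free again
```

Because the lock table lives in one place, no agreement protocol between the kernels is needed; the open question in the thread is what each kernel has cached while it waits.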

> 1) Neither the block layer nor the VFS have anything to do with block
> allocations and hence you cannot solve this problem at VFS nor block layer

That's OK. We've already agreed that the fs's need to reserve blocks
before they make an allocation, and that they need to do that by calling
up to VFS to reserve it, and that VFS ought to call back down to let
them reserve it the way they like, but take the opportunity to notice
the reserve call.

> level. The only thing the VFS does is tell the file system driver "write X
> number of bytes to the file F at offset Y". Nothing more than that! The
> file system then goes off and allocates blocks in its own disk block

Well, it needs to be altered to call back up first, telling the VFS not
to allow any allocations for a moment (that's a lock), and then the
VFS calls back down and finds out what it feels like reserving, and
now we get to the tricky bit, because each kernel has its own bitmap
... well you tell me. I can see several generic implementations:

   1) the bitmap is required to be held on disk by a FS and to be reread
   each time any kernel wants to make a new file allocation (that's not
   so expensive - new files are generally rare and we don't care).

   2) the VFS holds the bitmap and we add ops to read and write the
   bitmap in VFS, and intercept those calls and share them (somehow -
   details to be arranged).

   3) .. any or all of this behavior be forced by a METADIRECT
   flag that forbids metadata to be cached in memory without being
   synced to disk.
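
Option 1) can be sketched as follows, assuming the bitmap is reread from disk and written back while the shared lock is held. The names `SharedDisk` and `Host` are hypothetical, and a Python `threading.Lock` stands in for the lock held by the guardian.

```python
import threading


class SharedDisk:
    """The on-disk allocation bitmap: 0 = free block, 1 = allocated."""

    def __init__(self, nblocks):
        self.bitmap = [0] * nblocks


class Host:
    def __init__(self, name, disk, shared_lock):
        self.name = name
        self.disk = disk
        self.lock = shared_lock   # no bitmap is cached on the host

    def allocate(self):
        with self.lock:
            bitmap = list(self.disk.bitmap)   # reread from disk every time
            block = bitmap.index(0)
            bitmap[block] = 1
            self.disk.bitmap = bitmap         # write back before unlocking
            return block


lock = threading.Lock()
disk = SharedDisk(8)
host_a = Host("A", disk, lock)
host_b = Host("B", disk, lock)

print(host_a.allocate())   # 0
print(host_b.allocate())   # 1
```

Each allocation now costs a bitmap read and a bitmap write inside the critical section, which is the price the option accepts on the grounds that new allocations are rare.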

> bitmap and then writes the data. The only locking used is file system
> specific. For example NTFS has a per mounted volume rw_semaphore to
> synchronize accesses to the disk block bitmap. But other file systems most
> certainly implement this differently...

Then they will have to be patched to do it generically ..?

> 2) Some file systems cache the metadata. For example in NTFS the

This seems like a pretty valid objection!

> disk block bitmap is stored inside a normal file called $Bitmap. Thus NTFS
> uses the page cache to access the block bitmap and this means that when

This is the same objection as your first objection, I think, except
made particular. My response must therefore be the same - make the
bitmap operations pass through VFS at least, and add a METADIRECT
flag that makes the information be reread when it is needed.

The question is how best to force it, or if the data should be shared
via the VFS's directly (I can handle that - I can make a fake device
that contains the bitmap data, for example).

> new blocks are allocated, we take the volume specific rw_semaphore and
> then we search the page cache of $Bitmap for zero bits, set the
> required number of bits to one, and then we drop the rw_semaphore and
> return which blocks were allocated to the calling ntfs function.

I'm not sure what relevance the semaphore has. I'm advocating that the
bitmaps ops become generic, which automatically gives the opportunity
for generic locking mechanisms.

> Even if you modified the ntfs driver so that the two hosts accessing the
> same device would share the same rw_semaphore, it still wouldn't work,

I won't modify it except to use new generic ops  instead of fs
particular ones. One could say that only FS's which use the
generic VFS ops are suitable candidates to BE fs's on a shared device.
Then it ceases to be a problem, and becomes a desired goal.

> And this is just the tip of the iceberg. The only way you could get

Well, how much more is there? What you mentioned  didn't worry me
because it wasn't a generic strategic objection.

> something like this to work is by modifying each and every file system
> driver to use some VFS provided mechanism for all (de-)allocations, both

Yes. Precisely. There is nothing wrong with that.

> disk block, and inode ones. Further you would need to provide shared
> memory, i.e. the two hosts need to share the same page cache / address

Well, that I don't know about. Can you elaborate a bit on that? I'm not
at all sure that is the case. Can you provide another of your very
useful concretizations?

> space mappings. So basically, it can only work if the two hosts are
> virtually the same host, i.e. if the two hosts are part of a Single System
> Image Cluster...

Thank you! I find that input very enlightening.

Peter


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 16:29       ` Peter T. Breuer
  2002-09-03 16:33         ` Rik van Riel
@ 2002-09-03 17:32         ` Richard B. Johnson
  2002-09-03 18:53         ` Lars Marowsky-Bree
  2 siblings, 0 replies; 28+ messages in thread
From: Richard B. Johnson @ 2002-09-03 17:32 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: Rik van Riel, linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> "Richard B. Johnson wrote:"
> > On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> > > It's not that hard - the locks are held on the remote disk by a
> > > "guardian" driver, to which the drivers on both of the kernels
> > > communicate.  A fake "scsi adapter", if you prefer.
> > > 
> > > > You really need file-system support.
> 
> > Lets say you have a perfect locking mechanism, a fake SCSI layer
> 
> OK.
> 
> > as you state. You are now going to create a new file on the
> > shared block device. You are careful that you use only space
> > that you "own", etc., so you perfectly create a new file on
> > your VFS.
> 
> OK.
> 
> > How does the other user's of this device "know" that there is
> > a new file so it can update its notion of the block-device state?
> 
> The block device itself is stateless at the block level. Every block
> access goes "direct to the metal".
> 

Well it doesn't. In particular SCSI and Fire-wire Drives have data
queued and, to give the CPU something to do while the writes are
occurring, the block-layer sleeps. So, you can have some other
task "think" wrong about the state of the machine.

> The question is how much FS state is cached on either kernel.
> If it is too much, then I will ask how I can cause to be less, perhaps
> by use of a flag that parallels how O_DIRECT works.  I thought that new
> files were entries in a directories inode and I agree that inodes are
> held in memory!  But I don't know when they are first read or reread.

Unless you unmount/re-mount, they will not be re-read. That's why you
need to "share" at the file-system level. FYI, it's already being
done and clustered disks were first done by DEC under RSX/11, then
under VAX/VMS. It's truly "old-hat".

> The directory entry would certainly have to be reread after a write
> operation on disk that touched it - or more simply, the directory entry
> would hvae to be reread every time it were needed, i.e. be uncached.
> 
> If that presently is not possible, then I would like to think about
> making it possible. Isn't there some kind of inode reading that goes on
> at mount? Can I cause it to happen (or unhappen) at will?
> 

Yes but you have a problem with synchronization. You need to synchronize
a file-system at the file-system level so that one process accessing the
file-system, obtains the exact same image as any other process.

> > You have created perfect isolation so, by definition, the other
> > isolated user's don't know that you have just used space that they
> > think that they own.
> 
> Well, I don't think that's a fair analogy .. if a "reserve_blocks"
> call is added to VFS, then I can use it to prelock the "space that
> they think they own", and prevent contention. The question is how
> each FS does the block reservation, and why it should not go through
> a generic method in the VFS layer.
> 
> > Now, the notion of a complete 'file-system' for support may not be
> > required. What you need is like a file-system without all the frills.
> 
> I think that's the wrong tack, though simply _disabling_ some
> operations initially (such as making new files!) may be the way to go.
> Just enable more ops as generic support is added.

Well, if you don't make new files, and you don't update any file-data,
then you just mount R/O and be done with it. When a FS is mounted
R/O, one doesn't care about atomicity anymore, only performance.

Once you allow a file's contents to be altered, you have the problem
of making certain that every process's notion of the file contents
is identical. Again, that's done at the file-system layer, not at
some block layer.

> 
> > FYI, the "librarian" layer is the file-system so, I have shown that
> > you need file-system support.
> 
> Nice try - your argument reduces to saying that the state of the
> directory inodes must be shared. I agree and suggest two remedies
> 
>   1) maintain no directory inode state, but reread them every time
>      (how?)

If you don't maintain some kind of state, you end up reading all
directory inodes. I don't think you want that. You need to maintain
that "directory inode state" and that's what a file-system does.

>   2) force rereading of a particular inode or all inodes when
>      signalled to do so.

The signaler needs to "know". Which means that somebody is maintaining
the file-system state. You shouldn't have to re-invent file-systems to
do that. You just maintain synchronization at the file-system level and
be done with it.

> 
> I would prefer (1). It seems in the spirit of O_DIRECT. I imagine that
> (2) is presently easy to do (but of course horrible).
> 
> Peter

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 16:29       ` Peter T. Breuer
  2002-09-03 16:33         ` Rik van Riel
  2002-09-03 17:32         ` Richard B. Johnson
@ 2002-09-03 18:53         ` Lars Marowsky-Bree
  2002-09-03 21:07           ` Peter T. Breuer
  2 siblings, 1 reply; 28+ messages in thread
From: Lars Marowsky-Bree @ 2002-09-03 18:53 UTC (permalink / raw)
  To: Peter T. Breuer, root; +Cc: Rik van Riel, linux kernel

On 2002-09-03T18:29:02,
   "Peter T. Breuer" <ptb@it.uc3m.es> said:

> > Lets say you have a perfect locking mechanism, a fake SCSI layer
> OK.

BTW, I would like to see your perfect distributed locking mechanism.


> The directory entry would ceryainly have to be reread after a write
> operation on disk that touched it - or more simply, the directory entry
> would hvae to be reread every time it were needed, i.e. be uncached.

*ouch* Sure. Right. You just have to read it from scratch every time. How
would you make readdir work?

> If that presently is not possible, then I would like to think about
> making it possible.

Just please, tell us why.



Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
Immortality is an adequate definition of high availability for me.
	--- Gregory F. Pfister



* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 18:53         ` Lars Marowsky-Bree
@ 2002-09-03 21:07           ` Peter T. Breuer
  2002-09-03 21:15             ` Andreas Dilger
                               ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Peter T. Breuer @ 2002-09-03 21:07 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: Peter T. Breuer, root, Rik van Riel, linux kernel

"A month of sundays ago Lars Marowsky-Bree wrote:"
> On 2002-09-03T18:29:02,
>    "Peter T. Breuer" <ptb@it.uc3m.es> said:
> 
> > > Lets say you have a perfect locking mechanism, a fake SCSI layer
> > OK.
> 
> BTW, I would like to see your perfect distributed locking mechanism.

That bit's easy and is done. The "trick" is NOT to distribute the lock,
but to have it in one place - on the driver that guards the remote
disk resource.

> > The directory entry would certainly have to be reread after a write
> > operation on disk that touched it - or more simply, the directory entry
> > would have to be reread every time it were needed, i.e. be uncached.
> 
> *ouch* Sure. Right. You just have to read it from scratch every time. How
> would you make readdir work?

Well, one has to read it from scratch. I'll set about seeing how to do it.
Clues welcome.

> > If that presently is not possible, then I would like to think about
> > making it possible.
> 
> Just please, tell us why.

You don't really want the whole rationale. It concerns certain 
european (nay, world ..) scientific projects and the calculations of the
technologists about the progress in hardware over the next few years.
We/they foresee that we will have to move to multiple relatively small
distributed disks per node in order to keep the bandwidth per unit of
storage at the levels that they will have to be at to keep the farms
fed.  We are talking petabytes of data storage in thousands of nodes
moving over gigabit networks.

The "big view" calculations indicate that we must have distributed
shared writable data.

These calculations affect us all. They show us what way computing
will evolve under the price and technology pressures. The calculations
are only looking to 2006, but that's what they show. For example
if we think about a 5PB system made of 5000 disks of 1TB each in a GE
net, we calculate the aggregate bandwidth available in the topology as
50GB/s, which is less than we need in order to keep the nodes fed
at the rates they could be fed at (yes, a few % loss translates into
time and money).  To increase available bandwidth we must have more
channels to the disks, and more disks, ... well, you catch my drift.
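
To make the quoted figures concrete, here is the arithmetic as a small sketch. It assumes decimal units (1 TB = 10^12 bytes) and a raw gigabit-Ethernet payload rate of 125 MB/s per link; both are assumptions added here, not numbers from the calculations referred to above:

```python
TB = 10**12
GB = 10**9
MB = 10**6

disks = 5000
disk_size = 1 * TB
aggregate_bw = 50 * GB      # bytes/s, the aggregate figure quoted above
ge_link_bw = 125 * MB       # assumed raw rate of one gigabit link

capacity = disks * disk_size
per_disk_bw = aggregate_bw / disks
saturated_links = aggregate_bw / ge_link_bw
full_pass_hours = capacity / aggregate_bw / 3600

print(capacity // (1000 * TB), "PB total")           # 5 PB total
print(per_disk_bw / MB, "MB/s per disk")             # 10.0 MB/s per disk
print(saturated_links, "saturated GE links")         # 400.0 saturated GE links
print(round(full_pass_hours, 1), "hours per full pass")  # 27.8 hours per full pass
```

In other words, at 50GB/s each disk contributes only about 10 MB/s on average, and one pass over the whole store takes more than a day.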

So, start thinking about general mechanisms to do distributed storage.
Not particular FS solutions.

Peter


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 21:07           ` Peter T. Breuer
@ 2002-09-03 21:15             ` Andreas Dilger
  2002-09-03 21:15             ` Rik van Riel
                               ` (4 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Andreas Dilger @ 2002-09-03 21:15 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: Lars Marowsky-Bree, root, Rik van Riel, linux kernel

On Sep 03, 2002  23:07 +0200, Peter T. Breuer wrote:
> You don't really want the whole rationale. It concerns certain 
> european (nay, world ..) scientific projects and the calculations of the
> technologists about the progress in hardware over the next few years.
> We/they foresee that we will have to move to multiple relatively small
> distributed disks per node in order to keep the bandwidth per unit of
> storage at the levels that they will have to be at to keep the farms
> fed.  We are talking petabytes of data storage in thousands of nodes
> moving over gigabit networks.
> 
> The "big view" calculations indicate that we must have distributed
> shared writable data.
> 
> These calculations affect us all. They show us what way computing
> will evolve under the price and technology pressures. The calculations
> are only looking to 2006, but that's what they show. For example
> if we think about a 5PB system made of 5000 disks of 1TB each in a GE
> net, we calculate the aggregate bandwidth available in the topology as
> 50GB/s, which is less than we need in order to keep the nodes fed
> at the rates they could be fed at (yes, a few % loss translates into
> time and money).  To increase available bandwidth we must have more
> channels to the disks, and more disks, ... well, you catch my drift.
> 
> So, start thinking about general mechanisms to do distributed storage.
> Not particular FS solutions.

Please see lustre.org.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/



* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 21:07           ` Peter T. Breuer
  2002-09-03 21:15             ` Andreas Dilger
@ 2002-09-03 21:15             ` Rik van Riel
  2002-09-03 21:54             ` Anton Altaparmakov
                               ` (3 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2002-09-03 21:15 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: Lars Marowsky-Bree, root, linux kernel

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> The "big view" calculations indicate that we must have distributed
> shared writable data.

Agreed.  Note that the same big view also dictates that any such
solution must have good performance.

Do you need any more reasons for having special cluster filesystems
instead of trying to add clustering to already existing filesystems ?

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/



* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 21:07           ` Peter T. Breuer
  2002-09-03 21:15             ` Andreas Dilger
  2002-09-03 21:15             ` Rik van Riel
@ 2002-09-03 21:54             ` Anton Altaparmakov
  2002-09-03 22:46               ` Andreas Dilger
  2002-09-03 23:19               ` Daniel Phillips
  2002-09-04  7:16             ` Helge Hafting
                               ` (2 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: Anton Altaparmakov @ 2002-09-03 21:54 UTC (permalink / raw)
  To: ptb; +Cc: Lars Marowsky-Bree, Peter T. Breuer, root, Rik van Riel,
	linux kernel

At 22:07 03/09/02, Peter T. Breuer wrote:
>"A month of sundays ago Lars Marowsky-Bree wrote:"
> > On 2002-09-03T18:29:02,
> >    "Peter T. Breuer" <ptb@it.uc3m.es> said:
> > > If that presently is not possible, then I would like to think about
> > > making it possible.
> >
> > Just please, tell us why.
>
>You don't really want the whole rationale. It concerns certain
>european (nay, world ..) scientific projects and the calculations of the
>technologists about the progress in hardware over the next few years.
>We/they foresee that we will have to move to multiple relatively small
>distributed disks per node in order to keep the bandwidth per unit of
>storage at the levels that they will have to be at to keep the farms
>fed.  We are talking petabytes of data storage in thousands of nodes
>moving over gigabit networks.
>
>The "big view" calculations indicate that we must have distributed
>shared writable data.
>
>These calculations affect us all. They show us what way computing
>will evolve under the price and technology pressures. The calculations
>are only looking to 2006, but that's what they show. For example
>if we think about a 5PB system made of 5000 disks of 1TB each in a GE
>net, we calculate the aggregate bandwidth available in the topology as
>50GB/s, which is less than we need in order to keep the nodes fed
>at the rates they could be fed at (yes, a few % loss translates into
>time and money).  To increase available bandwidth we must have more
>channels to the disks, and more disks, ... well, you catch my drift.
>
>So, start thinking about general mechanisms to do distributed storage.
>Not particular FS solutions.

Hm, I believe you are barking up the wrong tree. Either you are omitting 
too much information in your statement above or you are contradicting 
yourself.

What you are looking for is _exactly_ particular FS solution(s)! And in 
particular you are looking for a truly distributed file system.

I just get the impression you are not fully aware what a distributed FS 
(call it DFS for short) actually is.

In my understanding a DFS offers exactly what you need: each node has disks 
and all disks on all nodes are part of the very same file system. Of course 
each node maintains the local disks, i.e. the local part of the file system 
and certain operations require that the nodes communicate with the "DFS 
master node(s)" in order for example to reserve blocks of disks or to 
create/rename files (need to make sure no duplicate filenames are 
instantiated for example). -- Sound familiar so far? You wanted to do 
exactly the same things but at the block layer and the VFS layer levels 
instead of the FS layer...

The difference between a DFS and your proposal is that a DFS maintains all 
the caching benefits of a normal FS at the local node level, while your 
proposal completely and entirely disables caching, which is debatably 
impossible (due to need to load things into ram to read them and to modify 
them and then write them back) and certainly no FS author will accept their 
FS driver to be crippled in such a way. The performance loss incurred by 
removing caching completely is going to make sure you will only be dreaming 
of those 50GiB/sec. More likely you will be getting a few bytes/sec... (OK, 
I exaggerate a bit.) The seek times on the disks together with the 
read/write timings are going to completely annihilate performance. A DFS 
maintains caching at local node level, so you can still keep open inodes in 
memory for example (just don't allow any other node to open the same file 
at the same time or you need to do some juggling via the "Master DFS node").

To give you an analogy, you can think of a DFS like a NUMA machine, where 
you have different access speeds to different parts of memory (for DFS the 
"storage device", same thing really) and where decision on where to store 
things are decided depending on the resource/time cost involved. Simplest 
example: A file created on node A, will be allocated/written to a disk (or 
multiple disks) located on node A, because accessing the local disks has a 
lower time cost compared to going to a different node over the slower wire.
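
The placement rule in that example can be sketched in a few lines. The function name, the cost values, and the free-space table are all hypothetical, chosen only to illustrate "prefer the local node unless it cannot hold the file":

```python
def pick_node(creator, nodes, free_blocks, needed, local_cost=1, remote_cost=10):
    """Return the cheapest node that can hold `needed` blocks.

    Cost is `local_cost` for the node creating the file and `remote_cost`
    for every other node (made-up numbers standing in for access time).
    """
    candidates = [n for n in nodes if free_blocks[n] >= needed]
    if not candidates:
        raise RuntimeError("no node has enough free space")
    # Ties are broken by node name so the choice is deterministic.
    return min(candidates,
               key=lambda n: (local_cost if n == creator else remote_cost, n))


nodes = ["A", "B", "C"]
print(pick_node("A", nodes, {"A": 100, "B": 100, "C": 100}, needed=10))  # A
print(pick_node("A", nodes, {"A": 5, "B": 100, "C": 100}, needed=10))    # B
```

A real DFS would weigh more than one cost (load, replication, network distance), but the decision has the same shape.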

Your time would be much better spent in creating the _one_ true DFS, or 
helping improve one of the existing ones instead of trying to hack up the 
VFS/block layers to pieces. It almost certainly will be a hell of a lot 
less work to implement a decent DFS in comparison to changing the block 
layer, the VFS, _and_ every single FS driver out there to comply with the 
block layer and VFS changes. And at the same time you get exactly the same 
features you wanted to have but with hugely boosted performance.

I hope my ramblings made some kind of sense...

Best regards,

         Anton

-- 
   "I've not lost my mind. It's backed up on tape somewhere." - Unknown
-- 
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 21:54             ` Anton Altaparmakov
@ 2002-09-03 22:46               ` Andreas Dilger
  2002-09-03 23:19               ` Daniel Phillips
  1 sibling, 0 replies; 28+ messages in thread
From: Andreas Dilger @ 2002-09-03 22:46 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: ptb, Lars Marowsky-Bree, root, Rik van Riel, linux kernel

On Sep 03, 2002  22:54 +0100, Anton Altaparmakov wrote:
> In my understanding a DFS offers exactly what you need: each node has disks 
> and all disks on all nodes are part of the very same file system. Of course 
> each node maintains the local disks, i.e. the local part of the file system 
> and certain operations require that the nodes communicate with the "DFS 
> master node(s)" in order for example to reserve blocks of disks or to 
> create/rename files (need to make sure no duplicate filenames are 
> instantiated for example). -- Sound familiar so far? You wanted to do 
> exactly the same things but at the block layer and the VFS layer levels 
> instead of the FS layer...
> 
> The difference between a DFS and your proposal is that a DFS maintains all 
> the caching benefits of a normal FS at the local node level, while your 
> proposal completely and entirely disables caching, which is debatably 
> impossible (due to need to load things into ram to read them and to modify 
> them and then write them back) and certainly no FS author will accept their 
> FS driver to be crippled in such a way. The performance loss incurred by 
> removing caching completely is going to make sure you will only be dreaming 
> of those 50GiB/sec. More likely you will be getting a few bytes/sec... (OK, 
> I exaggerate a bit.) The seek times on the disks together with the 
> read/write timings are going to completely annihilate performance. A DFS 
> maintains caching at local node level, so you can still keep open inodes in 
> memory for example (just don't allow any other node to open the same file 
> at the same time or you need to do some juggling via the "Master DFS node").
> 
> Your time would be much better spent in creating the _one_ true DFS, or 
> helping improve one of the existing ones instead of trying to hack up the 
> VFS/block layers to pieces. It almost certainly will be a hell of a lot 
> less work to implement a decent DFS in comparison to changing the block 
> layer, the VFS, _and_ every single FS driver out there to comply with the 
> block layer and VFS changes. And at the same time you get exactly the same 
> features you wanted to have but with hugely boosted performance.

This is exactly what Lustre is supposed to be.  Many nodes, each with
local storage, and clients are able to do I/O directly to the storage
nodes (for non-local storage, or if they have no local storage at all).

There is (currently) a single metadata server (MDS) which controls the
directory tree locking, and the storage nodes control the locking of
inodes (objects) local to their storage.

It's not quite in a robust state yet, but we're working on it.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/



* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 21:54             ` Anton Altaparmakov
  2002-09-03 22:46               ` Andreas Dilger
@ 2002-09-03 23:19               ` Daniel Phillips
  2002-09-04  0:18                 ` Anton Altaparmakov
  2002-09-04  5:23                 ` David Lang
  1 sibling, 2 replies; 28+ messages in thread
From: Daniel Phillips @ 2002-09-03 23:19 UTC (permalink / raw)
  To: Anton Altaparmakov, ptb
  Cc: Lars Marowsky-Bree, Peter T. Breuer, root, Rik van Riel,
	linux kernel

On Tuesday 03 September 2002 23:54, Anton Altaparmakov wrote:
> The difference between a DFS and your proposal is that a DFS maintains all 
> the caching benefits of a normal FS at the local node level, while your 
> proposal completely and entirely disables caching, which is debatably 
> impossible (due to need to load things into ram to read them and to modify 
> them and then write them back) and certainly no FS author will accept their 
> FS driver to be crippled in such a way. The performance loss incurred by 
> removing caching completely is going to make sure you will only be dreaming 
> of those 50GiB/sec. More likely you will be getting a few bytes/sec... (OK, 
> I exaggerate a bit.) The seek times on the disks together with the 
> read/write timings are going to completely annihilate performance. A DFS 
> maintains caching at local node level, so you can still keep open inodes in 
> memory for example (just don't allow any other node to open the same file 
> at the same time or you need to do some juggling via the "Master DFS node").

You're well wide of the mark here, in that you're relying on the assumption
that caching is important to the application he has in mind.  The raw transfer
bandwidth may well be sufficient, especially if it is unimpeded by being
funneled through a bottleneck like our vfs cache.

-- 
Daniel


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 23:19               ` Daniel Phillips
@ 2002-09-04  0:18                 ` Anton Altaparmakov
  2002-09-04  5:23                 ` David Lang
  1 sibling, 0 replies; 28+ messages in thread
From: Anton Altaparmakov @ 2002-09-04  0:18 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: ptb, Lars Marowsky-Bree, Peter T. Breuer, root, Rik van Riel,
	linux kernel

At 00:19 04/09/02, Daniel Phillips wrote:
>On Tuesday 03 September 2002 23:54, Anton Altaparmakov wrote:
> > The difference between a DFS and your proposal is that a DFS maintains all
> > the caching benefits of a normal FS at the local node level, while your
> > proposal completely and entirely disables caching, which is debatably
> > impossible (due to need to load things into ram to read them and to modify
> > them and then write them back) and certainly no FS author will accept 
> their
> > FS driver to be crippled in such a way. The performance loss incurred by
> > removing caching completely is going to make sure you will only be 
> dreaming
> > of those 50GiB/sec. More likely you will be getting a few bytes/sec... 
> (OK,
> > I exaggerate a bit.) The seek times on the disks together with the
> > read/write timings are going to completely annihilate performance. A DFS
> > maintains caching at local node level, so you can still keep open 
> inodes in
> > memory for example (just don't allow any other node to open the same file
> > at the same time or you need to do some juggling via the "Master DFS 
> node").
>
>You're well wide of the mark here, in that you're relying on the assumption
>that caching is important to the application he has in mind.  The raw transfer
>bandwidth may well be sufficient, especially if it is unimpeded by being
>funneled through a bottleneck like our vfs cache.

I don't think I am. I think we just define "caching" differently. The "raw 
transfer bandwidth" will be close to zero if no caching happens at all. I 
agree with you if you define caching as data caching. But both Peter and I 
are talking about metadata caching + data caching. Sure, you can throw data 
caching out the window and actually gain performance. I would never dispute 
that. But if you throw away metadata caching you destroy performance. Maybe 
not on "simplistic" file systems like ext2 but certainly so on complex ones 
like ntfs... I described already what a single read in ntfs entails if no 
metadata caching happens. I doubt very much that there is a possible 
scenario where not doing any metadata caching would improve performance (on 
ntfs and at a guess many other fs). Even a sequential read or write from 
start of file to end of file would be really killed without caching of the 
logical to physical block mapping table for the inode being read/written on 
ntfs...
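
A toy count of device reads shows the shape of the problem. The model and the numbers are invented for illustration: it assumes that resolving a logical block to a physical one costs a fixed number of metadata reads whenever the mapping is not cached, which is far simpler than what any real filesystem does:

```python
def device_reads(file_blocks, metadata_reads_per_lookup, cache_mapping):
    """Count device reads for one sequential pass over a file.

    With `cache_mapping` the logical-to-physical mapping is read once and
    kept; without it the mapping is re-read from disk for every data block.
    """
    reads = 0
    mapping_cached = False
    for _ in range(file_blocks):
        if not (cache_mapping and mapping_cached):
            reads += metadata_reads_per_lookup  # walk the on-disk mapping
            mapping_cached = True
        reads += 1                              # the data block itself
    return reads


print(device_reads(1000, 3, cache_mapping=True))   # 1003
print(device_reads(1000, 3, cache_mapping=False))  # 4000
```

Even in this simple model the uncached pass issues about four times as many reads, and the extra ones are seeks to metadata rather than sequential data transfers.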

So we aren't in disagreement I think. (-:

Best regards,

         Anton

-- 
   "I've not lost my mind. It's backed up on tape somewhere." - Unknown
-- 
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 23:19               ` Daniel Phillips
  2002-09-04  0:18                 ` Anton Altaparmakov
@ 2002-09-04  5:23                 ` David Lang
  1 sibling, 0 replies; 28+ messages in thread
From: David Lang @ 2002-09-04  5:23 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Anton Altaparmakov, Peter T. Breuer, Lars Marowsky-Bree, root,
	Rik van Riel, linux kernel

On Wed, 4 Sep 2002, Daniel Phillips wrote:

>
> You're well wide of the mark here, in that you're relying on the assumption
> that caching is important to the application he has in mind.  The raw transfer
> bandwidth may well be sufficient, especially if it is unimpeded by being
> funneled through a bottleneck like our vfs cache.
>

the fact that he is saying that this needs to run normal filesystems tells
us that.

if you need a filesystem to max out transfer rate and don't want to have
it cache things that is a VERY specialized thing and not something that
will match what NTFS/XFS/JFS/ReiserFS/ext2 etc are going to be used for.

either he has a very specialized need (in which case a specialized
filesystem is probably the best bet anyway) or he is trying to support
normal uses (in which case caching is important)

however the point is that the read-modify-write cycle is a form of cache,
it is only safe if you acquire a lock at the beginning of it and release it
at the end. A standard filesystem won't do this, this is what makes a DFS.
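
a small sketch of why the lock has to span the whole cycle. The dictionary stands in for one on-disk block and the threading lock stands in for a cluster-wide lock; both are stand-ins invented for the example, and the unlocked case is written out as an explicit interleaving so the result is deterministic:

```python
import threading

disk = {"free_blocks": 100}      # stand-in for one on-disk block
disk_lock = threading.Lock()     # stand-in for a cluster-wide lock


def unlocked_interleaving():
    """Both nodes read before either writes, so one update is lost."""
    a = disk["free_blocks"]      # node A reads
    b = disk["free_blocks"]      # node B reads the same value
    disk["free_blocks"] = a - 1  # node A writes back
    disk["free_blocks"] = b - 1  # node B overwrites node A's update
    return disk["free_blocks"]


def locked_allocate():
    """The lock is held from the read until the write has completed."""
    with disk_lock:
        value = disk["free_blocks"]
        disk["free_blocks"] = value - 1


print(unlocked_interleaving())   # 99, although two blocks were handed out

disk["free_blocks"] = 100
threads = [threading.Thread(target=locked_allocate) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(disk["free_blocks"])       # 98
```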

David Lang


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 21:07           ` Peter T. Breuer
                               ` (2 preceding siblings ...)
  2002-09-03 21:54             ` Anton Altaparmakov
@ 2002-09-04  7:16             ` Helge Hafting
  2002-09-04  8:39               ` Andreas Dilger
  2002-09-04  8:41               ` Peter T. Breuer
  2002-09-04  7:50             ` Joachim Breuer
  2002-09-04  9:26             ` Lars Marowsky-Bree
  5 siblings, 2 replies; 28+ messages in thread
From: Helge Hafting @ 2002-09-04  7:16 UTC (permalink / raw)
  To: ptb, linux-kernel

"Peter T. Breuer" wrote:
> 
> "A month of sundays ago Lars Marowsky-Bree wrote:"
> > On 2002-09-03T18:29:02,
> >    "Peter T. Breuer" <ptb@it.uc3m.es> said:
> >
> > > > Lets say you have a perfect locking mechanism, a fake SCSI layer
> > > OK.
> >
> > BTW, I would like to see your perfect distributed locking mechanism.
> 
> That bit's easy and is done. The "trick" is NOT to distribute the lock,
> but to have it in one place - on the driver that guards the remote
> disk resource.
> 
> > > The directory entry would certainly have to be reread after a write
> > > operation on disk that touched it - or more simply, the directory entry
> > > would have to be reread every time it were needed, i.e. be uncached.
> >
> > *ouch* Sure. Right. You just have to read it from scratch every time. How
> > would you make readdir work?
> 
> Well, one has to read it from scratch. I'll set about seeing how to do it.
> Clues welcome.
> 
> > > If that presently is not possible, then I would like to think about
> > > making it possible.
> >
> > Just please, tell us why.
> 
> You don't really want the whole rationale. It concerns certain
> european (nay, world ..) scientific projects and the calculations of the
> technologists about the progress in hardware over the next few years.
> We/they foresee that we will have to move to multiple relatively small
> distributed disks per node in order to keep the bandwidth per unit of
> storage at the levels that they will have to be at to keep the farms
> fed.  We are talking petabytes of data storage in thousands of nodes
> moving over gigabit networks.
> 
> The "big view" calculations indicate that we must have distributed
> shared writable data.
> 
Increasing demands for performance may indeed force a need
for shared writeable data someday.  Several solutions for that are
being developed.
Your idea about re-reading stuff over and over isn't going to help 
because that sort of thing consumes much more bandwidth. Caches help
because they _avoid_ data transfers.  So shared writeable data
will happen, and it will use some sort of cache coherency,
for performance reasons.

> These calculations affect us all. They show us what way computing
> will evolve under the price and technology pressures. The calculations
> are only looking to 2006, but that's what they show. For example
> if we think about a 5PB system made of 5000 disks of 1TB each in a GE
> net, we calculate the aggregate bandwidth available in the topology as
> 50GB/s, which is less than we need in order to keep the nodes fed
> at the rates they could be fed at (yes, a few % loss translates into
> time and money).  To increase available bandwidth we must have more
> channels to the disks, and more disks, ... well, you catch my drift.
> 
> So, start thinking about general mechanisms to do distributed storage.
> Not particular FS solutions.
Distributed systems will need somewhat different solutions, because
they are fundamentally different.  Existing fs'es like ext2 are built
around a single-node assumption.  I claim that making a new fs from
scratch for the distributed case is easier than tweaking ext2
and 10-20 other existing fs'es to work in such an environment. 
Making a new fs from scratch isn't such a big deal after all.

To make a historical parallel:
Data used to be stored on sequential media like tapes (or
even stacks of punched cards).  Filesystems were developed
for tapes.  Then they made disks.
Using a disk as a tape with the existing tape-fs'es
worked, but didn't give much benefit.  So we got something
new - block-based filesystems designed to take advantage
of the new random-access media.

The case of distributed storage is similar: it is fundamentally
different from the one-node case just as random-access media
were different from sequential.

I think a new design that considers both the benefits and
problems of many nodes will be much better than trying to 
patch the existing fs'es.  An approach that starts with
throwing away the thousand-fold speedup provided by caching
isn't very convincing.  

If you merely proposed making the VFS and existing fs'es
cache-coherent, then I'd agree it might work well, but
it'd be a _lot_ of work.  Which is no problem
if you volunteer to do the work.  But simplification
by throwing away caching _will_ be too slow, it certainly
doesn't fit the idea of getting more bandwidth.

More bandwidth won't help if you throw all of it and then some
away on massive re-reading of data.

Wanting a generic mechanism instead of a special fs 
might be the way to go, but it'd be a generic mechanism
used by a bunch of new fs'es designed to work distributed.

There will probably be different needs for which people
will build different distributed fs'es.  So a 
"VDFS" makes sense for those fs'es, putting the common stuff
in one place.  But I am sure the VDFS will contain cache
coherency calls for dropping pages from cache when 
necessary, instead of dropping the cache unconditionally
in every case.

Helge Hafting


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 21:07           ` Peter T. Breuer
                               ` (3 preceding siblings ...)
  2002-09-04  7:16             ` Helge Hafting
@ 2002-09-04  7:50             ` Joachim Breuer
       [not found]               ` <3D75F8B0.8C7E974E@aitel.hist.no>
  2002-09-04  9:26             ` Lars Marowsky-Bree
  5 siblings, 1 reply; 28+ messages in thread
From: Joachim Breuer @ 2002-09-04  7:50 UTC (permalink / raw)
  To: ptb; +Cc: linux kernel

"Peter T. Breuer" <ptb@it.uc3m.es> writes:

> "A month of sundays ago Lars Marowsky-Bree wrote:"
>> On 2002-09-03T18:29:02,
>>    "Peter T. Breuer" <ptb@it.uc3m.es> said:
>>
>> > The directory entry would certainly have to be reread after a write
>> > operation on disk that touched it - or more simply, the directory entry
>> > would have to be reread every time it were needed, i.e. be uncached.
>> 
>> *ouch* Sure. Right. You just have to read it from scratch every time. How
>> would you make readdir work?
>
> Well, one has to read it from scratch. I'll set about seeing how to do it.
> Clues welcome.

Just an idea, I don't know how well this works what with the 'IDE
can't do write barriers right' and related effects:

- Allow all nodes to cache as many blocks as they wish
- The atomic operation "update this block" includes "invalidate this
  block, if cached" broadcast to all nodes

Performance would certainly become an issue; depending on the
architecture bus sniffing as in certain MP cache consistency protocols
might be feasible (I, node 3, see a transfer from node 1 going to
block #42, which is in my cache; so I update my cache using the data
part of the block transfer as it comes by on the bus).
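
A rough sketch of both variants, with the "bus" reduced to a loop over the other nodes. The Node class and the two write functions are hypothetical names made up for this illustration, and the model ignores ordering, failures, and partial delivery of the broadcast:

```python
class Node:
    """A node with a private block cache."""

    def __init__(self, name):
        self.name = name
        self.cache = {}  # block number -> data

    def read(self, disk, block):
        if block not in self.cache:
            self.cache[block] = disk[block]  # miss: fetch from shared disk
        return self.cache[block]


def write_invalidate(disk, nodes, writer, block, data):
    """Update the block and tell every other node to drop its copy."""
    disk[block] = data
    writer.cache[block] = data
    for node in nodes:
        if node is not writer:
            node.cache.pop(block, None)


def write_update(disk, nodes, writer, block, data):
    """Snooping variant: nodes that cache the block refresh it in place."""
    disk[block] = data
    writer.cache[block] = data
    for node in nodes:
        if node is not writer and block in node.cache:
            node.cache[block] = data


disk = {42: "old"}
n1, n2, n3 = Node("node1"), Node("node2"), Node("node3")
nodes = [n1, n2, n3]
n3.read(disk, 42)                             # node 3 caches block 42
write_invalidate(disk, nodes, n1, 42, "new")
print(42 in n3.cache)                         # False: stale copy dropped
print(n3.read(disk, 42))                      # new: re-fetched from disk
write_update(disk, nodes, n1, 42, "newer")
print(n3.cache[42])                           # newer: refreshed by snooping
```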


So long,
   Joe

-- 
"I use emacs, which might be thought of as a thermonuclear
 word processor."
-- Neal Stephenson, "In the beginning... was the command line"


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-04  7:16             ` Helge Hafting
@ 2002-09-04  8:39               ` Andreas Dilger
  2002-09-04 12:07                 ` Helge Hafting
  2002-09-04  8:41               ` Peter T. Breuer
  1 sibling, 1 reply; 28+ messages in thread
From: Andreas Dilger @ 2002-09-04  8:39 UTC (permalink / raw)
  To: Helge Hafting; +Cc: ptb, linux-kernel

On Sep 04, 2002  09:16 +0200, Helge Hafting wrote:
> Your idea about re-reading stuff over and over isn't going to help 
> because that sort of thing consumes much more bandwidth. Caches help
> because they _avoid_ data transfers.  So shared writeable data
> will happen, and it will use some sort of cache coherency,
> for performance reasons.

You assume too much about the applications.  For example, Oracle
does not want _any_ cacheing to be done by the OS, because it
manages the cache itself, and would rather allocate the full amount
of RAM itself instead of the OS duplicating data it is cacheing
internally.

Similarly, there are many "write only" applications that are only
hindered by OS cache, such as any kind of high-speed data recording
(video, particle accelerators, scientific computing, etc) which is
using most of the RAM for internal structures and wants the data it
writes to go directly to disk at the highest possible speed.

> I claim that making a new fs from scratch for the distributed
> case is easier than tweaking ext2 and 10-20 other existing fs'es
> to work in such an environment.  Making a new fs from scratch
> isn't such a big deal after all.

The problem isn't making a new fs, the problem is making a _good_
new fs.  It takes at least several years of development, testing,
tuning, etc to get just a local fs right, if not longer (i.e.
reiserfs, JFS, XFS, ext3, etc).  Add in the complexity of the
network side of things and it just gets that much harder to do
it all well.

We have taken the approach that local filesystems do a good job
with the "one node" assumption, so just use them as-is to
do a job they are good at.  All of the network and locking code
for Lustre is outside of the filesystem, and the "local" filesystems
are used for storing either the directory structure + attributes
(for the metadata server), or file data (for the storage targets).

Local filesystems can do both of those jobs very well already, so
no need to re-invent the wheel.

See http://www.lustre.org/docs.html for lots of papers and
documentation on the design of Lustre.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/


* Re: [RFC] mount flag "direct" (fwd)
  2002-09-04  7:16             ` Helge Hafting
  2002-09-04  8:39               ` Andreas Dilger
@ 2002-09-04  8:41               ` Peter T. Breuer
  1 sibling, 0 replies; 28+ messages in thread
From: Peter T. Breuer @ 2002-09-04  8:41 UTC (permalink / raw)
  To: Helge Hafting; +Cc: ptb, linux-kernel

"A month of sundays ago Helge Hafting wrote:"
> > The "big view" calculations indicate that we must have distributed
> > shared writable data.
> > 
> Increasing demands for performance may indeed force a need
> for shared writeable data someday.  Several solutions for that are
> being developed.
> Your idea about re-reading stuff over and over isn't going to help 

I really don't see why you people don't get it. Rereading is a RARE
operation. Normally we write once and read once. That's all. Once
the data's in memory we use it.

And if we ever have to reread something, it will very very rarely be
metadata.

> because that sort of thing consumes much more bandwidth. Caches help
> because they _avoid_ data transfers.  So shared writeable data

Tough. Data transfers are inevitable in this scenario. There's no
sense in trying to avoid them. Data comes in at A and goes out at B.
Ergo it's transferred.

> > So, start thinking about general mechanisms to do distributed storage.
> > Not particular FS solutions.
> Distributed systems will need somewhat different solutions, because
> they are fundamentally different.  Existing fs'es like ext2 are built
> around a single-node assumption.  I claim that making a new fs from

I am still getting a feel for the problem. Only avoiding directory
caching (and inode caching) has worried me. I looked at the name
lookup routines on the train and I don't see why one can't force a
reread from root every time, or a reread every time there is a
"changed" bit set in the sb.
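
The "changed" bit idea can be sketched as a toy user-space model. This is not kernel code: `SharedDisk`, `CachingNode` and the `generation` counter are hypothetical names invented for illustration, and the counter stands in for whatever marker the superblock would carry. In a real setup, checking that marker would itself cost a read from the shared device.

```python
class SharedDisk:
    """Stands in for the shared device: a directory tree plus a change marker."""

    def __init__(self):
        self.tree = {"etc": {"passwd": "inode-7"}}
        self.generation = 0          # hypothetical "changed" marker in the sb

    def rename(self, parent, old, new):
        parent[new] = parent.pop(old)
        self.generation += 1         # every namespace change bumps the marker


class CachingNode:
    """One kernel's name cache, trusted only while the marker is unchanged."""

    def __init__(self, disk):
        self.disk = disk
        self.cache = {}
        self.seen_generation = disk.generation
        self.walks = 0               # number of full walks from the root

    def lookup(self, path):
        if self.seen_generation != self.disk.generation:
            self.cache.clear()       # something changed: distrust everything
            self.seen_generation = self.disk.generation
        if path not in self.cache:
            self.walks += 1
            entry = self.disk.tree
            for name in path.strip("/").split("/"):
                entry = entry[name]  # KeyError if the name no longer exists
            self.cache[path] = entry
        return self.cache[path]
```

With no changes on the disk, repeated lookups of the same path walk from the root only once. After any rename, the next lookup discards the whole cache and walks again, which is the coarse "reread when the bit is set" behaviour described above.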

> scratch for the distributed case is easier than tweaking ext2

No tweak. But I'm looking.

> The case of distributed storage is similar, it is fundamentally
> different from the one-node case just as random-access media

I agree. But the case of one FS accessed from different nodes is not
fundamentally different from the situation we have now. It requires
locking. It also requires either explicit sharing of cached
information, or no caching (which is the same thing :-). I merely
opine that the latter is easier to try first and may not be so bad.

> If you merely proposed making the VFS and existing fs'es
> cache-coherent, then I'd agree it might work well, but

I'm proposing making no caching _possible_. Not mandatory, but
_possible_. If you like, you can see it as a trivial case of cache
sharing.

> by throwing away caching _will_ be too slow, it certainly

Why? The only thing I've seen mentioned that might slow things down
is that at every open we have to trace the full path anew. So what?
OK, so there are also objections about what happens if one kernel frees
the data and another adds to it. I'm thinking about what that implies.

> There will probably be different needs for which people
> will build different distributed fs'es.  So a 
> "VDFS" makes sense for those fs'es, putting the common stuff
> in one place.  But I am sure the VDFS will contain cache
> coherency calls for dropping pages from cache when 
> necessary, instead of dropping the cache unconditionally
> in every case.

That's possible, but right now I don't know any way of saying to the
kernel "I just stepped all over X on disk, please invalidate anything
you have cached that points to X". I'd like it very much in the
buffering layer too (i.e. vMs).
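
What such an interface might look like can be sketched as a toy model: a cache that remembers which disk blocks each cached object was built from, so that "invalidate anything that points to X" becomes a reverse lookup. The class and method names here are hypothetical and do not correspond to any existing kernel interface.

```python
from collections import defaultdict


class ReverseMappedCache:
    """Cache that can drop every object derived from a given disk block."""

    def __init__(self):
        self.objects = {}                  # key -> cached value
        self.by_block = defaultdict(set)   # disk block -> keys built from it

    def insert(self, key, value, blocks):
        self.objects[key] = value
        for block in blocks:
            self.by_block[block].add(key)

    def invalidate_block(self, block):
        """Drop everything built from `block`; return the keys dropped."""
        dropped = []
        for key in self.by_block.pop(block, set()):
            # A key may already be gone if another of its blocks was hit.
            if self.objects.pop(key, None) is not None:
                dropped.append(key)
        return dropped
```

For example, if an inode and a directory entry were both read from block 100, `invalidate_block(100)` drops both and leaves cached file pages from other blocks alone.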

Peter

* Re: [RFC] mount flag "direct" (fwd)
  2002-09-03 21:07           ` Peter T. Breuer
                               ` (4 preceding siblings ...)
  2002-09-04  7:50             ` Joachim Breuer
@ 2002-09-04  9:26             ` Lars Marowsky-Bree
  5 siblings, 0 replies; 28+ messages in thread
From: Lars Marowsky-Bree @ 2002-09-04  9:26 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: root, Rik van Riel, linux kernel

On 2002-09-03T23:07:01,
   "Peter T. Breuer" <ptb@it.uc3m.es> said:

> > *ouch* Sure. Right. You just have to read it from scratch every time. How
> > would you make readdir work?
> Well, one has to read it from scratch. I'll set about seeing how to do.
> Clues welcome.

Yes, use a distributed filesystem. There are _many_ out there; GFS, OCFS,
OpenGFS, Compaq has one as part of their SSI, Inter-Mezzo (sort of), Lustre,
PvFS etc.

Any of them will appreciate the good work of a bright fellow.

No one appreciates reinventing the wheel another time, especially if - for
simplification - it starts out as a square.

> > Just please, tell us why.
> You don't really want the whole rationale.

Yes, I do.

You tell me why Distributed Filesystems are important. I fully agree.

You fail to give a convincing reason why that must be made to work with
"all" conventional filesystems, especially given the constraints this implies.

Conventional wisdom seems to be that this can much better be handled specially
by special filesystems, which can do finer-grained locking etc. because they
understand the on-disk structures, can do distributed journal recovery etc.

What you are starting would need at least 3-5 years to catch up with what
people currently already can do, and they'll improve in this time too. 

I've seen your academic track record and it is surely impressive. I am not
saying that your approach won't work within the constraints. Given enough
thrust, pigs fly. I'm just saying that it would be nice to learn what reasons
you have for this, because I believe that "within the constraints" makes your
proposal essentially useless (see the other mails).

In particular, they make it useless for the requirements you seem to have. A
petabyte filesystem without journaling? A petabyte filesystem with a single
write lock? Gimme a break.

Please, do the research and tell us what features you desire to have which are
currently missing, and why implementing them essentially from scratch is
preferable to extending existing solutions.

You are dancing around all the hard parts. "Don't have a distributed lock
manager, have one central lock." Yeah, right, has scaled _really_ well in the
past. Then you figure this one out, and come up with a lock-bitmap on the
device itself for locking subtrees of the fs. Next you are going to realize
that a single block is not scalable either because one needs an exclusive write
lock on it, 'cause you can't just rewrite a single bit. You might then begin
to explore that a single bit won't cut it, because for recovery you'll need to
be able to pinpoint all locks a node had and recover them. Then you might
begin to think about the difficulties in distributed lock management and
recovery. ("Transaction processing" is an exceptionally good book on that I
believe)

I bet you a dinner that what you are going to come up with will look
frighteningly like one of the solutions which already exist; so why not
research them first in depth and start working on the one you like most,
instead of wasting time on an academic exercise?

> So, start thinking about general mechanisms to do distributed storage.
> Not particular FS solutions.

Distributed storage needs a way to access it; in the Unix paradigm,
"everything is a file", that implies a distributed filesystem. Other
approaches would include accessing raw blocks and doing the locking in the
application / via a DLM (ie, what Oracle RAC does).

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
Immortality is an adequate definition of high availability for me.
	--- Gregory F. Pfister

* Re: [RFC] mount flag "direct" (fwd)
  2002-09-04  8:39               ` Andreas Dilger
@ 2002-09-04 12:07                 ` Helge Hafting
  2002-09-04 13:03                   ` Hans Reiser
  0 siblings, 1 reply; 28+ messages in thread
From: Helge Hafting @ 2002-09-04 12:07 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-kernel

Andreas Dilger wrote:
> 
> On Sep 04, 2002  09:16 +0200, Helge Hafting wrote:
> > Your idea about re-reading stuff over and over isn't going to help
> > because that sort of thing consumes much more bandwidth. Caches help
> > because they _avoid_ data transfers.  So shared writeable data
> > will happen, and it will use some sort of cache coherency,
> > for performance reasons.
> 
> You assume too much about the applications.  For example, Oracle
> does not want _any_ caching to be done by the OS, because it
> manages the cache itself, and would rather allocate the full amount
> of RAM itself instead of the OS duplicating data it is caching
> internally.
> 
There are things like O_DIRECT for this.  A fine add-on for
some apps, and it doesn't break the fs for all those apps that
like caching.  

An uncached distributed fs is another story.  Having to void
all cache (or no cache at all) whenever some other machine
locks the fs might be just the ticket for some applications,
but I can't see that working for the generic case.  

Which is why
I think a special fs is in order here.  It could possibly start
off as a fork from ext2 (or ntfs or vfat or whatever seems
appropriate) but I cannot see how this sort of thing could be merged.
And why force it into _every_ existing fs?  Does this distributed
scheme really need all of them?

> > I claim that making a new fs from scratch for the distributed
> > case is easier than tweaking ext2 and 10-20 other existing fs'es
> > to work in such an environment.  Making a new fs from scratch
> > isn't such a big deal after all.
> 
> The problem isn't making a new fs, the problem is making a _good_
> new fs.  It takes at least several years of development, testing,
> tuning, etc to get just a local fs right, if not longer (i.e.
> reiserfs, JFS, XFS, ext3, etc).  Add in the complexity of the
> network side of things and it just gets that much harder to do
> it all well.

Making a good new fs might take time, but changing all existing
fs'es to support "no caching when another guy has the lock"
is so invasive that I'd call it a set of new fs'es, and I think
he'll need some time to get that working _well_. 
I believe a special purpose fs for special needs is easier
in this case.
> 
> We have taken the approach that local filesystems do a good job
> with the "one node" assumption, so just use them as-is to
> do a job they are good at.  

A completely different approach, avoiding the trouble of
drastically altering something that works.

Helge Hafting

* Re: [RFC] mount flag "direct" (fwd)
  2002-09-04 12:07                 ` Helge Hafting
@ 2002-09-04 13:03                   ` Hans Reiser
  0 siblings, 0 replies; 28+ messages in thread
From: Hans Reiser @ 2002-09-04 13:03 UTC (permalink / raw)
  To: linux-kernel

I think everyone agrees that you should start with doing it for a 
particular FS, and then after you have done it for one, you will know 
enough about what needs to be done that you can make your case that it 
should be done in VFS.  Frankly, I think that you should either share 
caches between nodes (NUMA), or (somehow, and there are so many ways...) 
divide the workload between the machines such that they don't access the 
same data except in response to failure.

Hans

* Re: [RFC] mount flag "direct" (fwd)
       [not found]               ` <3D75F8B0.8C7E974E@aitel.hist.no>
@ 2002-09-04 21:26                 ` Joachim Breuer
  0 siblings, 0 replies; 28+ messages in thread
From: Joachim Breuer @ 2002-09-04 21:26 UTC (permalink / raw)
  To: Helge Hafting; +Cc: linux-kernel

Helge Hafting <helgehaf@aitel.hist.no> writes:

> Joachim Breuer wrote:
>> > Well, one has to read it from scratch. I'll set about seeing how to do.
>> > Clues welcome.
>> 
>> Just an idea, I don't know how well this works what with the 'IDE
>> can't do write barriers right' and related effects:
>> 
>> - Allow all nodes to cache as many blocks as they wish
>> - The atomic operation "update this block" includes "invalidate this
>>   block, if cached" broadcast to all nodes
>> 
> You can't just invalidate like that, you'll need synchronization.
> Something like "I want write access to this block - stop using it."
> And then you _wait_, until everybody else has released it.  This
> could take some time if someone was busy using it.

More or less what I meant - I assumed the requirement of mutual
exclusion was already agreed upon. I might have phrased it more
clearly as in "the locking protocol provides the "other" nodes with
sensible invalidation data". Something which a "budgeting" type
locking protocol would not normally do (allow each node a range of its
own for exclusive writes; everyone can read everywhere (with a bit
more locking to make the writes look atomic if they aren't already
perhaps)).
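
A minimal single-threaded sketch of the "ask everyone to stop using the block, then write" exchange discussed above. The `Peer` and `Cluster` classes are hypothetical, and a refused request simply returns `False` where a real protocol would block and retry.

```python
class Peer:
    """One node with a private block cache."""

    def __init__(self, name):
        self.name = name
        self.cache = {}
        self.in_use = set()        # blocks this node is actively working on

    def request_release(self, blkno):
        """Acknowledge (True) once the block is dropped; refuse while busy."""
        if blkno in self.in_use:
            return False
        self.cache.pop(blkno, None)
        return True


class Cluster:
    """Nodes sharing one disk, with invalidate-before-write semantics."""

    def __init__(self, peers):
        self.peers = peers
        self.disk = {}

    def read(self, peer, blkno):
        if blkno not in peer.cache:
            peer.cache[blkno] = self.disk.get(blkno)
        return peer.cache[blkno]

    def try_write(self, writer, blkno, data):
        """Write only after every other node has acknowledged the release."""
        acks = [p.request_release(blkno) for p in self.peers if p is not writer]
        if not all(acks):
            return False           # a real protocol would wait here
        self.disk[blkno] = data
        writer.cache[blkno] = data
        return True
```

A node that acknowledged has dropped its copy, so its next read fetches the new contents from disk. A node that is busy with the block holds up the writer, which is the waiting cost pointed out in the quoted reply.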

>> Performance would certainly become an issue; depending on the
>> architecture bus sniffing as in certain MP cache consistency protocols
>> might be feasible (I, node 3, see a transfer from node 1 going to
>> block #42, which is in my cache; so I update my cache using the data
>> part of the block transfer as it comes by on the bus).
>
> Possible only when there is a shared bus.  Of course today's
> IDE/SCSI don't do this.

Well, SCSI can be *made* to do it (actually, this is what I had in
mind when I wrote it up) - but I don't know whether any mainstream
controllers would be able to "sniff" sensibly.


Anyway, so long,
        Joe

-- 
"I use emacs, which might be thought of as a thermonuclear
 word processor."
-- Neal Stephenson, "In the beginning... was the command line"

end of thread

Thread overview: 28+ messages
2002-09-03 15:39 [RFC] mount flag "direct" (fwd) Peter T. Breuer
2002-09-03 15:44 ` Rik van Riel
2002-09-03 15:50   ` Peter T. Breuer
2002-09-03 15:56     ` Chris Wedgwood
2002-09-03 15:59       ` Peter T. Breuer
2002-09-03 16:09     ` Richard B. Johnson
2002-09-03 16:29       ` Peter T. Breuer
2002-09-03 16:33         ` Rik van Riel
2002-09-03 17:32         ` Richard B. Johnson
2002-09-03 18:53         ` Lars Marowsky-Bree
2002-09-03 21:07           ` Peter T. Breuer
2002-09-03 21:15             ` Andreas Dilger
2002-09-03 21:15             ` Rik van Riel
2002-09-03 21:54             ` Anton Altaparmakov
2002-09-03 22:46               ` Andreas Dilger
2002-09-03 23:19               ` Daniel Phillips
2002-09-04  0:18                 ` Anton Altaparmakov
2002-09-04  5:23                 ` David Lang
2002-09-04  7:16             ` Helge Hafting
2002-09-04  8:39               ` Andreas Dilger
2002-09-04 12:07                 ` Helge Hafting
2002-09-04 13:03                   ` Hans Reiser
2002-09-04  8:41               ` Peter T. Breuer
2002-09-04  7:50             ` Joachim Breuer
     [not found]               ` <3D75F8B0.8C7E974E@aitel.hist.no>
2002-09-04 21:26                 ` Joachim Breuer
2002-09-04  9:26             ` Lars Marowsky-Bree
2002-09-03 16:58     ` Anton Altaparmakov
2002-09-03 17:26       ` Peter T. Breuer