Why must NFS access metadata in synchronous mode?

linux-fsdevel.vger.kernel.org archive mirror

* Why must NFS access metadata in synchronous mode?
@ 2006-06-01  4:04 Xin Zhao
  2006-06-01  5:55 ` Trond Myklebust
  0 siblings, 1 reply; 7+ messages in thread
From: Xin Zhao @ 2006-06-01  4:04 UTC
  To: linux-kernel; +Cc: linux-fsdevel

As of kernel 2.6.16, I think NFS still accesses metadata synchronously,
which may hurt performance significantly. Several years ago, the paper
"Metadata Update Performance in File Systems" already suggested
accessing metadata asynchronously.

I am curious why NFS has not adopted this suggestion. Can someone explain?

Thanks!

Xin


* Re: Why must NFS access metadata in synchronous mode?
  2006-06-01  4:04 Why must NFS access metadata in synchronous mode? Xin Zhao
@ 2006-06-01  5:55 ` Trond Myklebust
  2006-06-01 16:27   ` Xin Zhao
  0 siblings, 1 reply; 7+ messages in thread
From: Trond Myklebust @ 2006-06-01  5:55 UTC
  To: Xin Zhao; +Cc: linux-kernel, linux-fsdevel

On Thu, 2006-06-01 at 00:04 -0400, Xin Zhao wrote:
> Until kernel 2.6.16, I think NFS still access metadata synchronously,
> which may impact performance significantly. Several years ago, paper
> "metadata update performance in file systems" already suggested using
> asynchronous mode in metadata access.

...and how many NFS implementations have you seen based on that paper?

> I am curious why NFS does not adopt this suggestion? Can someone explain this?

a) NFS permissions are checked by the _server_, not the client.

b) Cache consistency requirements are _much_ more stringent for
asynchronous operation. Think for instance about an asynchronous
mkdir(): how should the client guarantee exclusive semantics (i.e. that
mkdir either creates a new directory or returns an EEXIST error)? How
should it guarantee that the server will have enough disk space to
satisfy your request? How should it guarantee that nobody will change
the permissions on the parent directory before the metadata is synced
to disk? ...
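
As a rough illustration of the exclusive semantics in question (a sketch
that is not part of the original message, run against a local temporary
directory rather than an NFS mount; the helper name try_mkdir is
hypothetical), an application expects to learn at call time which of the
two outcomes it got:

```python
import os
import tempfile


def try_mkdir(path):
    """Return 'created' or 'EEXIST': the two outcomes the caller must be able to tell apart."""
    try:
        os.mkdir(path)
        return "created"
    except FileExistsError:  # Python's mapping of errno EEXIST
        return "EEXIST"


def demo():
    # Illustration only: a local temporary directory stands in for the export.
    with tempfile.TemporaryDirectory() as parent:
        target = os.path.join(parent, "newdir")
        return try_mkdir(target), try_mkdir(target)


print(demo())
```

A client that acknowledged the first call before the server had actually
decided it could not give the second caller a trustworthy answer.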

People are considering how to implement this sort of thing using the
NFSv4 concept of delegations and applying them to directories. It is not
yet obvious how all the details will be solved.

Trond


* Re: Why must NFS access metadata in synchronous mode?
  2006-06-01  5:55 ` Trond Myklebust
@ 2006-06-01 16:27   ` Xin Zhao
  2006-06-01 17:26     ` Trond Myklebust
  2006-06-01 21:40     ` Andreas Dilger
  0 siblings, 2 replies; 7+ messages in thread
From: Xin Zhao @ 2006-06-01 16:27 UTC
  To: Trond Myklebust; +Cc: linux-kernel, linux-fsdevel

Question 1: ...and how many NFS implementations have you seen based on
that paper?
I don't know. I have only read the NFS implementation distributed with
the Linux kernel. But one paper mentioned that the soft updates mechanism
suggested in that paper has been adopted by FreeBSD.

Question 2: NFS permissions are checked by the _server_, not the client.
That's true. But I was not saying that all metadata access must be
asynchronous. Even for permission checking, the speculative execution
mechanism proposed in Ed Nightingale's "speculative execution ...."
paper, published at SOSP 2005, can be used to avoid waiting. The basic
idea is that an NFS client speculatively assumes the permission check
returns "OK" and sets a checkpoint; the client can then go ahead and
send further requests. If the actual result turns out to be "OK", the
client discards the checkpoint; otherwise, it rolls back to the
checkpoint. This lets the waiting time overlap with the sending
time of subsequent requests.
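
The checkpoint-and-rollback idea can be sketched as a toy model. This is
only an illustration under stated assumptions: the class and method names
are hypothetical, the "server" is just a boolean passed in later, and
nothing here reflects the actual design in the paper or any NFS client.

```python
import copy


class SpeculativeClient:
    """Toy model: assume the server will say OK, keep a checkpoint, roll back on denial."""

    def __init__(self):
        self.state = {"files": []}
        self.checkpoint = None

    def speculate_create(self, name):
        # Save the state as it was before the speculative operation.
        self.checkpoint = copy.deepcopy(self.state)
        # Proceed as if the permission check had already succeeded.
        self.state["files"].append(name)

    def resolve(self, server_said_ok):
        # The real answer has arrived: keep the speculative state or undo it.
        if not server_said_ok:
            self.state = self.checkpoint
        self.checkpoint = None
        return server_said_ok


client = SpeculativeClient()
client.speculate_create("a.txt")
client.resolve(server_said_ok=True)    # speculation confirmed, "a.txt" stays
client.speculate_create("b.txt")
client.resolve(server_said_ok=False)   # speculation denied, "b.txt" is undone
print(client.state["files"])
```

The sketch only rolls back state private to the client; anything already
made visible outside the client could not be undone this way.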

Question 3: Cache consistency requirements are _much_ more stringent
for asynchronous operation.
I agree. But I am not sure how a local file system like Ext3 handles this
problem. I don't think Ext3 must write metadata synchronously (I will
double-check the ext3 code). If I remember correctly, when metadata
changes, Ext3 just changes it in memory and marks the page
dirty. The page is flushed to disk later. If the server
exports an Ext3 file system, it should be able to do the same thing. When a
client requests a metadata change, the server writes to the mmapped
metadata page and then returns to the client instead of having to sync the
change to disk. With this mechanism, at least the client does not have
to wait for the disk flush. Does that make sense? To prevent
interleaved changes to metadata before it is flushed to disk, the server
can even mark the metadata page read-only until it is flushed
to disk.

Xin


* Re: Why must NFS access metadata in synchronous mode?
  2006-06-01 16:27   ` Xin Zhao
@ 2006-06-01 17:26     ` Trond Myklebust
  2006-06-02  3:42       ` Can Sar
  2006-06-01 21:40     ` Andreas Dilger
  1 sibling, 1 reply; 7+ messages in thread
From: Trond Myklebust @ 2006-06-01 17:26 UTC
  To: Xin Zhao; +Cc: linux-kernel, linux-fsdevel

On Thu, 2006-06-01 at 12:27 -0400, Xin Zhao wrote:
> Question 1: ...and how many NFS implementations have you seen based on
> that paper?
> I don't know. I only read the NFS implementations distributed with
> Linux kernel. But some paper mentioned that the soft update mechanism
> suggested in that paper has been adopted by FreeBSD.

FreeBSD does not use soft updates for NFS afaik.

> Question 2: NFS permissions are checked by the _server_, not the client.
> That's true. But I was not saying that all metadata access must be
> asynchronous. Even for permission checking, speculative execution
> mechanism proposed in Ed Nightingale's "speculative execution ...."
> paper published in SOSP 2005 can be used to avoid waiting. The basic
> idea is that a NFS client speculatively assume permission checking
> returns "OK" and set a checkpoint, then the client can go ahead to
> send further requests. If the actual result turns out to be "OK", the
> client can discard the checkpoint, otherwise, it rolls back to the
> checking point. This can make waiting time overlap with the sending
> time of subsequent requests.

...and how does that help the user that has been told the operation
succeeded?

> Question 3: Cache consistency requirements are _much_ more stringent
> for asynchronous operation.
> I agree. But I am not sure how local file system like Ext3 handle this
> problem. I don't think Ext3 must synchronously write metadata (I will
> double check the ext3 code). If I remember correctly, when change
> metadata, Ext3 just change it in memory and mark this page to be
> dirty. The page will be flushed to disk afterward. If the server
> exports an Ext3 code, it should be able to do the same thing. When a
> client requests to change metadata, server writes to the mmaped
> metadata page and then return to client instead of having to sync the
> change to disk. With this mechanism, at least the client does not have
> to wait for the disk flush time. Does it make sense? To prevent
> interleave change on metadata before it is flushed to disk, the server
> can even mark the metadata page to be read-only before it is flushed
> to disk.

'man 5 exports'. Read _carefully_ the entry on the "async" export
option, and see the NFS FAQ, the NFS mailing list archives, etc. for why
it is a bad idea.


Cheers,
  Trond



* Re: Why must NFS access metadata in synchronous mode?
  2006-06-01 17:26     ` Trond Myklebust
@ 2006-06-02  3:42       ` Can Sar
  2006-06-02  4:06         ` Trond Myklebust
  0 siblings, 1 reply; 7+ messages in thread
From: Can Sar @ 2006-06-02  3:42 UTC
  To: Trond Myklebust; +Cc: Xin Zhao, linux-fsdevel


On Jun 1, 2006, at 10:26 AM, Trond Myklebust wrote:

> On Thu, 2006-06-01 at 12:27 -0400, Xin Zhao wrote:
>> Question 1: ...and how many NFS implementations have you seen based on
>> that paper?
>> I don't know. I only read the NFS implementations distributed with
>> Linux kernel. But some paper mentioned that the soft update mechanism
>> suggested in that paper has been adopted by FreeBSD.
>
> FreeBSD does not use soft updates for NFS afaik.
>
>> Question 2: NFS permissions are checked by the _server_, not the client.
>> That's true. But I was not saying that all metadata access must be
>> asynchronous. Even for permission checking, speculative execution
>> mechanism proposed in Ed Nightingale's "speculative execution ...."
>> paper published in SOSP 2005 can be used to avoid waiting. The basic
>> idea is that a NFS client speculatively assume permission checking
>> returns "OK" and set a checkpoint, then the client can go ahead to
>> send further requests. If the actual result turns out to be "OK", the
>> client can discard the checkpoint, otherwise, it rolls back to the
>> checking point. This can make waiting time overlap with the sending
>> time of subsequent requests.
>
> ...and how does that help the user that has been told the operation
> succeeded?

It wouldn't. For externally visible operations the kernel just waits  
until it actually has the right answer.
Contributions from that paper would definitely speed up NFS on Linux  
but they require extensive work, which is one of the reasons no one  
has actually implemented a production level version of that work yet  
(researchers generally move on to new papers instead).




* Re: Why must NFS access metadata in synchronous mode?
  2006-06-02  3:42       ` Can Sar
@ 2006-06-02  4:06         ` Trond Myklebust
  0 siblings, 0 replies; 7+ messages in thread
From: Trond Myklebust @ 2006-06-02  4:06 UTC
  To: Can Sar; +Cc: Xin Zhao, linux-fsdevel

On Thu, 2006-06-01 at 20:42 -0700, Can Sar wrote:
> On Jun 1, 2006, at 10:26 AM, Trond Myklebust wrote:
> >
> > ...and how does that help the user that has been told the operation
> > succeeded?
> 
> It wouldn't. For externally visible operations the kernel just waits  
> until it actually has the right answer.
> Contributions from that paper would definitely speed up NFS on Linux  
> but they require extensive work, which is one of the reasons no one  
> has actually implemented a production level version of that work yet  
> (researchers generally move on to new papers instead).

Performance needs to be weighed against application expectations (i.e.
POSIX correctness). If the application is told that an operation has
succeeded, then you had better make damned sure that the operation
_will_ succeed (and within finite time, please!).

As I said, we are working within the IETF on a model for asynchronous
operations, but as Andreas Dilger suggested, this does require a
stateful model in which the client can reliably conclude whether or not
an operation will succeed in the future. Such a model is understandably
complex to design, and so we're introducing the framework in a
step-by-step manner: NFSv4.1 will include directory read delegations
(which allow you to cache directory operations that do not modify the
directory until a conflict occurs). I hope we will get round to
completing a model for write delegations in NFSv4.2.
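
To make the idea of a directory read delegation concrete, here is a toy
model of "cache until a conflict occurs". It is a sketch under stated
assumptions, not the NFSv4.1 protocol: the class names, the recall()
callback, and the in-process "server" are all hypothetical.

```python
class DirServer:
    """Toy server: hands out read delegations and recalls them before any change."""

    def __init__(self):
        self.entries = set()
        self.delegates = []
        self.readdir_calls = 0  # counts round trips that reach the server

    def readdir_with_delegation(self, client):
        self.readdir_calls += 1
        if client not in self.delegates:
            self.delegates.append(client)
        return set(self.entries)

    def mkdir(self, name):
        if name in self.entries:
            raise FileExistsError(name)
        # Conflict: every delegation holder must drop its cache first.
        for holder in self.delegates:
            holder.recall()
        self.delegates.clear()
        self.entries.add(name)


class CachingClient:
    """Toy client: serves listings from cache while it holds a delegation."""

    def __init__(self, server):
        self.server = server
        self.cached = None

    def listdir(self):
        if self.cached is None:
            self.cached = self.server.readdir_with_delegation(self)
        return set(self.cached)

    def recall(self):
        self.cached = None


server = DirServer()
client = CachingClient(server)
client.listdir()
client.listdir()                  # answered from cache, no server round trip
server.mkdir("newdir")            # conflict: the delegation is recalled
print(client.listdir(), server.readdir_calls)
```

Only read-only caching is modelled here; letting the client apply
modifications locally is the harder write-delegation case.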

Trond


* Re: Why must NFS access metadata in synchronous mode?
  2006-06-01 16:27   ` Xin Zhao
  2006-06-01 17:26     ` Trond Myklebust
@ 2006-06-01 21:40     ` Andreas Dilger
  1 sibling, 0 replies; 7+ messages in thread
From: Andreas Dilger @ 2006-06-01 21:40 UTC
  To: Xin Zhao; +Cc: Trond Myklebust, linux-kernel, linux-fsdevel

On Jun 01, 2006  12:27 -0400, Xin Zhao wrote:
> > Question 3: Cache consistency requirements are _much_ more stringent
> > for asynchronous operation.
> I agree. But I am not sure how local file system like Ext3 handle this
> problem. I don't think Ext3 must synchronously write metadata (I will
> double check the ext3 code). If I remember correctly, when change
> metadata, Ext3 just change it in memory and mark this page to be
> dirty. The page will be flushed to disk afterward. If the server
> exports an Ext3 code, it should be able to do the same thing. When a
> client requests to change metadata, server writes to the mmaped
> metadata page and then return to client instead of having to sync the
> change to disk. With this mechanism, at least the client does not have
> to wait for the disk flush time. Does it make sense? To prevent
> interleave change on metadata before it is flushed to disk, the server
> can even mark the metadata page to be read-only before it is flushed
> to disk.

The difference between local filesystems and remote filesystems is that
if asynchronous operations on a local filesystem are lost due to node
failure, the application generally fails at the same time,
so when the node restarts the application, it gets the old state from disk.
If an application cares about on-disk consistency (say, because the application
is itself sharing state with a remote system, like sendmail), then it will
fsync the file(s) before updating the remote state.
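
The fsync-before-acknowledge pattern can be sketched as follows. This is a
minimal illustration, not sendmail's actual code: the function names and
the acknowledge callback are hypothetical stand-ins for "updating the
remote state".

```python
import os
import tempfile


def durable_write(path, data):
    """Write data and force it to stable storage before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to push the file to disk


def accept_message(spool_dir, msg_id, body, acknowledge):
    """Store a message durably, and only then tell the remote side it is safe."""
    path = os.path.join(spool_dir, msg_id)
    durable_write(path, body)
    acknowledge(msg_id)        # the remote state changes only after the fsync
    return path


acks = []
with tempfile.TemporaryDirectory() as spool:
    accept_message(spool, "msg1", b"hello", acks.append)
print(acks)
```

A stricter version would also fsync the containing directory so that the
new directory entry itself is durable.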

In the NFS case, a remote client keeps the "new" state, which is inconsistent
with what the server has on disk (if the server is running asynchronously), so
there is no way to reconcile this.  In the case of Lustre, the clients are
stateful and keep a record of all operations they do (until the server later
confirms that they are safe on disk).  After a server failure, the clients
replay uncommitted operations to the server.  This allows
the server filesystem to run asynchronously.
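
The replay scheme described above can be sketched as a toy model. The
assumptions are loud ones: a single client, a key/value "filesystem", and
sequence numbers piggybacked on replies. All names are hypothetical and
this is not Lustre's actual protocol.

```python
class ToyServer:
    """Applies operations in memory; commit() makes them durable; a crash loses the rest."""

    def __init__(self):
        self.disk, self.memory = {}, {}
        self.disk_seq = self.memory_seq = 0

    def apply(self, seq, key, value):
        self.memory[key] = value
        self.memory_seq = seq
        return self.disk_seq          # tell the client what is safe on disk

    def commit(self):
        self.disk = dict(self.memory)
        self.disk_seq = self.memory_seq

    def crash_and_restart(self):
        # Everything not yet committed is gone.
        self.memory = dict(self.disk)
        self.memory_seq = self.disk_seq


class ToyClient:
    """Keeps every operation in a log until the server confirms it is on disk."""

    def __init__(self, server):
        self.server = server
        self.next_seq = 1
        self.log = []

    def _trim(self, committed_seq):
        self.log = [op for op in self.log if op[0] > committed_seq]

    def set(self, key, value):
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        self.log.append((seq, key, value))
        self._trim(self.server.apply(seq, key, value))

    def recover(self):
        # After a server restart, replay whatever the server never committed.
        self._trim(self.server.disk_seq)
        for seq, key, value in self.log:
            self.server.apply(seq, key, value)


server = ToyServer()
client = ToyClient(server)
client.set("a", 1)
server.commit()
client.set("b", 2)
client.set("c", 3)
server.crash_and_restart()   # "b" and "c" are lost on the server
client.recover()             # ...and replayed from the client's log
print(server.memory)
```

The server can reply before its state is durable precisely because the
client, not the server, holds the record needed to rebuild it.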

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
