Re: inode caching - Benny Halevy

From: Benny Halevy <bhalevy@panasas.com>
To: Timo Sirainen <tss@iki.fi>
Cc: Peter Staubach <staubach@redhat.com>, linux-nfs@vger.kernel.org
Subject: Re: inode caching
Date: Wed, 28 May 2008 08:38:29 +0300
Message-ID: <483CEFD5.8050507@panasas.com>
In-Reply-To: <0BF144BC-6CBB-49FA-8F49-D765FB58AF5E@iki.fi>

On May. 27, 2008, 22:13 +0300, Timo Sirainen <tss@iki.fi> wrote:
> On May 27, 2008, at 9:09 PM, Peter Staubach wrote:
> 
>>>>> So what I'd want to know is:
>>>>>
>>>>> a) Why does this happen only sometimes? I can't really figure out from
>>>>> the code what invalidates the fd1 inode. Apparently the second open()
>>>>> somehow, but since it uses the new "foo" file with a different struct
>>>>> inode, where does the old struct inode get invalidated?
>>>>>
>>>> This will happen always, but you may see occasional successful
>>>> fstat() calls on the client due to attribute caching and/or
>>>> dentry caching.
>>>>
>>> I would understand if it always failed or always succeeded, but it seems
>>> to be somewhat random now. And it's not "occasional successful fstat()",
>>> but it's "occasional failed fstat()". The difference shouldn't be
>>> because of attribute caching, because I specify it explicitly to two
>>> seconds and run the test within those 2 seconds. So the test should always
>>> hit the attribute cache, and according to you that should always cause
>>> it to succeed (but it rarely does). I think dentry caching also more or
>>> less depends on attribute cache timeout?
>>>
>> How did you specify the attribute cache to be 2 seconds?
> 
> mount -o actimeo=2
> 
>>>>> b) Can this be fixed? Or is it just luck that it works as well as it
>>>>> does now?
>>>>>
>>>> This can be fixed, somewhat. I have some changes to address the
>>>> ESTALE situation in system calls that take filename as arguments,
>>>> but I need to work with some more people to get them included.
>>>> The system calls which do not take file names as arguments cannot
>>>> be recovered from because the file they are referring to is really
>>>> gone or at least not accessible anymore.
>>>>
>>>> The reuse of the inode number is just a fact of life and the way
>>>> that file systems work. I would suggest rethinking your application
>>>> in order to reduce or eliminate any dependence that it might have.
>>>>
>>> The problem I have is that I need to reliably find out if a file has
>>> been replaced with a new file. So I first flush the dentry cache
>>> (chowning parent directory), stat() the file and fstat() the opened
>>> file. If fstat() fails with ESTALE or if the inodes don't match, I know
>>> that the file has been replaced and I need to re-open and re-read it.
>>> This seems to work nearly always.
>>>
>> This would seem to be quite implementation specific and also has
>> some timing dependencies built-in.  These would seem to me to be
>> dangerous assumptions and heuristics to be depending upon.
>>
>> Have you considered making the contents of the file itself versioned
>> in some fashion and thus, removing dependencies on how the NFS client
>> works and/or the file system on the NFS server?
> 
> I guess one possibility would be to link() the file elsewhere for "a
> while", so that the inode wouldn't get reused until everyone's
> attribute caches have been flushed. That feels like a bit of a dirty
> solution too, though. (This is about handling Dovecot IMAP/POP3's
> metadata files.)

The NFS (v2/v3) server can't guarantee you traditional Unix semantics
where the inode is kept around until last close.  Hard linking it to keep
it around is the cleanest way you can go IMO.
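
A minimal sketch of the hard-link approach, assuming the writer replaces
the file by writing a temporary file and rename()ing it into place. The
function name and the ".hold" suffix are made up for illustration, and
deciding when it is safe to drop the hold link (after everyone's
attribute cache timeout) is left to the caller:

```python
import os
import tempfile


def replace_keeping_old_inode(path, new_data, hold_suffix=".hold"):
    """Replace `path` with `new_data`, keeping a hard link to the old inode.

    Returns the path of the hold link, or None if `path` did not exist.
    The ".hold" suffix is a hypothetical naming choice, not a convention.
    """
    hold_path = path + hold_suffix

    # Drop a hold link left over from an earlier replacement, if any.
    try:
        os.unlink(hold_path)
    except FileNotFoundError:
        pass

    # Pin the current inode so the server cannot free and reuse it
    # as soon as the name "path" is replaced.
    if os.path.exists(path):
        os.link(path, hold_path)
    else:
        hold_path = None

    # Write the new contents next to the target, then rename into place.
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(new_data)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp_path, path)
    except BaseException:
        # The temporary file is only useful if the rename succeeded.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return hold_path
```

The caller would unlink() the returned hold path once it is reasonably
sure no client still has the old file open or cached.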

> 
> I'd still like to understand why exactly this happens though. Maybe  
> there's a chance that this is just a bug in the current NFS  
> implementation so I could keep using my current code (which is  
> actually very difficult to break even with stress testing, so if this  
> doesn't get fixed on kernel side I'll probably just leave my code as  
> it is). I guess I'll start debugging the NFS code to find out what's  
> really going on.

My guess would be that the new incarnation of the inode generates the
same filehandle as the old one, not just the same inode number.
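
For reference, the check described earlier in the thread can be sketched
roughly as follows. This is only an illustration with a made-up function
name, and it leaves out the dentry cache flush (chowning the parent
directory). If the guess above is right and the filehandle itself is
reused, fstat() would not return ESTALE and the inode numbers would
match, so a check like this could not see the replacement:

```python
import errno
import os


def file_was_replaced(path, fd):
    """Return True if `path` no longer refers to the file open on `fd`."""
    try:
        st_fd = os.fstat(fd)
    except OSError as e:
        # On NFS, ESTALE means the server no longer knows the filehandle.
        if e.errno == errno.ESTALE:
            return True
        raise

    try:
        st_path = os.stat(path)
    except FileNotFoundError:
        # The name is gone entirely, so the open file is not current.
        return True

    # Same device and inode number is taken to mean "same file". This is
    # exactly the assumption that inode (or filehandle) reuse breaks.
    return (st_fd.st_dev, st_fd.st_ino) != (st_path.st_dev, st_path.st_ino)
```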

Benny
