Linux NFS development
From: Chuck Lever III <chuck.lever@oracle.com>
To: "Andrew J. Romero" <romero@fnal.gov>
Cc: Jeff Layton <jlayton@kernel.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: Zombie / Orphan open files
Date: Tue, 31 Jan 2023 15:31:48 +0000
Message-ID: <DA731F17-4FF2-4013-B8F2-89461D72A14A@oracle.com>
In-Reply-To: <SA1PR09MB75528A7E45898F6A02EDF82EA7D09@SA1PR09MB7552.namprd09.prod.outlook.com>



> On Jan 31, 2023, at 9:42 AM, Andrew J. Romero <romero@fnal.gov> wrote:
> 
> 
> 
>> From: Jeff Layton <jlayton@kernel.org>
> 
>> What do you mean by "zombie / orphan" here? Do you mean files that have
>> been sillyrenamed [1] to ".nfsXXXXXXX" ? Or are you simply talking about
>> clients that are holding files open for a long time?
> 
> Hi Jeff
> 
> .... clients that are holding files open for a long time
> 
> Here's a complete summary:
> 
> On my NAS appliances , I noticed that average usage of the relevant memory pool
> never went down. I suspected some sort of "leak" or "file-stuck-open" scenario.
> 
> I hypothesized that if NFS-client to NFS-server communications were frequently disrupted,
> this would explain the memory-pool behavior I was seeing.
> I felt that Kerberos credential expiration was the most likely frequent disruptor.
> 
> I ran a simple python test script that (1) opened enough files that I could see an obvious jump
> in the relevant NAS memory pool metric, then (2) went to sleep for shorter than the
> Kerberos ticket lifetime, then (3) exited without explicitly closing the files.
> The result:  After the script exited,  usage of the relevant server-side memory pool decreased by
> the expected amount.
> 
> Then I ran a simple python test script that (1) opened enough files that I could see an obvious jump
> in the relevant NAS memory pool metric, then (2) went to sleep for longer than the
> Kerberos ticket lifetime, then (3) exited without explicitly closing the files.
> The result:  After the script exited,  usage of the relevant server-side memory pool did not decrease.
> ( the files opened by the script were permanently "stuck open" ... depleting the server-side pool resource)
> 
> In a large campus environment, usage of the relevant memory pool will eventually get so
> high that a server-side reboot will be needed.
> 
> I'm working with my NAS vendor ( who is very helpful ); however, if the NFS server and client specifications
> don't specify an official way to handle this very real problem, there is not much a NAS server vendor can safely / responsibly do.

Yes, there is: the NAS vendor can report the problem to the people
they get their server code from :-)


> If there currently is no formal/official way of handling this issue ( server-side pool exhaustion due to "disappearing" client )
> is this a problem worth solving ( at a level lower than the application level )?

Yes, this is IMO unwelcome behavior, and a real problem for
large-scale deployments, as you describe above.

But let's be careful: a "disappearing client" should already be
handled properly. Its lease will expire and the server will
eventually close out any OPEN state that client was responsible for.

If the client continues to renew its state, and the application
doesn't quit or close its files, neither the client nor the server
can easily tell that there is a problem.

Moreover, ticket expiry is not necessarily an indication that the
application is done with a file.
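
To illustrate the distinction drawn above, here is a toy Python
model of lease-based state cleanup. It is a sketch only, not NFS
server code: LeaseTable, ClientState, reap_expired, the client
names, and the 90-second lease are all hypothetical, chosen for
illustration. It shows that a client which stops renewing gets
its open state reaped, while a client that keeps renewing holds
its opens indefinitely.

```python
from dataclasses import dataclass, field


@dataclass
class ClientState:
    """Per-client state in this toy model (hypothetical, not an NFS structure)."""
    last_renewal: float
    open_files: set = field(default_factory=set)


class LeaseTable:
    """Toy model of server-side lease tracking."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.clients = {}

    def open(self, client_id, path, now):
        # An OPEN creates state for the client and also counts as a renewal.
        state = self.clients.setdefault(client_id, ClientState(last_renewal=now))
        state.last_renewal = now
        state.open_files.add(path)

    def renew(self, client_id, now):
        # A live client renews periodically, whether or not the
        # application still needs its files.
        if client_id in self.clients:
            self.clients[client_id].last_renewal = now

    def reap_expired(self, now):
        # Drop all state for clients whose lease has lapsed.
        expired = [
            cid for cid, state in self.clients.items()
            if now - state.last_renewal > self.lease_seconds
        ]
        for cid in expired:
            del self.clients[cid]
        return expired

    def open_count(self):
        return sum(len(s.open_files) for s in self.clients.values())


table = LeaseTable(lease_seconds=90)
table.open("gone", "/data/a", now=0)    # this client disappears after opening
table.open("alive", "/data/b", now=0)   # this client keeps renewing
table.renew("alive", now=60)
table.renew("alive", now=120)

expired = table.reap_expired(now=130)
print(expired)             # ['gone']
print(table.open_count())  # 1
```

In this model the "alive" client's open survives for as long as it
renews, which is the case where neither side can easily see a
problem.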


> If client applications were all well behaved ( didn't leave files open for long periods of time ) we wouldn't have a significant issue.
> Assuming applications aren't going to be well behaved, are there good general ways of solving this on either the client or server side ?

The server needs to manage its resource pools appropriately,
otherwise it is exposed to DoS or DDoS attacks. That will
improve over time, but I'm not seeing an immediate way to
fairly address this on the server side. As Jeff said, the
server is just doing what clients are asking of it.

The client side needs to clean up when it can, so we need to
explore that. Actually, that might be where you have a little
more immediate control of this situation. The applications
need to either re-authenticate or close files they no longer
need. I think you'd have this problem with long-lived
applications running on one big system as well.
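
As a minimal sketch of "close files they no longer need" at the
application level, assuming a Python application like the test
script described above: the helper below (read_first_lines is a
hypothetical name) opens a batch of files inside a
contextlib.ExitStack, so every descriptor is closed when the
block exits, even if an exception is raised partway through. The
assumption here is that the application can bound how long it
needs the files; when the close actually reaches the server is up
to the client implementation.

```python
from contextlib import ExitStack


def read_first_lines(paths):
    """Return the first line of each file, closing every file before returning."""
    first_lines = []
    with ExitStack() as stack:
        # Each open file is registered with the stack as soon as it is
        # opened, so a failure on a later open still closes earlier ones.
        handles = [stack.enter_context(open(path)) for path in paths]
        for handle in handles:
            first_lines.append(handle.readline().rstrip("\n"))
    # Leaving the with-block has closed every handle.
    assert all(handle.closed for handle in handles)
    return first_lines
```

The point of the design is that the files are released before the
application goes idle, rather than being held open across a long
sleep that may outlast the Kerberos ticket.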


--
Chuck Lever



