From: Lars Ellenberg <lars.ellenberg@linbit.com>
To: Linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: umount times: 30 sec (open;close;unlink) vs 0.06 sec (open;unlink;close)
Date: Thu, 28 Jan 2016 14:56:05 +0100	[thread overview]
Message-ID: <20160128135605.GB32614@soda.linbit> (raw)


We had a report that after "some runtime" (days),
the time to umount some file system became "huge" (minutes),
even though the file system was nearly "empty" (a few thousand files).

Umount after only some "short" run time was sub-second.

Analysis and workaround (for this use case) follow.

The file system in question is mostly used as a "spool" area,
so basically lots of
	echo $stuff > $tempfile
	process $tempfile
	rm $tempfile

Investigation shows that this creates huge amounts of negative entries
in the dentry cache. There is no memory pressure, the directory is not
removed either, so they stay around.

Reproducer in shell:
    while true; do
        F=$RANDOM
        touch $F
        rm $F
    done
and then
watch 'cat /proc/sys/fs/dentry-state ;
        slabtop -o | grep dentry ;
        grep ^SReclaimable /proc/meminfo'

(Obviously in C, perl or python,
you can get orders of magnitude more iterations per second.)
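A minimal Python version of the same loop, for reference (file names and
iteration counts here are placeholders; the point is only the order of
operations, close before unlink, same as the shell loop above):

```python
import os

def churn_close_then_unlink(directory, iterations):
    """touch $F; rm $F -- last close happens before the unlink."""
    for i in range(iterations):
        path = os.path.join(directory, "tmp-%d" % i)
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
        os.close(fd)     # close first ...
        os.unlink(path)  # ... then unlink: the now-negative dentry stays cached
```
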

So this accumulates unused negative dentries quickly,
and after some time, given enough RAM, we have gigabytes worth
of dentry cache, but no inodes used.

Umount of that empty spool file system takes 30 seconds.
It will take minutes, if you let it run even longer.
In real-life, after days of real load, umounting the spool
file system (with ~30 GB of accumulated dentry cache, but only a few
thousand remaining inodes) took minutes, and produced soft lockups
"BUG: soft lockup - CPU... stuck for XYZ seconds".

The Workaround:
---------------

The interesting part is that this (almost) identical reproducer
behaves completely differently:
    while true; do
        F=$RANDOM
        touch $F
        rm $F <$F       #### mind the redirection ####
    done
(unlink before last close)

This does *not* accumulate negative dentries at all,
which is how I had expected the other case to behave as well.
If we look at vfs_unlink(), there is a d_delete(dentry) in there.
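The workaround variant as a Python sketch (same placeholder names as
before); the only change is swapping the order of unlink() and close():

```python
import os

def churn_unlink_then_close(directory, iterations):
    """touch $F; rm $F <$F -- unlink while the file is still open."""
    for i in range(iterations):
        path = os.path.join(directory, "tmp-%d" % i)
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
        os.unlink(path)  # unlink first: vfs_unlink()'s d_delete() sees the
                         # dentry while the file is still open
        os.close(fd)     # last close; no stale negative dentry accumulates
```
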

Total dentries vs. seconds of runtime (with a python reproducer).
The upper, linearly increasing dots are "open;close;unlink";
the flat dots are "open;unlink;close".
Mind the log scale on the "total dentries" y-axis.

         ++-----+-------+------+------+-------+------+------+-------+-----++
         ++     +       +      +      +       +    ....................   ++
         ++                         ................                      ++
         ++               ...........                                     ++
   1e+07 +++       ........                                              +++
         ++    .....                                                      ++
         ++  ...                                                          ++
         ++...                                                            ++
         +..                                                              ++
   1e+06 +.+                                                             +++
         ..                                                               ++
         .+                                                               ++
         ...............................................................  ++
         .+                                                               ++
  100000 +++                                                             +++
         ++                                                               ++
         ++                                                               ++
         ++                                                               ++
         ++     +       +      +      +       +      +      +       +     ++
   10000 +-+----+-------+------+------+-------+------+------+-------+----+-+
         0    100     200    300    400     500    600    700     800    900


time umount after 15 minutes and 50 million iterations,
so about 50 million dentries: 30 seconds.

time umount after 15 minutes and same number of iterations,
but no "stale negative dentries": 60 *milli* seconds.

So what I suggest is to fix the "touch $F; rm $F" case to
have the same effect as the "touch $F; rm $F <$F" case:
drop the corresponding dentry.

Would some VFS people kindly look into this,
or, in case it "works as designed", explain why,
and point out what I am missing?

Pretty please? :-)



Also,
obviously anyone can produce negative dentries
by just stat()ing non-existing file names.
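For instance (a sketch; any failed lookup will do, the file names below
are arbitrary):

```python
import os

def probe_missing(directory, iterations):
    """Each failed lookup can leave a negative dentry in the cache."""
    misses = 0
    for i in range(iterations):
        try:
            os.stat(os.path.join(directory, "no-such-file-%d" % i))
        except FileNotFoundError:
            misses += 1
    return misses
```
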
This comes up repeatedly (I found posts going
back ~15 years and more) in various contexts.

Once the dentry_cache grows beyond what is a reasonable working
set size for the cache levels, inserting new (negative) dentries
costs about as much as it was supposed to save...

I think it would be useful to introduce separate tunables
for positive and negative dentry "expiry", or a limit on the
number of negative dentries (perhaps relative to the total), to avoid
such a massive buildup (several gigabytes of slab) of dentry caches
on a mostly empty file system, without the indirection via global
(or cgroup) memory pressure.

Maybe instead of always allocating and inserting a new dentry,
prefer to recycle some existing negative dentry that has
not been looked at in a long time.

Thanks,

   Lars

