From: Khazhismel Kumykov <khazhy@google.com>
To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	viro@zeniv.linux.org.uk, ebiederm@xmission.com,
	"miklos@szeredi.hu" <miklos@szeredi.hu>,
	akpm@linux-foundation.org
Subject: Re: Hang/soft lockup in d_invalidate with simultaneous calls
Date: Thu, 25 May 2017 15:31:23 -0700	[thread overview]
Message-ID: <CACGdZY+i5yFhz3c8rhdcyz_FQS7hdrAa-dv-u+Rgwu-0oNeCcQ@mail.gmail.com> (raw)
In-Reply-To: <CACGdZYJe6CSVmapnmc5sgUkL=OFsComLFZmhmrS_ifMLpz1+oA@mail.gmail.com>

On Mon, May 22, 2017 at 11:18 AM, Khazhismel Kumykov <khazhy@google.com> wrote:
> On Wed, May 17, 2017 at 2:58 PM Khazhismel Kumykov <khazhy@google.com> wrote:
>>
>> On Mon, May 15, 2017 at 5:05 PM, Khazhismel Kumykov <khazhy@google.com> wrote:
>> > Hi,
>> >
>> > I'm seeing a problem in d_invalidate: if multiple threads call d_invalidate
>> > on the same tree at the same time, the runtime blows up and all the calls
>> > hang, given a large enough tree and enough simultaneous callers. (e.g. a
>> > directory with 100k entries on d_subdirs, and 5 or so threads calling
>> > d_invalidate, was enough to hang my test VMs)
>> >
>> > This seems to be due to thrashing on the dentry locks: multiple threads
>> > tightly loop calling d_walk while waiting for a shrink_dentry_list call
>> > (which also contends for the same locks) to complete. (d_invalidate keeps
>> > calling d_walk so long as any dentry in the tree is on a dcache shrink list.)
>> >
>> > There was a similar report recently, "soft lockup in d_invalidate()", which
>> > proposed having the d_walk in d_invalidate ignore dentries already marked as
>> > being on a shrink list. That does prevent the hang/lockup in this case as
>> > well, although I'm not sure it doesn't violate the contract for d_invalidate
>> > (the entries on a shrink list should be dropped eventually, but not
>> > necessarily by the time d_invalidate returns).
>> >
>> > If someone more familiar with the dcache could provide input, I would
>> > appreciate it.
>> >
>> > A reliable repro on mainline is:
>> >  - create a mountpoint with DCACHE_OP_REVALIDATE, e.g. a fuse passthrough
>> >  - create a directory and populate it with ~100k files with content, to
>> >    populate the dcache
>> >  - create some background processes opening/reading files in this directory
>> >    (5 processes running "while true; do cat $file; done" was enough to get
>> >    an indefinite hang for me)
>> >  - cause the directory to need to be invalidated (e.g., make_bad_inode on
>> >    the directory)
>> >
>> > This results in the background processes all entering d_invalidate and hanging,
>> > while with just one process in d_invalidate (e.g., stat'ing a file in the dir)
>> > things go pretty quickly as expected.
>> >
>> >
>> > (The proposed patch from other thread)
>> >
>> > diff --git a/fs/dcache.c b/fs/dcache.c
>> > index 7b8feb6..3a3b0f37 100644
>> > --- a/fs/dcache.c
>> > +++ b/fs/dcache.c
>> > @@ -1364,7 +1364,7 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
>> >                  goto out;
>> >
>> >          if (dentry->d_flags & DCACHE_SHRINK_LIST) {
>> > -               data->found++;
>> > +               goto out;
>> >          } else {
>> >                  if (dentry->d_flags & DCACHE_LRU_LIST)
>> >                          d_lru_del(dentry);
>> >
>> >
>> > khazhy
>>
>> Would this change actually violate any guarantees? Before
>> fe91522a7ba82ca1a51b07e19954b3825e4aaa22, select_collect looks like it used
>> to ignore unused dentries that were already on a shrink list ("found" would
>> not be incremented). Once a dentry is on a shrink list, isn't it unreachable
>> anyway, so that returning from d_invalidate before all the shrink lists
>> finish processing would be OK?
>>
>> khazhy
>
> Pinging in case this got missed, would appreciate thoughts from someone more
> familiar with the dcache.
>
> My previous email wasn't completely correct: while before fe91522a7ba8
> ("don't remove from shrink list in select_collect()") d_invalidate did not
> busy-wait for other workers calling shrink_dentry_list to complete, it
> returned -EBUSY rather than success, so a change to return success without
> waiting would not be equivalent to the old behavior. Currently, we loop
> calling d_walk repeatedly, causing the extreme slowdown observed.
>
> I still want to understand this better: if d_invalidate can safely return
> without pruning in-use dentries, would returning before unused dentries are
> pruned also be safe, so long as we know they will be pruned by another task
> (i.e. they are on a shrink list)?
> If not, would it make more sense to have mutual exclusion on calling
> d_invalidate on the same dentries simultaneously?
>
> khazhy

ping

Again, to summarize:
With a directory that has a large number of children (i.e. a dentry with a
large number of dentries on d_subdirs), simultaneous calls to d_invalidate on
that dentry take a very long time. As an example, for a directory with 160k
files where a single d_invalidate call took ~1.4s, having 6 tasks call
d_invalidate simultaneously took ~6.5 hours to resolve, about 10000x longer.

dir/
  dir/file-0
  ...
  dir/file-160000

This seems to be due to contention between d_walk and shrink_dentry_list,
and in particular to d_invalidate tightly looping on d_walk so long as any
dentry it finds is on a shrink list. With this particular directory structure,
d_walk holds "dir"'s d_lock while walking over d_subdirs, while
shrink_dentry_list releases and re-grabs "dir"'s d_lock on every iteration,
each time going to the back of the queue for it.
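
For reference, the retry loop looks roughly like the sketch below. This is a
simplified, from-memory paraphrase of the shrink_dcache_parent() /
select_collect() pair in fs/dcache.c (d_invalidate goes through its own
wrappers around select_collect, but the loop has the same shape); it is
illustrative only, not a verbatim copy of mainline:

static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
{
        struct select_data *data = _data;

        if (data->start == dentry)
                return D_WALK_CONTINUE;

        if (dentry->d_flags & DCACHE_SHRINK_LIST) {
                /* Already on some other task's shrink list: still counted. */
                data->found++;
        } else if (!dentry->d_lockref.count) {
                /* Unused and unclaimed: move it onto our dispose list. */
                d_shrink_add(dentry, &data->dispose);
                data->found++;
        }
        return D_WALK_CONTINUE;
}

void shrink_dcache_parent(struct dentry *parent)
{
        for (;;) {
                struct select_data data;

                INIT_LIST_HEAD(&data.dispose);
                data.start = parent;
                data.found = 0;

                d_walk(parent, &data, select_collect, NULL);
                if (!data.found)
                        break;

                /*
                 * If everything found was on other tasks' shrink lists,
                 * data.dispose is empty and this call does no work, yet
                 * found != 0, so we immediately d_walk() the whole subtree
                 * again.
                 */
                shrink_dentry_list(&data.dispose);
                cond_resched();
        }
}

The key point is that a dentry another task has already claimed still bumps
found, so a second caller keeps re-walking the entire subtree (re-taking
"dir"'s d_lock for the full d_subdirs scan each time) while having nothing of
its own to dispose of.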

One proposed solution is, in select_collect, to not increment the count of
found unused descendants when we encounter a dentry that is already on a
shrink list (the patch quoted above). This would let d_invalidate and
shrink_dcache_parent return before all unused children have necessarily been
pruned; they will be pruned soon by the other task that is currently
shrinking, but I am not sure whether that is OK.

A quick and dirty fix would be adding a mutex around the shrink loops, so
simultaneous tasks don't thrash each other; a rough sketch follows.
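
To make that concrete, something along these lines is what I have in mind.
This is only an illustrative sketch: the global dcache_shrink_mutex name and
its placement around shrink_dcache_parent() are placeholders (something
per-superblock, or only on the d_invalidate path, may be more sensible):

/* Illustrative only: serialize concurrent shrinkers so they stop
 * thrashing each other's d_lock acquisition. A single global mutex is
 * the bluntest possible form of this. */
static DEFINE_MUTEX(dcache_shrink_mutex);

void shrink_dcache_parent(struct dentry *parent)
{
        mutex_lock(&dcache_shrink_mutex);
        for (;;) {
                struct select_data data;

                INIT_LIST_HEAD(&data.dispose);
                data.start = parent;
                data.found = 0;

                d_walk(parent, &data, select_collect, NULL);
                if (!data.found)
                        break;

                shrink_dentry_list(&data.dispose);
                cond_resched();
        }
        mutex_unlock(&dcache_shrink_mutex);
}

The obvious downside is that unrelated shrinks of different directories (or
different filesystems) would serialize as well.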

There are probably other solutions here as well; I would appreciate a look
from someone more familiar with this area.

khazhy
