From: Matthew Wilcox <willy@infradead.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Josef Bacik <josef@toxicpanda.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: drop mmap_sem before calling balance_dirty_pages() in write fault
Date: Tue, 24 Sep 2019 13:46:08 -0700
Message-ID: <20190924204608.GI1855@bombadil.infradead.org>
In-Reply-To: <20190924194238.GA29030@cmpxchg.org>

On Tue, Sep 24, 2019 at 03:42:38PM -0400, Johannes Weiner wrote:
> > I'm not a fan of moving file_update_time() to _before_ the
> > balance_dirty_pages call.
> 
> Can you elaborate why? If the filesystem has a page_mkwrite op, it
> will have already called file_update_time() before this function is
> entered. If anything, this change makes the sequence more consistent.

Oh, that makes sense.  I thought it should be updated after all the data
was written, but it probably doesn't make much difference.

> > Also, this is now the third place that needs
> > maybe_unlock_mmap_for_io, see
> > https://lore.kernel.org/linux-mm/20190917120852.x6x3aypwvh573kfa@box/
> 
> Good idea, I moved the helper to internal.h and converted to it.
> 
> I left the shmem site alone, though. It doesn't require the file
> pinning, so it shouldn't pointlessly bump the file refcount and
> suggest such a dependency - that could cost somebody later quite a bit
> of time trying to understand the code.

The problem for shmem is this:

                        spin_unlock(&inode->i_lock);
                        schedule();

                        spin_lock(&inode->i_lock);
                        finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
                        spin_unlock(&inode->i_lock);

While the task sleeps in schedule(), the VMA can go away and the inode
can be reclaimed, making the later accesses to inode->i_lock a
use-after-free.  The initial suggestion was to take a reference on the
inode, but since we already have a pattern which involves pinning the
file, I thought that was a better way to go.
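
To illustrate the pinning idea outside the kernel, here is a minimal
userspace sketch.  It is not the shmem code and not the helper from the
patch: struct pinned_obj, obj_get(), obj_put() and wait_with_pin() are
hypothetical names, the refcount is a plain int (single-threaded model
only), and "freeing" just sets a flag so the outcome can be observed.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object such as a struct file. */
struct pinned_obj {
	int refcount;	/* plain int: single-threaded illustration only */
	bool freed;	/* set instead of calling free() so callers can check it */
};

static void obj_get(struct pinned_obj *o)
{
	o->refcount++;
}

static void obj_put(struct pinned_obj *o)
{
	if (--o->refcount == 0)
		o->freed = true;	/* the last reference is gone */
}

/* Models another context dropping its reference while we sleep. */
static void owner_drops_ref(struct pinned_obj *o)
{
	obj_put(o);
}

/*
 * Pin the object, "sleep", then touch it again.  while_asleep() stands
 * in for schedule(): anything may happen to other references there.
 * Returns true if the object was still alive after the sleep.
 */
static bool wait_with_pin(struct pinned_obj *o,
			  void (*while_asleep)(struct pinned_obj *))
{
	bool alive;

	obj_get(o);		/* pin before giving up what kept o alive */
	while_asleep(o);	/* the owner may drop its reference here */
	alive = !o->freed;	/* safe: our own reference is still held */
	obj_put(o);		/* unpin; this may be the final reference */
	return alive;
}
```

Starting from a refcount of 1, wait_with_pin(&o, owner_drops_ref) finds
the object alive after the sleep, and the object is only marked freed by
the final obj_put().  Without the obj_get(), the owner's put would have
freed it while the waiter still intended to use it.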

> From: Johannes Weiner <jweiner@fb.com>
> Date: Wed, 8 May 2019 13:53:38 -0700
> Subject: [PATCH v2] mm: drop mmap_sem before calling balance_dirty_pages()
>  in write fault
> 
> One of our services is observing hanging ps/top/etc under heavy write
> IO, and the task states show this is an mmap_sem priority inversion:
> 
> A write fault is holding the mmap_sem in read-mode and waiting for
> (heavily cgroup-limited) IO in balance_dirty_pages():
> 
> [<0>] balance_dirty_pages+0x724/0x905
> [<0>] balance_dirty_pages_ratelimited+0x254/0x390
> [<0>] fault_dirty_shared_page.isra.96+0x4a/0x90
> [<0>] do_wp_page+0x33e/0x400
> [<0>] __handle_mm_fault+0x6f0/0xfa0
> [<0>] handle_mm_fault+0xe4/0x200
> [<0>] __do_page_fault+0x22b/0x4a0
> [<0>] page_fault+0x45/0x50
> [<0>] 0xffffffffffffffff
> 
> Somebody tries to change the address space, contending for the
> mmap_sem in write-mode:
> 
> [<0>] call_rwsem_down_write_failed_killable+0x13/0x20
> [<0>] do_mprotect_pkey+0xa8/0x330
> [<0>] SyS_mprotect+0xf/0x20
> [<0>] do_syscall_64+0x5b/0x100
> [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [<0>] 0xffffffffffffffff
> 
> The waiting writer locks out all subsequent readers to avoid lock
> starvation, and several threads can be seen hanging like this:
> 
> [<0>] call_rwsem_down_read_failed+0x14/0x30
> [<0>] proc_pid_cmdline_read+0xa0/0x480
> [<0>] __vfs_read+0x23/0x140
> [<0>] vfs_read+0x87/0x130
> [<0>] SyS_read+0x42/0x90
> [<0>] do_syscall_64+0x5b/0x100
> [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [<0>] 0xffffffffffffffff
> 
> To fix this, do what we do for cache read faults already: drop the
> mmap_sem before calling into anything IO bound, in this case the
> balance_dirty_pages() function, and return VM_FAULT_RETRY.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>


