From: Matthew Wilcox <willy@infradead.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Josef Bacik <josef@toxicpanda.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: drop mmap_sem before calling balance_dirty_pages() in write fault
Date: Tue, 24 Sep 2019 13:46:08 -0700
Message-ID: <20190924204608.GI1855@bombadil.infradead.org>
In-Reply-To: <20190924194238.GA29030@cmpxchg.org>

On Tue, Sep 24, 2019 at 03:42:38PM -0400, Johannes Weiner wrote:
> > I'm not a fan of moving file_update_time() to _before_ the
> > balance_dirty_pages call.
>
> Can you elaborate why? If the filesystem has a page_mkwrite op, it
> will have already called file_update_time() before this function is
> entered. If anything, this change makes the sequence more consistent.

Oh, that makes sense. I thought it should be updated after all the data
was written, but it probably doesn't make much difference.

> > Also, this is now the third place that needs
> > maybe_unlock_mmap_for_io, see
> > https://lore.kernel.org/linux-mm/20190917120852.x6x3aypwvh573kfa@box/
>
> Good idea, I moved the helper to internal.h and converted to it.
>
> I left the shmem site alone, though. It doesn't require the file
> pinning, so it shouldn't pointlessly bump the file refcount and
> suggest such a dependency - that could cost somebody later quite a bit
> of time trying to understand the code.

The problem for shmem is this:

	spin_unlock(&inode->i_lock);
	schedule();
	spin_lock(&inode->i_lock);
	finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
	spin_unlock(&inode->i_lock);

While scheduled, the VMA can go away and the inode be reclaimed, making
this a use-after-free. The initial suggestion was an increment on
the inode refcount, but since we already have a pattern which involves
pinning the file, I thought that was a better way to go.
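
To illustrate why taking a pin before sleeping closes that window, here
is a minimal userspace sketch. It is not kernel code: struct pinned_obj,
obj_get(), obj_put() and sleep_with_pin() are hypothetical names standing
in for the file, its get/put helpers, and the wait in the fault path. The
refcount is a plain int and "freeing" is only recorded in a flag, so the
outcome can be inspected.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object such as a struct file. */
struct pinned_obj {
	int refcount;	/* plain int: single-threaded illustration only */
	bool freed;	/* set instead of calling free() so callers can check it */
};

/* Take an extra reference; only legal while the object is still alive. */
static void obj_get(struct pinned_obj *obj)
{
	assert(!obj->freed);
	obj->refcount++;
}

/* Drop a reference; the last put marks the object as freed. */
static void obj_put(struct pinned_obj *obj)
{
	assert(obj->refcount > 0);
	if (--obj->refcount == 0)
		obj->freed = true;
}

/*
 * Model the sleep in the fault path. The obj_put() in the middle stands in
 * for the VMA going away while the task is scheduled out. Returns true if
 * the object is still alive at the point where the task would wake up and
 * touch it again.
 */
static bool sleep_with_pin(struct pinned_obj *obj, bool take_pin)
{
	bool alive;

	if (take_pin)
		obj_get(obj);	/* pin before dropping locks and sleeping */

	obj_put(obj);		/* simulated teardown while we sleep */

	alive = !obj->freed;	/* would touching the object be safe now? */

	if (take_pin)
		obj_put(obj);	/* drop our pin once we are done */

	return alive;
}
```

Starting from a refcount of 1, sleep_with_pin(obj, true) reports the
object as alive at wakeup and frees it only on the final put, while
sleep_with_pin(obj, false) reports it as already gone.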

> From: Johannes Weiner <jweiner@fb.com>
> Date: Wed, 8 May 2019 13:53:38 -0700
> Subject: [PATCH v2] mm: drop mmap_sem before calling balance_dirty_pages()
> in write fault
>
> One of our services is observing hanging ps/top/etc under heavy write
> IO, and the task states show this is an mmap_sem priority inversion:
>
> A write fault is holding the mmap_sem in read-mode and waiting for
> (heavily cgroup-limited) IO in balance_dirty_pages():
>
> [<0>] balance_dirty_pages+0x724/0x905
> [<0>] balance_dirty_pages_ratelimited+0x254/0x390
> [<0>] fault_dirty_shared_page.isra.96+0x4a/0x90
> [<0>] do_wp_page+0x33e/0x400
> [<0>] __handle_mm_fault+0x6f0/0xfa0
> [<0>] handle_mm_fault+0xe4/0x200
> [<0>] __do_page_fault+0x22b/0x4a0
> [<0>] page_fault+0x45/0x50
> [<0>] 0xffffffffffffffff
>
> Somebody tries to change the address space, contending for the
> mmap_sem in write-mode:
>
> [<0>] call_rwsem_down_write_failed_killable+0x13/0x20
> [<0>] do_mprotect_pkey+0xa8/0x330
> [<0>] SyS_mprotect+0xf/0x20
> [<0>] do_syscall_64+0x5b/0x100
> [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [<0>] 0xffffffffffffffff
>
> The waiting writer locks out all subsequent readers to avoid lock
> starvation, and several threads can be seen hanging like this:
>
> [<0>] call_rwsem_down_read_failed+0x14/0x30
> [<0>] proc_pid_cmdline_read+0xa0/0x480
> [<0>] __vfs_read+0x23/0x140
> [<0>] vfs_read+0x87/0x130
> [<0>] SyS_read+0x42/0x90
> [<0>] do_syscall_64+0x5b/0x100
> [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [<0>] 0xffffffffffffffff
>
> To fix this, do what we do for cache read faults already: drop the
> mmap_sem before calling into anything IO bound, in this case the
> balance_dirty_pages() function, and return VM_FAULT_RETRY.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
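
For readers following along, the shape of the fix described in the
changelog (drop the lock before the IO-bound call, then ask the caller to
retry) can be sketched in userspace roughly as follows. The diff itself
is not quoted in this message, so this is not the patch's code: struct
mm_model, throttle_writeback(), write_fault() and the enum values are
hypothetical stand-ins for mmap_sem, balance_dirty_pages(), the fault
handler and VM_FAULT_RETRY.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for "fault handled" and VM_FAULT_RETRY. */
enum fault_result { FAULT_DONE, FAULT_RETRY };

struct mm_model {
	int readers;			/* models read holders of mmap_sem */
	bool throttled_while_locked;	/* set if throttling ran under the lock */
};

/* Stand-in for balance_dirty_pages(): in real life this may block on IO. */
static void throttle_writeback(struct mm_model *mm)
{
	if (mm->readers > 0)
		mm->throttled_while_locked = true;
}

/*
 * Model of a write fault. With allow_retry set, the lock is dropped before
 * throttling and the caller is told to retry the fault; otherwise the
 * throttling happens with the lock still held, which is the situation the
 * changelog describes as blocking writers and, behind them, new readers.
 */
static enum fault_result write_fault(struct mm_model *mm, bool allow_retry)
{
	mm->readers++;			/* lock taken in read mode */

	/* ... the page would be dirtied here ... */

	if (allow_retry) {
		mm->readers--;		/* drop the lock first */
		throttle_writeback(mm);	/* then do the IO-bound wait */
		return FAULT_RETRY;	/* caller retakes the lock and retries */
	}

	throttle_writeback(mm);		/* IO-bound wait under the lock */
	mm->readers--;
	return FAULT_DONE;
}
```

In this model, write_fault(&mm, true) returns FAULT_RETRY without ever
throttling under the lock, whereas write_fault(&mm, false) completes the
fault but sets throttled_while_locked.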