linux-fsdevel.vger.kernel.org archive mirror
From: Jan Kara <jack@suse.cz>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Josef Bacik <josef@toxicpanda.com>,
	kernel-team@fb.com, hannes@cmpxchg.org,
	linux-kernel@vger.kernel.org, tj@kernel.org, david@fromorbit.com,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	riel@redhat.com, jack@suse.cz
Subject: Re: [PATCH 3/3] filemap: drop the mmap_sem for all blocking operations
Date: Wed, 12 Dec 2018 11:36:11 +0100
Message-ID: <20181212103611.GC10902@quack2.suse.cz>
In-Reply-To: <20181211131519.8d9e91eac049f16dad7c2d1f@linux-foundation.org>

On Tue 11-12-18 13:15:19, Andrew Morton wrote:
> On Tue, 11 Dec 2018 12:38:01 -0500 Josef Bacik <josef@toxicpanda.com> wrote:
> 
> > Currently we only drop the mmap_sem if there is contention on the page
> > lock.  The idea is that we issue readahead and then go to lock the page
> > while it is under IO and we want to not hold the mmap_sem during the IO.
> > 
> > The problem with this is the assumption that the readahead does
> > anything.  In the case that the box is under extreme memory or IO
> > pressure we may end up not reading anything at all for readahead, which
> > means we will end up reading in the page under the mmap_sem.
> > 
> > Even if the readahead does something, it could get throttled because of
> > IO pressure on the system and the process is in a lower priority cgroup.
> > 
> > Holding the mmap_sem while doing IO is problematic because it can cause
> > system-wide priority inversions.  Consider some large company that does
> > a lot of web traffic.  This large company has load balancing logic in
> > its core web server, because some engineer thought this was a brilliant
> > plan.  This load balancing logic gets statistics from /proc about the
> > system, which trip over processes' mmap_sem for various reasons.  Now the
> > web server application is in a protected cgroup, but these other
> > processes may not be, and if they are being throttled while their
> > mmap_sem is held we'll stall, and cause this nice death spiral.
> > 
> > Instead rework filemap fault path to drop the mmap sem at any point that
> > we may do IO or block for an extended period of time.  This includes
> > while issuing readahead, locking the page, or needing to call ->readpage
> > because readahead did not occur.  Then once we have a fully uptodate
> > page we can return with VM_FAULT_RETRY and come back again to find our
> > nicely in-cache page that was gotten outside of the mmap_sem.
> > 
> > This patch also adds a new helper for locking the page with the mmap_sem
> > dropped.  This doesn't make sense currently as generally speaking if the
> > page is already locked it'll have been read in (unless there was an
> > error) before it was unlocked.  However a forthcoming patchset will
> > change this with the ability to abort read-ahead bio's if necessary,
> > making it more likely that we could contend for a page lock and still
> > have a not uptodate page.  This allows us to deal with this case by
> > grabbing the lock and issuing the IO without the mmap_sem held, and then
> > returning VM_FAULT_RETRY to come back around.
> > 
> > ...
...
> > @@ -2397,6 +2451,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> >  {
> >  	int error;
> >  	struct file *file = vmf->vma->vm_file;
> > +	struct file *fpin = NULL;
> >  	struct address_space *mapping = file->f_mapping;
> >  	struct file_ra_state *ra = &file->f_ra;
> >  	struct inode *inode = mapping->host;
> > @@ -2418,10 +2473,10 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> >  		 * We found the page, so try async readahead before
> >  		 * waiting for the lock.
> >  		 */
> > -		do_async_mmap_readahead(vmf, page);
> > +		fpin = do_async_mmap_readahead(vmf, page);
> >  	} else if (!page) {
> >  		/* No page in the page cache at all */
> > -		do_sync_mmap_readahead(vmf);
> > +		fpin = do_sync_mmap_readahead(vmf);
> >  		count_vm_event(PGMAJFAULT);
> >  		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
> >  		ret = VM_FAULT_MAJOR;
> > @@ -2433,7 +2488,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> >  			return vmf_error(-ENOMEM);
> 
> hm, how does this work.  We might have taken a ref on the file and that
> ref is recorded in fpin but an error here causes us to lose track of
> that elevated refcount?

Yeah, that looks like a bug to me as well.

> >  	}
> >  
> > -	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
> > +	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin)) {
> >  		put_page(page);
> >  		return ret | VM_FAULT_RETRY;
> >  	}

The same problem can occur here. Generally, if we went through 'goto
retry_find', we may already hold a file reference, but some exit paths
don't drop it properly...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Thread overview: 10+ messages
2018-12-11 17:37 [PATCH 0/3][V5] drop the mmap_sem when doing IO in the fault path Josef Bacik
2018-12-11 17:37 ` [PATCH 1/3] filemap: kill page_cache_read usage in filemap_fault Josef Bacik
2018-12-11 17:38 ` [PATCH 2/3] filemap: pass vm_fault to the mmap ra helpers Josef Bacik
2018-12-12 10:10   ` Jan Kara
2018-12-11 17:38 ` [PATCH 3/3] filemap: drop the mmap_sem for all blocking operations Josef Bacik
2018-12-11 21:15   ` Andrew Morton
2018-12-12 10:36     ` Jan Kara [this message]
2018-12-12 15:27   ` [PATCH][v6] " Josef Bacik
2018-12-12 23:55     ` Andrew Morton
2018-12-13 16:01       ` Jan Kara
