Re: [PATCH 1/2] shmem: fix faulting into a hole, not taking i_mutex

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Vlastimil Babka <vbabka@suse.cz>
To: Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Sasha Levin <sasha.levin@oracle.com>,
	Konstantin Khlebnikov <koct9i@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michel Lespinasse <walken@google.com>,
	Lukas Czerner <lczerner@redhat.com>,
	Dave Jones <davej@redhat.com>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] shmem: fix faulting into a hole, not taking i_mutex
Date: Tue, 15 Jul 2014 18:07:04 +0200	[thread overview]
Message-ID: <53C551A8.2040400@suse.cz> (raw)
In-Reply-To: <alpine.LSU.2.11.1407150329250.2584@eggly.anvils>

On 07/15/2014 12:31 PM, Hugh Dickins wrote:
> f00cdc6df7d7 ("shmem: fix faulting into a hole while it's punched") was
> buggy: Sasha sent a lockdep report to remind us that grabbing i_mutex in
> the fault path is a no-no (write syscall may already hold i_mutex while
> faulting user buffer).
>
> We tried a completely different approach (see following patch) but that
> proved inadequate: good enough for a rational workload, but not good
> enough against trinity - which forks off so many mappings of the object
> that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
> into serious starvation when concurrent faults force the puncher to fall
> back to single-page unmap_mapping_range() searches of the i_mmap tree.
>
> So return to the original umbrella approach, but keep away from i_mutex
> this time.  We really don't want to bloat every shmem inode with a new
> mutex or completion, just to protect this unlikely case from trinity.
> So extend the original with wait_queue_head on stack at the hole-punch
> end, and wait_queue item on the stack at the fault end.

Hi, thanks a lot, I will definitely test it soon, although my reproducer 
is rather limited - it already works fine with the current kernel. 
Trinity will be more useful here. But there's something that caught my 
eye so I though I would raise the concern now.

> @@ -760,7 +760,7 @@ static int shmem_writepage(struct page *
>   			spin_lock(&inode->i_lock);
>   			shmem_falloc = inode->i_private;

Without ACCESS_ONCE, can shmem_falloc potentially become an alias on 
inode->i_private and later become re-read outside of the lock?

>   			if (shmem_falloc &&
> -			    !shmem_falloc->mode &&
> +			    !shmem_falloc->waitq &&
>   			    index >= shmem_falloc->start &&
>   			    index < shmem_falloc->next)
>   				shmem_falloc->nr_unswapped++;
> @@ -1248,38 +1248,58 @@ static int shmem_fault(struct vm_area_st
>   	 * Trinity finds that probing a hole which tmpfs is punching can
>   	 * prevent the hole-punch from ever completing: which in turn
>   	 * locks writers out with its hold on i_mutex.  So refrain from
> -	 * faulting pages into the hole while it's being punched, and
> -	 * wait on i_mutex to be released if vmf->flags permits.
> +	 * faulting pages into the hole while it's being punched.  Although
> +	 * shmem_undo_range() does remove the additions, it may be unable to
> +	 * keep up, as each new page needs its own unmap_mapping_range() call,
> +	 * and the i_mmap tree grows ever slower to scan if new vmas are added.
> +	 *
> +	 * It does not matter if we sometimes reach this check just before the
> +	 * hole-punch begins, so that one fault then races with the punch:
> +	 * we just need to make racing faults a rare case.
> +	 *
> +	 * The implementation below would be much simpler if we just used a
> +	 * standard mutex or completion: but we cannot take i_mutex in fault,
> +	 * and bloating every shmem inode for this unlikely case would be sad.
>   	 */
>   	if (unlikely(inode->i_private)) {
>   		struct shmem_falloc *shmem_falloc;
>
>   		spin_lock(&inode->i_lock);
>   		shmem_falloc = inode->i_private;

Same here.

> -		if (!shmem_falloc ||
> -		    shmem_falloc->mode != FALLOC_FL_PUNCH_HOLE ||
> -		    vmf->pgoff < shmem_falloc->start ||
> -		    vmf->pgoff >= shmem_falloc->next)
> -			shmem_falloc = NULL;
> -		spin_unlock(&inode->i_lock);
> -		/*
> -		 * i_lock has protected us from taking shmem_falloc seriously
> -		 * once return from shmem_fallocate() went back up that stack.
> -		 * i_lock does not serialize with i_mutex at all, but it does
> -		 * not matter if sometimes we wait unnecessarily, or sometimes
> -		 * miss out on waiting: we just need to make those cases rare.
> -		 */
> -		if (shmem_falloc) {
> +		if (shmem_falloc &&
> +		    shmem_falloc->waitq &&

Here it's operating outside of lock.

> +		    vmf->pgoff >= shmem_falloc->start &&
> +		    vmf->pgoff < shmem_falloc->next) {
> +			wait_queue_head_t *shmem_falloc_waitq;
> +			DEFINE_WAIT(shmem_fault_wait);
> +
> +			ret = VM_FAULT_NOPAGE;
>   			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
>   			   !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
> +				/* It's polite to up mmap_sem if we can */
>   				up_read(&vma->vm_mm->mmap_sem);
> -				mutex_lock(&inode->i_mutex);
> -				mutex_unlock(&inode->i_mutex);
> -				return VM_FAULT_RETRY;
> +				ret = VM_FAULT_RETRY;
>   			}
> -			/* cond_resched? Leave that to GUP or return to user */
> -			return VM_FAULT_NOPAGE;
> +
> +			shmem_falloc_waitq = shmem_falloc->waitq;
> +			prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
> +					TASK_KILLABLE);
> +			spin_unlock(&inode->i_lock);
> +			schedule();
> +
> +			/*
> +			 * shmem_falloc_waitq points into the shmem_fallocate()
> +			 * stack of the hole-punching task: shmem_falloc_waitq
> +			 * is usually invalid by the time we reach here, but
> +			 * finish_wait() does not dereference it in that case;
> +			 * though i_lock needed lest racing with wake_up_all().
> +			 */
> +			spin_lock(&inode->i_lock);
> +			finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
> +			spin_unlock(&inode->i_lock);
> +			return ret;
>   		}
> +		spin_unlock(&inode->i_lock);
>   	}
>
>   	error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Vlastimil Babka <vbabka@suse.cz>
To: Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Sasha Levin <sasha.levin@oracle.com>,
	Konstantin Khlebnikov <koct9i@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michel Lespinasse <walken@google.com>,
	Lukas Czerner <lczerner@redhat.com>,
	Dave Jones <davej@redhat.com>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] shmem: fix faulting into a hole, not taking i_mutex
Date: Tue, 15 Jul 2014 18:07:04 +0200	[thread overview]
Message-ID: <53C551A8.2040400@suse.cz> (raw)
In-Reply-To: <alpine.LSU.2.11.1407150329250.2584@eggly.anvils>

On 07/15/2014 12:31 PM, Hugh Dickins wrote:
> f00cdc6df7d7 ("shmem: fix faulting into a hole while it's punched") was
> buggy: Sasha sent a lockdep report to remind us that grabbing i_mutex in
> the fault path is a no-no (write syscall may already hold i_mutex while
> faulting user buffer).
>
> We tried a completely different approach (see following patch) but that
> proved inadequate: good enough for a rational workload, but not good
> enough against trinity - which forks off so many mappings of the object
> that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
> into serious starvation when concurrent faults force the puncher to fall
> back to single-page unmap_mapping_range() searches of the i_mmap tree.
>
> So return to the original umbrella approach, but keep away from i_mutex
> this time.  We really don't want to bloat every shmem inode with a new
> mutex or completion, just to protect this unlikely case from trinity.
> So extend the original with wait_queue_head on stack at the hole-punch
> end, and wait_queue item on the stack at the fault end.

Hi, thanks a lot, I will definitely test it soon, although my reproducer 
is rather limited - it already works fine with the current kernel. 
Trinity will be more useful here. But there's something that caught my 
eye so I though I would raise the concern now.

> @@ -760,7 +760,7 @@ static int shmem_writepage(struct page *
>   			spin_lock(&inode->i_lock);
>   			shmem_falloc = inode->i_private;

Without ACCESS_ONCE, can shmem_falloc potentially become an alias on 
inode->i_private and later become re-read outside of the lock?

>   			if (shmem_falloc &&
> -			    !shmem_falloc->mode &&
> +			    !shmem_falloc->waitq &&
>   			    index >= shmem_falloc->start &&
>   			    index < shmem_falloc->next)
>   				shmem_falloc->nr_unswapped++;
> @@ -1248,38 +1248,58 @@ static int shmem_fault(struct vm_area_st
>   	 * Trinity finds that probing a hole which tmpfs is punching can
>   	 * prevent the hole-punch from ever completing: which in turn
>   	 * locks writers out with its hold on i_mutex.  So refrain from
> -	 * faulting pages into the hole while it's being punched, and
> -	 * wait on i_mutex to be released if vmf->flags permits.
> +	 * faulting pages into the hole while it's being punched.  Although
> +	 * shmem_undo_range() does remove the additions, it may be unable to
> +	 * keep up, as each new page needs its own unmap_mapping_range() call,
> +	 * and the i_mmap tree grows ever slower to scan if new vmas are added.
> +	 *
> +	 * It does not matter if we sometimes reach this check just before the
> +	 * hole-punch begins, so that one fault then races with the punch:
> +	 * we just need to make racing faults a rare case.
> +	 *
> +	 * The implementation below would be much simpler if we just used a
> +	 * standard mutex or completion: but we cannot take i_mutex in fault,
> +	 * and bloating every shmem inode for this unlikely case would be sad.
>   	 */
>   	if (unlikely(inode->i_private)) {
>   		struct shmem_falloc *shmem_falloc;
>
>   		spin_lock(&inode->i_lock);
>   		shmem_falloc = inode->i_private;

Same here.

> -		if (!shmem_falloc ||
> -		    shmem_falloc->mode != FALLOC_FL_PUNCH_HOLE ||
> -		    vmf->pgoff < shmem_falloc->start ||
> -		    vmf->pgoff >= shmem_falloc->next)
> -			shmem_falloc = NULL;
> -		spin_unlock(&inode->i_lock);
> -		/*
> -		 * i_lock has protected us from taking shmem_falloc seriously
> -		 * once return from shmem_fallocate() went back up that stack.
> -		 * i_lock does not serialize with i_mutex at all, but it does
> -		 * not matter if sometimes we wait unnecessarily, or sometimes
> -		 * miss out on waiting: we just need to make those cases rare.
> -		 */
> -		if (shmem_falloc) {
> +		if (shmem_falloc &&
> +		    shmem_falloc->waitq &&

Here it's operating outside of lock.

> +		    vmf->pgoff >= shmem_falloc->start &&
> +		    vmf->pgoff < shmem_falloc->next) {
> +			wait_queue_head_t *shmem_falloc_waitq;
> +			DEFINE_WAIT(shmem_fault_wait);
> +
> +			ret = VM_FAULT_NOPAGE;
>   			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
>   			   !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
> +				/* It's polite to up mmap_sem if we can */
>   				up_read(&vma->vm_mm->mmap_sem);
> -				mutex_lock(&inode->i_mutex);
> -				mutex_unlock(&inode->i_mutex);
> -				return VM_FAULT_RETRY;
> +				ret = VM_FAULT_RETRY;
>   			}
> -			/* cond_resched? Leave that to GUP or return to user */
> -			return VM_FAULT_NOPAGE;
> +
> +			shmem_falloc_waitq = shmem_falloc->waitq;
> +			prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
> +					TASK_KILLABLE);
> +			spin_unlock(&inode->i_lock);
> +			schedule();
> +
> +			/*
> +			 * shmem_falloc_waitq points into the shmem_fallocate()
> +			 * stack of the hole-punching task: shmem_falloc_waitq
> +			 * is usually invalid by the time we reach here, but
> +			 * finish_wait() does not dereference it in that case;
> +			 * though i_lock needed lest racing with wake_up_all().
> +			 */
> +			spin_lock(&inode->i_lock);
> +			finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
> +			spin_unlock(&inode->i_lock);
> +			return ret;
>   		}
> +		spin_unlock(&inode->i_lock);
>   	}
>
>   	error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);

next prev parent reply	other threads:[~2014-07-15 16:07 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-07-15 10:28 [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3 Hugh Dickins
2014-07-15 10:28 ` Hugh Dickins
2014-07-15 10:31 ` [PATCH 1/2] shmem: fix faulting into a hole, not taking i_mutex Hugh Dickins
2014-07-15 10:31   ` Hugh Dickins
2014-07-15 16:07   ` Vlastimil Babka [this message]
2014-07-15 16:07     ` Vlastimil Babka
2014-07-15 19:26     ` Hugh Dickins
2014-07-15 19:26       ` Hugh Dickins
2014-07-16  7:18       ` Vlastimil Babka
2014-07-16  7:18         ` Vlastimil Babka
2014-07-25 14:25   ` Michal Hocko
2014-07-25 14:25     ` Michal Hocko
2014-07-15 10:33 ` [PATCH 2/2] shmem: fix splicing from a hole while it's punched Hugh Dickins
2014-07-15 10:33   ` Hugh Dickins
2014-07-25 14:33   ` Michal Hocko
2014-07-25 14:33     ` Michal Hocko
2014-07-17 16:10 ` [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3 Vlastimil Babka
2014-07-17 16:10   ` Vlastimil Babka
2014-07-17 16:12   ` Sasha Levin
2014-07-17 16:12     ` Sasha Levin
2014-07-18 10:44     ` Sasha Levin
2014-07-18 10:44       ` Sasha Levin
2014-07-19 23:44       ` Hugh Dickins
2014-07-19 23:44         ` Hugh Dickins
2014-07-22  3:24         ` Sasha Levin
2014-07-22  3:24           ` Sasha Levin
2014-07-22  8:07           ` Hugh Dickins
2014-07-22  8:07             ` Hugh Dickins
2014-07-22 10:06             ` Vlastimil Babka
2014-07-22 10:06               ` Vlastimil Babka
2014-07-22 12:09               ` Vlastimil Babka
2014-07-22 12:09                 ` Vlastimil Babka
2014-07-22 18:42                 ` Hugh Dickins
2014-07-22 18:42                   ` Hugh Dickins
2014-07-22 23:19             ` Sasha Levin
2014-07-22 23:19               ` Sasha Levin
2014-07-22 23:58               ` Hugh Dickins
2014-07-22 23:58                 ` Hugh Dickins
2014-07-17 23:34   ` Hugh Dickins
2014-07-17 23:34     ` Hugh Dickins
2014-07-18  8:05     ` Vlastimil Babka
2014-07-18  8:05       ` Vlastimil Babka

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=53C551A8.2040400@suse.cz \
    --to=vbabka@suse.cz \
    --cc=akpm@linux-foundation.org \
    --cc=davej@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=koct9i@gmail.com \
    --cc=lczerner@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=sasha.levin@oracle.com \
    --cc=walken@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.