From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Nick Piggin <npiggin@suse.de>
Cc: Linux Memory Management <linux-mm@kvack.org>,
	Neil Brown <neilb@suse.de>, Anton Altaparmakov <aia21@cam.ac.uk>,
	Chris Mason <chris.mason@oracle.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@osdl.org>
Subject: Re: [patch 6/6] mm: fix pagecache write deadlocks
Date: Sun, 15 Oct 2006 13:37:10 +0200
Message-ID: <1160912230.5230.23.camel@lappy>
In-Reply-To: <20061013143616.15438.77140.sendpatchset@linux.site>

On Fri, 2006-10-13 at 18:44 +0200, Andrew Morton wrote:
> The idea is to modify the core write() code so that it won't take a pagefault
> while holding a lock on the pagecache page. There are a number of different
> deadlocks possible if we try to do such a thing:
> 
> 1.  generic_buffered_write
> 2.   lock_page
> 3.    prepare_write
> 4.     unlock_page+vmtruncate
> 5.     copy_from_user
> 6.      mmap_sem(r)
> 7.       handle_mm_fault
> 8.        lock_page (filemap_nopage)
> 9.    commit_write
> a.   unlock_page
> 
> b. sys_munmap / sys_mlock / others
> c.  mmap_sem(w)
> d.   make_pages_present
> e.    get_user_pages
> f.     handle_mm_fault
> g.      lock_page (filemap_nopage)
> 
> 2,8	- recursive deadlock if page is same
> 2,8;2,7	- ABBA deadlock if page is different

2,8;2,8 I think you mean

> 2,6;c,g	- ABBA deadlock if page is same
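
To make the ordering argument concrete, here is a minimal userspace sketch
of the nesting rule the cases above rely on (mmap_sem nests outside the page
lock). The enum and bad_nesting() are hypothetical names made up for this
illustration; nothing like them exists in the patch or in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes for illustration, outermost first. */
enum lock_class { MMAP_SEM = 0, PAGE_LOCK = 1 };

/*
 * Returns true if taking `want` while already holding `held` breaks the
 * rule "mmap_sem nests outside the page lock". The model also flags
 * re-taking a lock of the same class while one is held.
 */
static bool bad_nesting(enum lock_class held, enum lock_class want)
{
	return want <= held;
}
```

With this model, path 2,6 (page lock held, then mmap_sem) and path 2,8
(page lock held, then another page lock) are both flagged, while path c,g
(mmap_sem held, then page lock) is the allowed order.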

> +
> +		/*
> +		 * Must not enter the pagefault handler here, because we hold
> +		 * the page lock, so we might recursively deadlock on the same
> +		 * lock, or get an ABBA deadlock against a different lock, or
> +		 * against the mmap_sem (which nests outside the page lock).
> +		 * So increment preempt count, and use _atomic usercopies.
> +		 */
> +		inc_preempt_count();
>  		if (likely(nr_segs == 1))
> -			copied = filemap_copy_from_user(page, offset,
> +			copied = filemap_copy_from_user_atomic(page, offset,
>  							buf, bytes);
>  		else
> -			copied = filemap_copy_from_user_iovec(page, offset,
> -						cur_iov, iov_offset, bytes);
> +			copied = filemap_copy_from_user_iovec_atomic(page,
> +						offset, cur_iov, iov_offset,
> +						bytes);
> +		dec_preempt_count();
> +

Why use raw {inc,dec}_preempt_count() and not
preempt_{disable,enable}()? Is the compiler barrier not needed here? And
do we really want to avoid the preempt_check_resched()?
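
For reference, a userspace sketch of the difference I am asking about. It is
modeled on my recollection of the 2.6-era include/linux/preempt.h, so treat
it as an assumption rather than a quote; every model_* name is hypothetical
and the reschedule is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>

static int  model_preempt_count;  /* stands in for the per-task preempt count */
static bool model_need_resched;   /* stands in for TIF_NEED_RESCHED */
static int  model_resched_calls;  /* how often we would have rescheduled */

/* Compiler barrier only: forbids reordering memory accesses across it. */
#define model_barrier() __asm__ __volatile__("" ::: "memory")

static void model_inc_preempt_count(void) { model_preempt_count++; }
static void model_dec_preempt_count(void) { model_preempt_count--; }

static void model_preempt_disable(void)
{
	model_inc_preempt_count();
	model_barrier();
}

static void model_preempt_enable(void)
{
	model_barrier();
	model_dec_preempt_count();
	model_barrier();
	/* The resched check that a raw dec_preempt_count() skips. */
	if (model_need_resched && model_preempt_count == 0) {
		model_need_resched = false;
		model_resched_calls++;
	}
}
```

In this model a raw inc/dec pair leaves a pending reschedule untouched and
has no barrier around the protected region, whereas the disable/enable pair
provides both.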

> Index: linux-2.6/mm/filemap.h
> ===================================================================
> --- linux-2.6.orig/mm/filemap.h
> +++ linux-2.6/mm/filemap.h
> @@ -22,19 +22,19 @@ __filemap_copy_from_user_iovec_inatomic(
>  
>  /*
>   * Copy as much as we can into the page and return the number of bytes which
> - * were sucessfully copied.  If a fault is encountered then clear the page
> - * out to (offset+bytes) and return the number of bytes which were copied.
> + * were sucessfully copied.  If a fault is encountered then return the number of
> + * bytes which were copied.
>   *
> - * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache
> - * to *NOT* zero any tail of the buffer that it failed to copy.  If it does,
> - * and if the following non-atomic copy succeeds, then there is a small window
> - * where the target page contains neither the data before the write, nor the
> - * data after the write (it contains zero).  A read at this time will see
> - * data that is inconsistent with any ordering of the read and the write.
> - * (This has been detected in practice).
> + * NOTE: For this to work reliably we really want
> + * copy_from_user_inatomic_nocache to *NOT* zero any tail of the buffer that it
> + * failed to copy.  If it does, and if the following non-atomic copy succeeds,
> + * then there is a small window where the target page contains neither the data
> + * before the write, nor the data after the write (it contains zero).  A read at
> + * this time will see data that is inconsistent with any ordering of the read
> + * and the write.  (This has been detected in practice).
>   */
>  static inline size_t
> -filemap_copy_from_user(struct page *page, unsigned long offset,
> +filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
>  			const char __user *buf, unsigned bytes)
>  {
>  	char *kaddr;
> @@ -44,23 +44,32 @@ filemap_copy_from_user(struct page *page
>  	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
>  	kunmap_atomic(kaddr, KM_USER0);
>  
> -	if (left != 0) {
> -		/* Do it the slow way */
> -		kaddr = kmap(page);
> -		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
> -		kunmap(page);
> -	}
> +	return bytes - left;
> +}
> +
> +static inline size_t
> +filemap_copy_from_user_nonatomic(struct page *page, unsigned long offset,
> +			const char __user *buf, unsigned bytes)
> +{
> +	char *kaddr;
> +	int left;
> +
> +	kaddr = kmap(page);
> +	left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
> +	kunmap(page);
> +
>  	return bytes - left;
>  }
>  
>  /*
> - * This has the same sideeffects and return value as filemap_copy_from_user().
> + * This has the same sideeffects and return value as
> + * filemap_copy_from_user_atomic().
>   * The difference is that on a fault we need to memset the remainder of the
>   * page (out to offset+bytes), to emulate filemap_copy_from_user()'s
>   * single-segment behaviour.
>   */
>  static inline size_t
> -filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
> +filemap_copy_from_user_iovec_atomic(struct page *page, unsigned long offset,
>  			const struct iovec *iov, size_t base, size_t bytes)
>  {
>  	char *kaddr;
> @@ -70,14 +79,27 @@ filemap_copy_from_user_iovec(struct page
>  	copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
>  							 base, bytes);
>  	kunmap_atomic(kaddr, KM_USER0);
> -	if (copied != bytes) {
> -		kaddr = kmap(page);
> -		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
> -								 base, bytes);
> -		if (bytes - copied)
> -			memset(kaddr + offset + copied, 0, bytes - copied);
> -		kunmap(page);
> -	}
> +	return copied;
> +}
> +
> +/*
> + * This has the same sideeffects and return value as
> + * filemap_copy_from_user_nonatomic().
> + * The difference is that on a fault we need to memset the remainder of the
> + * page (out to offset+bytes), to emulate filemap_copy_from_user_nonatomic()'s
> + * single-segment behaviour.
> + */
> +static inline size_t
> +filemap_copy_from_user_iovec_nonatomic(struct page *page, unsigned long offset,
> +			const struct iovec *iov, size_t base, size_t bytes)
> +{
> +	char *kaddr;
> +	size_t copied;
> +
> +	kaddr = kmap(page);
> +	copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
> +							 base, bytes);
> +	kunmap(page);
>  	return copied;
>  }
>  

Why create the _nonatomic versions? There are no users.





