Re: [PATCH] Fix vmtruncate race and distributed filesystem race

From: Andrea Arcangeli <andrea@suse.de>
To: Dave McCracken <dmccr@us.ibm.com>
Cc: Andrew Morton <akpm@digeo.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] Fix vmtruncate race and distributed filesystem race
Date: Fri, 20 Jun 2003 02:17:43 +0200
Message-ID: <20030620001743.GI18317@dualathlon.random>
In-Reply-To: <184910000.1055458610@baldur.austin.ibm.com>

Hi,

On Thu, Jun 12, 2003 at 05:56:50PM -0500, Dave McCracken wrote:
> --- 2.5.70-mm8/./mm/memory.c	2003-06-12 13:37:31.000000000 -0500
> +++ 2.5.70-mm8-trunc/./mm/memory.c	2003-06-12 17:51:55.000000000 -0500
> @@ -1138,6 +1138,8 @@ invalidate_mmap_range(struct address_spa
>  			hlen = ULONG_MAX - hba + 1;
>  	}
>  	down(&mapping->i_shared_sem);
> +	/* Protect against page fault */
> +	atomic_inc(&mapping->truncate_count);
>  	if (unlikely(!list_empty(&mapping->i_mmap)))
>  		invalidate_mmap_range_list(&mapping->i_mmap, hba, hlen);
>  	if (unlikely(!list_empty(&mapping->i_mmap_shared)))
> @@ -1390,8 +1392,10 @@ do_no_page(struct mm_struct *mm, struct 
>  	unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
>  {
>  	struct page * new_page;
> +	struct address_space *mapping;
>  	pte_t entry;
>  	struct pte_chain *pte_chain;
> +	unsigned sequence;
>  	int ret;
>  
>  	if (!vma->vm_ops || !vma->vm_ops->nopage)
> @@ -1400,6 +1404,9 @@ do_no_page(struct mm_struct *mm, struct 
>  	pte_unmap(page_table);
>  	spin_unlock(&mm->page_table_lock);
>  
> +	mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
> +retry:
> +	sequence = atomic_read(&mapping->truncate_count);
>  	new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0);
>  
>  	/* no page was available -- either SIGBUS or OOM */
> @@ -1428,6 +1435,16 @@ do_no_page(struct mm_struct *mm, struct 
>  	}
>  
>  	spin_lock(&mm->page_table_lock);
> +	/*
> +	 * For a file-backed vma, someone could have truncated or otherwise
> +	 * invalidated this page.  If invalidate_mmap_range got called,
> +	 * retry getting the page.
> +	 */
> +	if (unlikely(sequence != atomic_read(&mapping->truncate_count))) {
> +		spin_unlock(&mm->page_table_lock);
> +		page_cache_release(new_page);
> +		goto retry;
> +	}
>  	page_table = pte_offset_map(pmd, address);

Maybe I'm missing something silly, but this fixes nothing IMHO. It's not
a coincidence that I used the seq_lock (aka frlock in 2.4-aa) in my fix:
a single counter increment isn't nearly enough. You definitely need
_both_ an entry and an exit point in do_truncate, or you'll never know
if vmtruncate has been running under you. The first increment is like
the down_read, the second increment is like the up_read. Both are
necessary to trap any vmtruncate during the do_no_page.

Your patch traps this timing case:

	CPU 0			CPU 1
	----------		-----------
				do_no_page
	truncate
				read counter

	increment counter
	vmtruncate
				->nopage
				read counter again -> different so retry


but you can't trap this with a single counter increment in do_truncate:

	CPU 0			CPU 1
	----------		-----------
				do_no_page
	truncate
	increment counter
				read counter
				->nopage
	vmtruncate
				read counter again -> same, so no retry

Thanks to the second counter increment after vmtruncate in my fix, the
above race can't happen.
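
To make the second timeline concrete, here is a minimal single-threaded
sketch in userspace C that replays it step by step. It is an
illustration only, not the code from either patch: mapping_sim,
single_counter_detects and seq_counter_detects are hypothetical names,
and the seqlock-style check shown (retry on an odd first sample or on
any change) is an assumption about how an entry/exit pair would
typically be read.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-mapping counter; not a kernel type. */
struct mapping_sim {
	unsigned count;
};

/*
 * Single increment on the truncate side, replayed in the order of the
 * second timeline. Returns true if the fault side would retry.
 */
static bool single_counter_detects(void)
{
	struct mapping_sim m = { 0 };
	unsigned sequence;

	m.count++;                  /* CPU 0: increment counter */
	sequence = m.count;         /* CPU 1: read counter */
	/* CPU 1: ->nopage finds the page */
	/* CPU 0: vmtruncate drops the page, counter untouched */
	return sequence != m.count; /* CPU 1: read counter again */
}

/*
 * Entry and exit increments around the truncate, same ordering.
 * Returns true if the fault side would retry.
 */
static bool seq_counter_detects(void)
{
	struct mapping_sim m = { 0 };
	unsigned sequence;

	m.count++;                  /* CPU 0: entry increment, count is odd */
	sequence = m.count;         /* CPU 1: read counter */
	/* CPU 1: ->nopage finds the page; CPU 0: vmtruncate drops it */
	m.count++;                  /* CPU 0: exit increment, count is even */
	/* CPU 1: an odd first sample or any change means retry */
	return (sequence & 1) || sequence != m.count;
}
```

With these definitions, single_counter_detects() returns false (the
truncate goes unnoticed) while seq_counter_detects() returns true.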

About the down(&inode->i_sem); up() that you dropped at Andrew's
suggestion: while that may be ugly, it has a chance to save cpu, and
since that goto is a slow path, it's definitely worthwhile to keep it
IMHO. Otherwise one cpu will keep retrying in a loop until truncate
returns, and that can take time since truncate may have to do I/O or
wait on some I/O semaphore. It wouldn't be DoSable, because the
ret-from-exception will check need_resched, but it's still bad for cpu
utilization, and such a waste can be avoided trivially as in my fix.
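
A rough userspace sketch of that down()/up() idea, using a pthread mutex
in place of i_sem. inode_sim and wait_for_truncate are hypothetical
names, and the sketch assumes the truncating thread holds the lock for
the whole truncate:

```c
#include <pthread.h>

/* Hypothetical stand-in for the inode; i_sem models inode->i_sem. */
struct inode_sim {
	pthread_mutex_t i_sem;
};

/*
 * Called on the retry path after the counter check fails. Taking and
 * immediately releasing the lock puts the caller to sleep until the
 * truncating thread (assumed to hold i_sem throughout) is done, so the
 * retry does not burn cpu in the meantime.
 */
static void wait_for_truncate(struct inode_sim *inode)
{
	pthread_mutex_lock(&inode->i_sem);
	pthread_mutex_unlock(&inode->i_sem);
}
```

The lock is not held across the retry itself; it only serves as a place
to block until the truncate has finished.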

I was chatting with Daniel about those hooks a few minutes ago, and he
suggested making do_no_page a callback itself (instead of having
do_no_page call into a further ->nopage callback), and providing by
default a generic implementation equivalent to the current do_no_page.
As far as I can tell, that will save both the new function pointer for
the DSM hook (which IMHO has to be taken twice, both before and after
->nopage, not only before it, for the reason explained above) and the
->nopage hook itself. So maybe that could be a cleaner solution that
avoids the DSM hooks entirely, so we have fewer hooks rather than more,
plus a library call. This sounds best for performance and flexibility.
(Talking only about 2.5 of course; 2.4 I think is just fine with my
"production" 8) fix here:

	http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc8aa1/2.4.21rc8aa1/9999_truncate-nopage-race-1

)
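
One way to picture the callback idea is the userspace sketch below. It
is only a guess at the shape, not a proposed kernel interface: every
name in it (vm_ops_sim, vma_sim, generic_do_no_page, dsm_do_no_page,
handle_no_page) is hypothetical, and the counters exist purely so the
call order can be observed.

```c
#include <stddef.h>

struct vma_sim;

/* Hypothetical ops table: the whole no-page path is one callback. */
struct vm_ops_sim {
	int (*do_no_page)(struct vma_sim *vma, unsigned long address);
};

struct vma_sim {
	const struct vm_ops_sim *ops;
	int generic_calls;   /* times the generic path ran */
	int dsm_pre;         /* times the DSM pre-step ran */
	int dsm_post;        /* times the DSM post-step ran */
};

/* Library default, standing in for the current do_no_page body. */
static int generic_do_no_page(struct vma_sim *vma, unsigned long address)
{
	(void)address;
	vma->generic_calls++;
	return 0;
}

/* A DSM-style filesystem wraps the library call on both sides. */
static int dsm_do_no_page(struct vma_sim *vma, unsigned long address)
{
	int ret;

	vma->dsm_pre++;                          /* work before the lookup */
	ret = generic_do_no_page(vma, address);
	vma->dsm_post++;                         /* work after the lookup */
	return ret;
}

/* Fault path: use the vma's callback if set, else the library default. */
static int handle_no_page(struct vma_sim *vma, unsigned long address)
{
	if (vma->ops && vma->ops->do_no_page)
		return vma->ops->do_no_page(vma, address);
	return generic_do_no_page(vma, address);
}
```

In this shape an ordinary filesystem sets no callback and gets the
generic path, while a DSM filesystem supplies one function that does its
own work before and after the library call, with no extra hook pointers.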

Andrea

