* [PATCH 1/7] swapin_readahead: excise NUMA bogosity
2007-10-06 20:35 [PATCH 0/7] swapin/shmem patches Hugh Dickins
@ 2007-10-06 20:38 ` Hugh Dickins
2007-10-06 22:43 ` Rik van Riel
` (2 more replies)
2007-10-06 20:39 ` [PATCH 2/7] swapin_readahead: move and rearrange args Hugh Dickins
` (5 subsequent siblings)
6 siblings, 3 replies; 27+ messages in thread
From: Hugh Dickins @ 2007-10-06 20:38 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andi Kleen, Christoph Lameter, Lee Schermerhorn, linux-mm
For three years swapin_readahead has been cluttered with fanciful
CONFIG_NUMA code, advancing addr, and stepping on to the next vma
at the boundary, to line up the mempolicy for each page allocation.
It _might_ be a good idea to allocate swap more according to vma
layout; but the fact is, that's not how we do it at all, 2.6 even
less than 2.4: swap is allocated as needed for pages as they sink
to the bottom of the inactive LRUs. Sometimes that may match vma
layout, but not so often that it's worth going to these misleading
vma->vm_next lengths: rip all that out.
Originally I intended to retain the incrementation of addr, but
correct its initial value: valid_swaphandles generally supplies
an offset below the target addr (this is readaround rather than
readahead), but addr has not been adjusted accordingly, so in the
interleave case it has usually been allocating the target page
from the "wrong" node (though that may not matter very much).
But look at the equivalent shmem_swapin code: either by oversight
or by design, though it has all the apparatus for choosing a new
mempolicy per page, it uses the same idx throughout, choosing the
same mempolicy and interleave node for each page of the cluster.
Which is actually a much better strategy: each node has its own
LRUs and its own kswapd, so if you're betting on any particular
relationship between swap and node, the best bet is that nearby
swap entries belong to pages from the same node - even when the
mempolicy of the target page is to interleave. And examining a
map of nodes corresponding to swap entries on a numa=fake system
bears this out. (We could later tweak swap allocation to make it
even more likely, but this patch is merely about removing cruft.)
So, neither adjust nor increment addr in swapin_readahead, and
then shmem_swapin can use it too; the pseudo-vma to pass policy
need only be set up once per cluster, and so few fields of pvma
are used, let's skip the memset - from shmem_alloc_page also.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
mm/memory.c | 47 ++++++++++++++---------------------------------
mm/shmem.c | 43 ++++++++++++-------------------------------
2 files changed, 26 insertions(+), 64 deletions(-)
--- 2.6.23-rc8-mm2/mm/memory.c 2007-09-27 11:28:39.000000000 +0100
+++ patch1/mm/memory.c 2007-10-04 19:24:31.000000000 +0100
@@ -2011,45 +2011,26 @@ int vmtruncate_range(struct inode *inode
*/
void swapin_readahead(swp_entry_t entry, unsigned long addr,struct vm_area_struct *vma)
{
-#ifdef CONFIG_NUMA
- struct vm_area_struct *next_vma = vma ? vma->vm_next : NULL;
-#endif
- int i, num;
- struct page *new_page;
+ int nr_pages;
+ struct page *page;
unsigned long offset;
+ unsigned long end_offset;
/*
- * Get the number of handles we should do readahead io to.
+ * Get starting offset for readaround, and number of pages to read.
+ * Adjust starting address by readbehind (for NUMA interleave case)?
+ * No, it's very unlikely that swap layout would follow vma layout,
+ * more likely that neighbouring swap pages came from the same node:
+ * so use the same "addr" to choose the same node for each swap read.
*/
- num = valid_swaphandles(entry, &offset);
- for (i = 0; i < num; offset++, i++) {
+ nr_pages = valid_swaphandles(entry, &offset);
+ for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
/* Ok, do the async read-ahead now */
- new_page = read_swap_cache_async(swp_entry(swp_type(entry),
- offset), vma, addr);
- if (!new_page)
+ page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
+ vma, addr);
+ if (!page)
break;
- page_cache_release(new_page);
-#ifdef CONFIG_NUMA
- /*
- * Find the next applicable VMA for the NUMA policy.
- */
- addr += PAGE_SIZE;
- if (addr == 0)
- vma = NULL;
- if (vma) {
- if (addr >= vma->vm_end) {
- vma = next_vma;
- next_vma = vma ? vma->vm_next : NULL;
- }
- if (vma && addr < vma->vm_start)
- vma = NULL;
- } else {
- if (next_vma && addr >= next_vma->vm_start) {
- vma = next_vma;
- next_vma = vma->vm_next;
- }
- }
-#endif
+ page_cache_release(page);
}
lru_add_drain(); /* Push any new pages onto the LRU now */
}
--- 2.6.23-rc8-mm2/mm/shmem.c 2007-09-27 11:28:39.000000000 +0100
+++ patch1/mm/shmem.c 2007-10-04 19:24:31.000000000 +0100
@@ -1010,53 +1010,34 @@ out:
return err;
}
-static struct page *shmem_swapin_async(struct shared_policy *p,
+static struct page *shmem_swapin(struct shmem_inode_info *info,
swp_entry_t entry, unsigned long idx)
{
- struct page *page;
struct vm_area_struct pvma;
+ struct page *page;
/* Create a pseudo vma that just contains the policy */
- memset(&pvma, 0, sizeof(struct vm_area_struct));
- pvma.vm_end = PAGE_SIZE;
+ pvma.vm_start = 0;
pvma.vm_pgoff = idx;
- pvma.vm_policy = mpol_shared_policy_lookup(p, idx);
+ pvma.vm_ops = NULL;
+ pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
+ swapin_readahead(entry, 0, &pvma);
page = read_swap_cache_async(entry, &pvma, 0);
mpol_free(pvma.vm_policy);
return page;
}
-static struct page *shmem_swapin(struct shmem_inode_info *info,
- swp_entry_t entry, unsigned long idx)
-{
- struct shared_policy *p = &info->policy;
- int i, num;
- struct page *page;
- unsigned long offset;
-
- num = valid_swaphandles(entry, &offset);
- for (i = 0; i < num; offset++, i++) {
- page = shmem_swapin_async(p,
- swp_entry(swp_type(entry), offset), idx);
- if (!page)
- break;
- page_cache_release(page);
- }
- lru_add_drain(); /* Push any new pages onto the LRU now */
- return shmem_swapin_async(p, entry, idx);
-}
-
-static struct page *
-shmem_alloc_page(gfp_t gfp, struct shmem_inode_info *info,
- unsigned long idx)
+static struct page *shmem_alloc_page(gfp_t gfp, struct shmem_inode_info *info,
+ unsigned long idx)
{
struct vm_area_struct pvma;
struct page *page;
- memset(&pvma, 0, sizeof(struct vm_area_struct));
- pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
+ /* Create a pseudo vma that just contains the policy */
+ pvma.vm_start = 0;
pvma.vm_pgoff = idx;
- pvma.vm_end = PAGE_SIZE;
+ pvma.vm_ops = NULL;
+ pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
mpol_free(pvma.vm_policy);
return page;
--
* Re: [PATCH 1/7] swapin_readahead: excise NUMA bogosity
2007-10-06 20:38 ` [PATCH 1/7] swapin_readahead: excise NUMA bogosity Hugh Dickins
@ 2007-10-06 22:43 ` Rik van Riel
2007-10-07 22:05 ` Andi Kleen
2007-10-08 17:31 ` Christoph Lameter
2 siblings, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2007-10-06 22:43 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Andi Kleen, Christoph Lameter, Lee Schermerhorn,
linux-mm
On Sat, 6 Oct 2007 21:38:52 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:
> For three years swapin_readahead has been cluttered with fanciful
> CONFIG_NUMA code, advancing addr, and stepping on to the next vma
> at the boundary, to line up the mempolicy for each page allocation.
>
> It _might_ be a good idea to allocate swap more according to vma
> layout; but the fact is, that's not how we do it at all,
Indeed, it looks as if the swapin_readahead() that is upstream
at the moment was rewritten by somebody who never understood
what the original code did.
Let's rip that junk out.
Acked-by: Rik van Riel <riel@redhat.com>
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
--
* Re: [PATCH 1/7] swapin_readahead: excise NUMA bogosity
2007-10-06 20:38 ` [PATCH 1/7] swapin_readahead: excise NUMA bogosity Hugh Dickins
2007-10-06 22:43 ` Rik van Riel
@ 2007-10-07 22:05 ` Andi Kleen
2007-10-07 22:37 ` Rik van Riel
2007-10-08 17:31 ` Christoph Lameter
2 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2007-10-07 22:05 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Andi Kleen, Christoph Lameter, Lee Schermerhorn,
linux-mm
On Sat, Oct 06, 2007 at 09:38:52PM +0100, Hugh Dickins wrote:
> For three years swapin_readahead has been cluttered with fanciful
> CONFIG_NUMA code, advancing addr, and stepping on to the next vma
> at the boundary, to line up the mempolicy for each page allocation.
Ok. I guess I was naive when I wrote that and didn't consider
how badly swap fragments. It's ok to remove. I remember you complaining
about it some time ago, but somehow it never got changed.
In theory it could be fixed by migrating the page later,
but that would be somewhat more involved.
I suspect the real fix for this mess would probably be to never
swap in smaller than 1-2MB blocks of contiguous memory and then don't
do any readahead. That would likely fix the swap problems that were
discussed at KS too.
-Andi
--
* Re: [PATCH 1/7] swapin_readahead: excise NUMA bogosity
2007-10-07 22:05 ` Andi Kleen
@ 2007-10-07 22:37 ` Rik van Riel
0 siblings, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2007-10-07 22:37 UTC (permalink / raw)
To: Andi Kleen
Cc: Hugh Dickins, Andrew Morton, Christoph Lameter, Lee Schermerhorn,
linux-mm
On Mon, 8 Oct 2007 00:05:29 +0200
Andi Kleen <ak@suse.de> wrote:
> I suspect the real fix for this mess would be probably to never
> swap in smaller than 1-2MB blocks of continuous memory and then don't
> do any readahead. That would likely fix the swap problems that were
> discussed at KS too.
I suspect internal fragmentation may make that idea worse.
Malloc and free really don't try to keep related data near
each other in virtual memory. On the contrary, the anti-fragmentation
code in malloc libraries tends to do something
like slab and have quite the opposite result.
Swapping in somewhat large chunks (128kB? 256kB?) is probably
a good idea, but we should probably not expect really large
blocks to contain related userspace data.
Large readahead works for files because the data is related
and sequential access is common. Doing something like readahead
(dynamic chunk sizes?) on anonymous memory should be possible
though - we just need to keep track of some things on a per
VMA basis.
After all, we can measure how many of the read-around pages
actually get used by the VMA and how many don't. From that
data we can adjust the swapin chunk size on the fly.
For swapout we can simply look at which of the linearly nearby
pages from the VMA are also on the inactive list. If we find
a bunch of pages on the inactive list while some others are on
the active list, we can probably assume that the pages still on
the active list are part of another access pattern.
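
For illustration, a minimal sketch of that feedback loop (hypothetical:
the swapra_* fields and this helper do not exist in the kernel; only
page_cluster is a real tunable):

/*
 * Hypothetical per-VMA swapin readahead feedback, sketching the idea
 * above: grow the read-around window while most of the pages it pulls
 * in get used, shrink it while most of them are wasted.
 */
static unsigned long swapin_cluster_size(struct vm_area_struct *vma)
{
	unsigned int shift = page_cluster;	/* global default */

	if (vma) {
		if (vma->swapra_hits > 2 * vma->swapra_misses &&
		    vma->swapra_shift < page_cluster + 2)
			vma->swapra_shift++;
		else if (vma->swapra_misses > 2 * vma->swapra_hits &&
			 vma->swapra_shift > 0)
			vma->swapra_shift--;
		vma->swapra_hits = vma->swapra_misses = 0;
		shift = vma->swapra_shift;
	}
	return 1UL << shift;		/* pages to read around the fault */
}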
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
--
* Re: [PATCH 1/7] swapin_readahead: excise NUMA bogosity
2007-10-06 20:38 ` [PATCH 1/7] swapin_readahead: excise NUMA bogosity Hugh Dickins
2007-10-06 22:43 ` Rik van Riel
2007-10-07 22:05 ` Andi Kleen
@ 2007-10-08 17:31 ` Christoph Lameter
2007-10-08 17:35 ` Rik van Riel
2 siblings, 1 reply; 27+ messages in thread
From: Christoph Lameter @ 2007-10-08 17:31 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, Andi Kleen, Lee Schermerhorn, linux-mm
On Sat, 6 Oct 2007, Hugh Dickins wrote:
> For three years swapin_readahead has been cluttered with fanciful
> CONFIG_NUMA code, advancing addr, and stepping on to the next vma
> at the boundary, to line up the mempolicy for each page allocation.
Hmmm.. I thought that was restricted to shmem which has lots of other
issues due to shared memory policies that may then run into issues with
cpuset restrictions. I never looked at it. Likely due to us not caring too
much about swap.
Readahead for the page cache should work as an allocation in the context
of the currently running task following the task's memory policy, not the
vma memory policy. Thus there is no need to put a policy in there. So we
currently do not obey vma memory policy for page cache reads. VMA policies
are applied to anonymous pages. But if they go via swap then we have a
strange type of page here that is both. So the method of following task
policy could be a problem.
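
For illustration only (not part of this patch), that distinction in
2.6.23-era helpers:

	/* Page cache readahead: allocation follows the task's mempolicy */
	struct page *cache_page = page_cache_alloc(mapping);

	/* Anonymous fault: allocation follows the vma's mempolicy */
	struct page *anon_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);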
Maybe Lee can sort semantics out a bit better? I still think that this
whole area needs a fundamental overhaul so that policies work in a way
that does not have all these exceptions and strange side effects.
> But look at the equivalent shmem_swapin code: either by oversight
> or by design, though it has all the apparatus for choosing a new
> mempolicy per page, it uses the same idx throughout, choosing the
> same mempolicy and interleave node for each page of the cluster.
More confirmation that the shmem shared memory policy stuff is not quite
up to snuff...
--
* Re: [PATCH 1/7] swapin_readahead: excise NUMA bogosity
2007-10-08 17:31 ` Christoph Lameter
@ 2007-10-08 17:35 ` Rik van Riel
2007-10-08 17:41 ` Christoph Lameter
0 siblings, 1 reply; 27+ messages in thread
From: Rik van Riel @ 2007-10-08 17:35 UTC (permalink / raw)
To: Christoph Lameter
Cc: Hugh Dickins, Andrew Morton, Andi Kleen, Lee Schermerhorn,
linux-mm
On Mon, 8 Oct 2007 10:31:06 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Sat, 6 Oct 2007, Hugh Dickins wrote:
>
> > For three years swapin_readahead has been cluttered with fanciful
> > CONFIG_NUMA code, advancing addr, and stepping on to the next vma
> > at the boundary, to line up the mempolicy for each page allocation.
>
> Hmmm.. I thought that was restricted to shmem which has lots of other
> issues due to shared memory policies that may then run into issues with
> cpuset restrictions. I never looked at it. Likely due to us not
> caring too much about swap.
>
> Readahead for the page cache should work as an allocation in the
> context of the currently running task following the task's memory
> policy, not the vma memory policy. Thus there is no need to put a
> policy in there.
Due to the way swapin_readahead works (and how swapout works),
it can easily end up pulling in another task's memory with the
current task's NUMA allocation policy.
If that is an issue, we may want to change swapin_readahead to
access nearby ptes and divine swap entries from those, only
pulling in memory that really belongs to the current process.
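
For illustration, a rough sketch of what such pte-based readahead could
look like (the function, its eight-page window, and the omission of pte
locking and pmd-boundary handling are all hypothetical; only the helpers
it calls are real):

/*
 * Hypothetical sketch: read ahead only swap entries found in this
 * task's own page table around the faulting address, rather than
 * physically neighbouring swap slots.  Locking and pmd-boundary
 * checks are omitted for brevity.
 */
static void swapin_readahead_ptes(struct vm_area_struct *vma,
				  unsigned long addr, pmd_t *pmd)
{
	unsigned long start, end, a;
	pte_t *ptes;
	int i;

	/* Clamp a small read-around window to the vma */
	start = (addr - vma->vm_start > 4 * PAGE_SIZE) ?
				addr - 4 * PAGE_SIZE : vma->vm_start;
	end = (vma->vm_end - addr > 4 * PAGE_SIZE) ?
				addr + 4 * PAGE_SIZE : vma->vm_end;

	ptes = pte_offset_map(pmd, start);
	for (a = start, i = 0; a < end; a += PAGE_SIZE, i++) {
		pte_t pte = ptes[i];
		struct page *page;

		/* Only genuine swap ptes: skip present, empty, file ptes */
		if (pte_present(pte) || pte_none(pte) || pte_file(pte))
			continue;
		page = read_swap_cache_async(pte_to_swp_entry(pte), vma, a);
		if (page)
			page_cache_release(page);
	}
	pte_unmap(ptes);
	lru_add_drain();	/* push any new pages onto the LRU now */
}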
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
--
* Re: [PATCH 1/7] swapin_readahead: excise NUMA bogosity
2007-10-08 17:35 ` Rik van Riel
@ 2007-10-08 17:41 ` Christoph Lameter
2007-10-08 17:47 ` Rik van Riel
0 siblings, 1 reply; 27+ messages in thread
From: Christoph Lameter @ 2007-10-08 17:41 UTC (permalink / raw)
To: Rik van Riel
Cc: Hugh Dickins, Andrew Morton, Andi Kleen, Lee Schermerhorn,
linux-mm
On Mon, 8 Oct 2007, Rik van Riel wrote:
> Due to the way swapin_readahead works (and how swapout works),
> it can easily end up pulling in another task's memory with the
> current task's NUMA allocation policy.
I am not sure what you mean by "another task's memory"? How does memory
become owned by a task?
Having a variety of NUMA allocation strategies applied to the pages of one
file in memory is common for shared mmapped files like executables
already.
> If that is an issue, we may want to change swapin_readahead to
> access nearby ptes and divine swap entries from those, only
> pulling in memory that really belongs to the current process.
Well, let's keep it simple. The association of pages to a process is not
that easy to establish if a page is shared.
--
* Re: [PATCH 1/7] swapin_readahead: excise NUMA bogosity
2007-10-08 17:41 ` Christoph Lameter
@ 2007-10-08 17:47 ` Rik van Riel
2007-10-08 17:52 ` Christoph Lameter
0 siblings, 1 reply; 27+ messages in thread
From: Rik van Riel @ 2007-10-08 17:47 UTC (permalink / raw)
To: Christoph Lameter
Cc: Hugh Dickins, Andrew Morton, Andi Kleen, Lee Schermerhorn,
linux-mm
On Mon, 8 Oct 2007 10:41:26 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Mon, 8 Oct 2007, Rik van Riel wrote:
>
> > Due to the way swapin_readahead works (and how swapout works),
> > it can easily end up pulling in another task's memory with the
> > current task's NUMA allocation policy.
>
> I am not sure what you mean by "another task's memory"? How does
> memory become owned by a task?
Swapin_readahead simply reads in all swap pages that are physically
close to the desired one from the swap area, without taking into
account whether or not the swap entry belongs to the current task
or others.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
--
* Re: [PATCH 1/7] swapin_readahead: excise NUMA bogosity
2007-10-08 17:47 ` Rik van Riel
@ 2007-10-08 17:52 ` Christoph Lameter
2007-10-08 18:48 ` Rik van Riel
0 siblings, 1 reply; 27+ messages in thread
From: Christoph Lameter @ 2007-10-08 17:52 UTC (permalink / raw)
To: Rik van Riel
Cc: Hugh Dickins, Andrew Morton, Andi Kleen, Lee Schermerhorn,
linux-mm
On Mon, 8 Oct 2007, Rik van Riel wrote:
> > I am not sure what you mean by "another task's memory"? How does
> > memory become owned by a task?
>
> Swapin_readahead simply reads in all swap pages that are physically
> close to the desired one from the swap area, without taking into
> account whether or not the swap entry belongs to the current task
> or others.
That is the same approach used by regular readahead. Lee will only get
back later in the week. He should take a look at this since he is trying
to sort out the memory policy issues.
But that is more at the level of conceptual stuff that needs cleaning up.
The patch is okay from what I can see.
--
* Re: [PATCH 1/7] swapin_readahead: excise NUMA bogosity
2007-10-08 17:52 ` Christoph Lameter
@ 2007-10-08 18:48 ` Rik van Riel
0 siblings, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2007-10-08 18:48 UTC (permalink / raw)
To: Christoph Lameter
Cc: Hugh Dickins, Andrew Morton, Andi Kleen, Lee Schermerhorn,
linux-mm
On Mon, 8 Oct 2007 10:52:35 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Mon, 8 Oct 2007, Rik van Riel wrote:
>
> > > I am not sure what you mean by "another task's memory"? How does
> > > memory become owned by a task?
> >
> > Swapin_readahead simply reads in all swap pages that are physically
> > close to the desired one from the swap area, without taking into
> > account whether or not the swap entry belongs to the current task
> > or others.
>
> That is the same approach used by regular readahead.
Yes, but in regular readahead you can generally assume that the data
within one file will be related. You can make no such assumption with
swap.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
--
* [PATCH 2/7] swapin_readahead: move and rearrange args
2007-10-06 20:35 [PATCH 0/7] swapin/shmem patches Hugh Dickins
2007-10-06 20:38 ` [PATCH 1/7] swapin_readahead: excise NUMA bogosity Hugh Dickins
@ 2007-10-06 20:39 ` Hugh Dickins
2007-10-07 2:26 ` Rik van Riel
2007-10-06 20:43 ` [PATCH 3/7] swapin needs gfp_mask for loop on tmpfs Hugh Dickins
` (4 subsequent siblings)
6 siblings, 1 reply; 27+ messages in thread
From: Hugh Dickins @ 2007-10-06 20:39 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
swapin_readahead has never sat well in mm/memory.c: move it to
mm/swap_state.c beside its kindred read_swap_cache_async. Why
were its args in a different order? rearrange them. And since
it was always followed by a read_swap_cache_async of the target
page, fold that in and return struct page*. Then CONFIG_SWAP=n
no longer needs valid_swaphandles and read_swap_cache_async stubs.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
include/linux/swap.h | 19 ++++++----------
mm/memory.c | 45 ---------------------------------------
mm/shmem.c | 6 +----
mm/swap_state.c | 47 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 57 insertions(+), 60 deletions(-)
--- patch1/include/linux/swap.h 2007-09-27 11:28:37.000000000 +0100
+++ patch2/include/linux/swap.h 2007-10-04 19:24:33.000000000 +0100
@@ -159,9 +159,6 @@ struct swap_list_t {
/* Swap 50% full? Release swapcache more aggressively.. */
#define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
-/* linux/mm/memory.c */
-extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
-
/* linux/mm/page_alloc.c */
extern unsigned long totalram_pages;
extern unsigned long totalreserve_pages;
@@ -235,9 +232,12 @@ extern int move_from_swap_cache(struct p
struct address_space *);
extern void free_page_and_swap_cache(struct page *);
extern void free_pages_and_swap_cache(struct page **, int);
-extern struct page * lookup_swap_cache(swp_entry_t);
-extern struct page * read_swap_cache_async(swp_entry_t, struct vm_area_struct *vma,
- unsigned long addr);
+extern struct page *lookup_swap_cache(swp_entry_t);
+extern struct page *read_swap_cache_async(swp_entry_t,
+ struct vm_area_struct *vma, unsigned long addr);
+extern struct page *swapin_readahead(swp_entry_t,
+ struct vm_area_struct *vma, unsigned long addr);
+
/* linux/mm/swapfile.c */
extern long total_swap_pages;
extern unsigned int nr_swapfiles;
@@ -311,7 +311,7 @@ static inline void swap_free(swp_entry_t
{
}
-static inline struct page *read_swap_cache_async(swp_entry_t swp,
+static inline struct page *swapin_readahead(swp_entry_t swp,
struct vm_area_struct *vma, unsigned long addr)
{
return NULL;
@@ -322,11 +322,6 @@ static inline struct page *lookup_swap_c
return NULL;
}
-static inline int valid_swaphandles(swp_entry_t entry, unsigned long *offset)
-{
- return 0;
-}
-
#define can_share_swap_page(p) (page_mapcount(p) == 1)
static inline int move_to_swap_cache(struct page *page, swp_entry_t entry)
--- patch1/mm/memory.c 2007-10-04 19:24:31.000000000 +0100
+++ patch2/mm/memory.c 2007-10-04 19:24:33.000000000 +0100
@@ -1993,48 +1993,6 @@ int vmtruncate_range(struct inode *inode
return 0;
}
-/**
- * swapin_readahead - swap in pages in hope we need them soon
- * @entry: swap entry of this memory
- * @addr: address to start
- * @vma: user vma this addresses belong to
- *
- * Primitive swap readahead code. We simply read an aligned block of
- * (1 << page_cluster) entries in the swap area. This method is chosen
- * because it doesn't cost us any seek time. We also make sure to queue
- * the 'original' request together with the readahead ones...
- *
- * This has been extended to use the NUMA policies from the mm triggering
- * the readahead.
- *
- * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
- */
-void swapin_readahead(swp_entry_t entry, unsigned long addr,struct vm_area_struct *vma)
-{
- int nr_pages;
- struct page *page;
- unsigned long offset;
- unsigned long end_offset;
-
- /*
- * Get starting offset for readaround, and number of pages to read.
- * Adjust starting address by readbehind (for NUMA interleave case)?
- * No, it's very unlikely that swap layout would follow vma layout,
- * more likely that neighbouring swap pages came from the same node:
- * so use the same "addr" to choose the same node for each swap read.
- */
- nr_pages = valid_swaphandles(entry, &offset);
- for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
- /* Ok, do the async read-ahead now */
- page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
- vma, addr);
- if (!page)
- break;
- page_cache_release(page);
- }
- lru_add_drain(); /* Push any new pages onto the LRU now */
-}
-
/*
* We enter with non-exclusive mmap_sem (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -2062,8 +2020,7 @@ static int do_swap_page(struct mm_struct
page = lookup_swap_cache(entry);
if (!page) {
grab_swap_token(); /* Contend for token _before_ read-in */
- swapin_readahead(entry, address, vma);
- page = read_swap_cache_async(entry, vma, address);
+ page = swapin_readahead(entry, vma, address);
if (!page) {
/*
* Back out if somebody else faulted in this pte
--- patch1/mm/shmem.c 2007-10-04 19:24:31.000000000 +0100
+++ patch2/mm/shmem.c 2007-10-04 19:24:33.000000000 +0100
@@ -1021,8 +1021,7 @@ static struct page *shmem_swapin(struct
pvma.vm_pgoff = idx;
pvma.vm_ops = NULL;
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
- swapin_readahead(entry, 0, &pvma);
- page = read_swap_cache_async(entry, &pvma, 0);
+ page = swapin_readahead(entry, &pvma, 0);
mpol_free(pvma.vm_policy);
return page;
}
@@ -1052,8 +1051,7 @@ static inline int shmem_parse_mpol(char
static inline struct page *
shmem_swapin(struct shmem_inode_info *info,swp_entry_t entry,unsigned long idx)
{
- swapin_readahead(entry, 0, NULL);
- return read_swap_cache_async(entry, NULL, 0);
+ return swapin_readahead(entry, NULL, 0);
}
static inline struct page *
--- patch1/mm/swap_state.c 2007-09-27 11:28:39.000000000 +0100
+++ patch2/mm/swap_state.c 2007-10-04 19:24:33.000000000 +0100
@@ -10,6 +10,7 @@
#include <linux/mm.h>
#include <linux/kernel_stat.h>
#include <linux/swap.h>
+#include <linux/swapops.h>
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/buffer_head.h>
@@ -379,3 +380,49 @@ struct page *read_swap_cache_async(swp_e
page_cache_release(new_page);
return found_page;
}
+
+/**
+ * swapin_readahead - swap in pages in hope we need them soon
+ * @entry: swap entry of this memory
+ * @vma: user vma this address belongs to
+ * @addr: target address for mempolicy
+ *
+ * Returns the struct page for entry and addr, after queueing swapin.
+ *
+ * Primitive swap readahead code. We simply read an aligned block of
+ * (1 << page_cluster) entries in the swap area. This method is chosen
+ * because it doesn't cost us any seek time. We also make sure to queue
+ * the 'original' request together with the readahead ones...
+ *
+ * This has been extended to use the NUMA policies from the mm triggering
+ * the readahead.
+ *
+ * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
+ */
+struct page *swapin_readahead(swp_entry_t entry,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+ int nr_pages;
+ struct page *page;
+ unsigned long offset;
+ unsigned long end_offset;
+
+ /*
+ * Get starting offset for readaround, and number of pages to read.
+ * Adjust starting address by readbehind (for NUMA interleave case)?
+ * No, it's very unlikely that swap layout would follow vma layout,
+ * more likely that neighbouring swap pages came from the same node:
+ * so use the same "addr" to choose the same node for each swap read.
+ */
+ nr_pages = valid_swaphandles(entry, &offset);
+ for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
+ /* Ok, do the async read-ahead now */
+ page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
+ vma, addr);
+ if (!page)
+ break;
+ page_cache_release(page);
+ }
+ lru_add_drain(); /* Push any new pages onto the LRU now */
+ return read_swap_cache_async(entry, vma, addr);
+}
--
* Re: [PATCH 2/7] swapin_readahead: move and rearrange args
2007-10-06 20:39 ` [PATCH 2/7] swapin_readahead: move and rearrange args Hugh Dickins
@ 2007-10-07 2:26 ` Rik van Riel
0 siblings, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2007-10-07 2:26 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-mm
On Sat, 6 Oct 2007 21:39:44 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:
> swapin_readahead has never sat well in mm/memory.c: move it to
> mm/swap_state.c beside its kindred read_swap_cache_async. Why
> were its args in a different order? rearrange them. And since
> it was always followed by a read_swap_cache_async of the target
> page, fold that in and return struct page*. Then CONFIG_SWAP=n
> no longer needs valid_swaphandles and read_swap_cache_async stubs.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Rik van Riel <riel@redhat.com>
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
--
* [PATCH 3/7] swapin needs gfp_mask for loop on tmpfs
2007-10-06 20:35 [PATCH 0/7] swapin/shmem patches Hugh Dickins
2007-10-06 20:38 ` [PATCH 1/7] swapin_readahead: excise NUMA bogosity Hugh Dickins
2007-10-06 20:39 ` [PATCH 2/7] swapin_readahead: move and rearrange args Hugh Dickins
@ 2007-10-06 20:43 ` Hugh Dickins
2007-10-07 23:23 ` Rik van Riel
2007-10-08 13:52 ` Peter Zijlstra
2007-10-06 20:45 ` [PATCH 4/7] shmem: SGP_QUICK and SGP_FAULT redundant Hugh Dickins
` (3 subsequent siblings)
6 siblings, 2 replies; 27+ messages in thread
From: Hugh Dickins @ 2007-10-06 20:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Nick Piggin, Peter Zijlstra, Miklos Szeredi, Fengguang Wu,
linux-mm
Building in a filesystem on a loop device on a tmpfs file can hang when
swapping, the loop thread caught in that infamous throttle_vm_writeout.
In theory this is a long standing problem, which I've either never seen
in practice, or long ago suppressed the recollection, after discounting
my load and my tmpfs size as unrealistically high. But now, with the
new aops, it has become easy to hang on one machine.
Loop used to grab_cache_page before the old prepare_write to tmpfs,
which seems to have been enough to free up some memory for any swapin
needed; but the new write_begin lets tmpfs find or allocate the page
(much nicer, since grab_cache_page missed tmpfs pages in swapcache).
When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask,
which has __GFP_IO|__GFP_FS stripped off, and throttle_vm_writeout is
designed to break out when __GFP_IO or __GFP_FS is unset; but when tmpfs
swaps in, read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE
regardless of the mapping_gfp_mask - hence the hang.
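(For reference, the gfp restriction that loop applies when binding to a
backing file looks roughly like this; paraphrased for illustration, not
part of this patch:

	lo->old_gfp_mask = mapping_gfp_mask(mapping);
	mapping_set_gfp_mask(mapping,
			lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));

so allocations made on behalf of the loop thread must not recurse into
I/O or filesystem code.)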
So, pass gfp_mask down the line from shmem_getpage to shmem_swapin
to swapin_readahead to read_swap_cache_async to add_to_swap_cache.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
I did once see what looked like this hang on 2.6.23-rc3-git10,
and on some -mm's just _before_ the new aops went in; but those
cases were much harder to reproduce, and I've not seen it on any
2.6.23-rc since then, so haven't worried about it for 2.6.23.
I see there's currently lots of interest in throttle_vm_writeout,
though I've not been following in detail. CC'ed interested parties.
It's possible that forthcoming changes might fix my hang differently,
but there's currently so much ferment that I've not tried more than
Fengguang's moving the gfp_mask test (which unsurprisingly didn't help).
I'm sending this out now because I want the fix, and it does seem wrong
to have been ignoring mapping_gfp_mask here before (I think I was aware
of the deficiency when I first did loop on tmpfs, but reluctant to ask
for changes outside shmem.c when no problem was seen).
include/linux/swap.h | 6 +++---
mm/memory.c | 3 ++-
mm/shmem.c | 28 ++++++++++++++--------------
mm/swap_state.c | 18 +++++++++---------
mm/swapfile.c | 3 ++-
5 files changed, 30 insertions(+), 28 deletions(-)
--- patch2/include/linux/swap.h 2007-10-04 19:24:33.000000000 +0100
+++ patch3/include/linux/swap.h 2007-10-04 19:24:36.000000000 +0100
@@ -233,9 +233,9 @@ extern int move_from_swap_cache(struct p
extern void free_page_and_swap_cache(struct page *);
extern void free_pages_and_swap_cache(struct page **, int);
extern struct page *lookup_swap_cache(swp_entry_t);
-extern struct page *read_swap_cache_async(swp_entry_t,
+extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
struct vm_area_struct *vma, unsigned long addr);
-extern struct page *swapin_readahead(swp_entry_t,
+extern struct page *swapin_readahead(swp_entry_t, gfp_t,
struct vm_area_struct *vma, unsigned long addr);
/* linux/mm/swapfile.c */
@@ -311,7 +311,7 @@ static inline void swap_free(swp_entry_t
{
}
-static inline struct page *swapin_readahead(swp_entry_t swp,
+static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr)
{
return NULL;
--- patch2/mm/memory.c 2007-10-04 19:24:33.000000000 +0100
+++ patch3/mm/memory.c 2007-10-04 19:24:36.000000000 +0100
@@ -2020,7 +2020,8 @@ static int do_swap_page(struct mm_struct
page = lookup_swap_cache(entry);
if (!page) {
grab_swap_token(); /* Contend for token _before_ read-in */
- page = swapin_readahead(entry, vma, address);
+ page = swapin_readahead(entry,
+ GFP_HIGHUSER_MOVABLE, vma, address);
if (!page) {
/*
* Back out if somebody else faulted in this pte
--- patch2/mm/shmem.c 2007-10-04 19:24:33.000000000 +0100
+++ patch3/mm/shmem.c 2007-10-04 19:24:36.000000000 +0100
@@ -1010,8 +1010,8 @@ out:
return err;
}
-static struct page *shmem_swapin(struct shmem_inode_info *info,
- swp_entry_t entry, unsigned long idx)
+static struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp,
+ struct shmem_inode_info *info, unsigned long idx)
{
struct vm_area_struct pvma;
struct page *page;
@@ -1021,13 +1021,13 @@ static struct page *shmem_swapin(struct
pvma.vm_pgoff = idx;
pvma.vm_ops = NULL;
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
- page = swapin_readahead(entry, &pvma, 0);
+ page = swapin_readahead(entry, gfp, &pvma, 0);
mpol_free(pvma.vm_policy);
return page;
}
-static struct page *shmem_alloc_page(gfp_t gfp, struct shmem_inode_info *info,
- unsigned long idx)
+static struct page *shmem_alloc_page(gfp_t gfp,
+ struct shmem_inode_info *info, unsigned long idx)
{
struct vm_area_struct pvma;
struct page *page;
@@ -1048,14 +1048,14 @@ static inline int shmem_parse_mpol(char
return 1;
}
-static inline struct page *
-shmem_swapin(struct shmem_inode_info *info,swp_entry_t entry,unsigned long idx)
+static inline struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp,
+ struct shmem_inode_info *info, unsigned long idx)
{
- return swapin_readahead(entry, NULL, 0);
+ return swapin_readahead(entry, gfp, NULL, 0);
}
-static inline struct page *
-shmem_alloc_page(gfp_t gfp,struct shmem_inode_info *info, unsigned long idx)
+static inline struct page *shmem_alloc_page(gfp_t gfp,
+ struct shmem_inode_info *info, unsigned long idx)
{
return alloc_page(gfp | __GFP_ZERO);
}
@@ -1078,6 +1078,7 @@ static int shmem_getpage(struct inode *i
struct page *swappage;
swp_entry_t *entry;
swp_entry_t swap;
+ gfp_t gfp;
int error;
if (idx >= SHMEM_MAX_INDEX)
@@ -1102,6 +1103,7 @@ repeat:
error = 0;
if (sgp == SGP_QUICK)
goto failed;
+ gfp = mapping_gfp_mask(mapping);
spin_lock(&info->lock);
shmem_recalc_inode(inode);
@@ -1124,7 +1126,7 @@ repeat:
*type |= VM_FAULT_MAJOR;
}
spin_unlock(&info->lock);
- swappage = shmem_swapin(info, swap, idx);
+ swappage = shmem_swapin(swap, gfp, info, idx);
if (!swappage) {
spin_lock(&info->lock);
entry = shmem_swp_alloc(info, idx, sgp);
@@ -1236,9 +1238,7 @@ repeat:
if (!filepage) {
spin_unlock(&info->lock);
- filepage = shmem_alloc_page(mapping_gfp_mask(mapping),
- info,
- idx);
+ filepage = shmem_alloc_page(gfp, info, idx);
if (!filepage) {
shmem_unacct_blocks(info->flags, 1);
shmem_free_blocks(inode, 1);
--- patch2/mm/swap_state.c 2007-10-04 19:24:33.000000000 +0100
+++ patch3/mm/swap_state.c 2007-10-04 19:24:36.000000000 +0100
@@ -106,7 +106,8 @@ out:
return error;
}
-static int add_to_swap_cache(struct page *page, swp_entry_t entry)
+static int add_to_swap_cache(struct page *page, swp_entry_t entry,
+ gfp_t gfp_mask)
{
int error;
@@ -116,7 +117,7 @@ static int add_to_swap_cache(struct page
return -ENOENT;
}
SetPageLocked(page);
- error = __add_to_swap_cache(page, entry, GFP_KERNEL);
+ error = __add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL);
/*
* Anon pages are already on the LRU, we don't run lru_cache_add here.
*/
@@ -329,7 +330,7 @@ struct page * lookup_swap_cache(swp_entr
* A failure return means that either the page allocation failed or that
* the swap entry is no longer in use.
*/
-struct page *read_swap_cache_async(swp_entry_t entry,
+struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr)
{
struct page *found_page, *new_page = NULL;
@@ -349,8 +350,7 @@ struct page *read_swap_cache_async(swp_e
* Get a new page to read into from swap.
*/
if (!new_page) {
- new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
- vma, addr);
+ new_page = alloc_page_vma(gfp_mask, vma, addr);
if (!new_page)
break; /* Out of memory */
}
@@ -365,7 +365,7 @@ struct page *read_swap_cache_async(swp_e
* the just freed swap entry for an existing page.
* May fail (-ENOMEM) if radix-tree node allocation failed.
*/
- err = add_to_swap_cache(new_page, entry);
+ err = add_to_swap_cache(new_page, entry, gfp_mask);
if (!err) {
/*
* Initiate read into locked page and return.
@@ -399,7 +399,7 @@ struct page *read_swap_cache_async(swp_e
*
* Caller must hold down_read on the vma->vm_mm if vma is not NULL.
*/
-struct page *swapin_readahead(swp_entry_t entry,
+struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr)
{
int nr_pages;
@@ -418,11 +418,11 @@ struct page *swapin_readahead(swp_entry_
for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
/* Ok, do the async read-ahead now */
page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
- vma, addr);
+ gfp_mask, vma, addr);
if (!page)
break;
page_cache_release(page);
}
lru_add_drain(); /* Push any new pages onto the LRU now */
- return read_swap_cache_async(entry, vma, addr);
+ return read_swap_cache_async(entry, gfp_mask, vma, addr);
}
--- patch2/mm/swapfile.c 2007-09-27 11:28:39.000000000 +0100
+++ patch3/mm/swapfile.c 2007-10-04 19:24:36.000000000 +0100
@@ -737,7 +737,8 @@ static int try_to_unuse(unsigned int typ
*/
swap_map = &si->swap_map[i];
entry = swp_entry(type, i);
- page = read_swap_cache_async(entry, NULL, 0);
+ page = read_swap_cache_async(entry,
+ GFP_HIGHUSER_MOVABLE, NULL, 0);
if (!page) {
/*
* Either swap_duplicate() failed because entry
--
* Re: [PATCH 3/7] swapin needs gfp_mask for loop on tmpfs
2007-10-06 20:43 ` [PATCH 3/7] swapin needs gfp_mask for loop on tmpfs Hugh Dickins
@ 2007-10-07 23:23 ` Rik van Riel
2007-10-08 13:52 ` Peter Zijlstra
1 sibling, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2007-10-07 23:23 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Nick Piggin, Peter Zijlstra, Miklos Szeredi,
Fengguang Wu, linux-mm
On Sat, 6 Oct 2007 21:43:36 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:
> So, pass gfp_mask down the line from shmem_getpage to shmem_swapin
> to swapin_readahead to read_swap_cache_async to add_to_swap_cache.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Rik van Riel <riel@redhat.com>
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
--
* Re: [PATCH 3/7] swapin needs gfp_mask for loop on tmpfs
2007-10-06 20:43 ` [PATCH 3/7] swapin needs gfp_mask for loop on tmpfs Hugh Dickins
2007-10-07 23:23 ` Rik van Riel
@ 2007-10-08 13:52 ` Peter Zijlstra
1 sibling, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2007-10-08 13:52 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Nick Piggin, Miklos Szeredi, Fengguang Wu,
linux-mm
On Sat, 2007-10-06 at 21:43 +0100, Hugh Dickins wrote:
> Building in a filesystem on a loop device on a tmpfs file can hang when
> swapping, the loop thread caught in that infamous throttle_vm_writeout.
>
> In theory this is a long standing problem, which I've either never seen
> in practice, or long ago suppressed the recollection, after discounting
> my load and my tmpfs size as unrealistically high. But now, with the
> new aops, it has become easy to hang on one machine.
>
> Loop used to grab_cache_page before the old prepare_write to tmpfs,
> which seems to have been enough to free up some memory for any swapin
> needed; but the new write_begin lets tmpfs find or allocate the page
> (much nicer, since grab_cache_page missed tmpfs pages in swapcache).
>
> When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask,
> which has __GFP_IO|__GFP_FS stripped off, and throttle_vm_writeout is
> designed to break out when __GFP_IO or __GFP_FS is unset; but when tmpfs
> swaps in, read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE
> regardless of the mapping_gfp_mask - hence the hang.
>
> So, pass gfp_mask down the line from shmem_getpage to shmem_swapin
> to swapin_readahead to read_swap_cache_async to add_to_swap_cache.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
* [PATCH 4/7] shmem: SGP_QUICK and SGP_FAULT redundant
2007-10-06 20:35 [PATCH 0/7] swapin/shmem patches Hugh Dickins
` (2 preceding siblings ...)
2007-10-06 20:43 ` [PATCH 3/7] swapin needs gfp_mask for loop on tmpfs Hugh Dickins
@ 2007-10-06 20:45 ` Hugh Dickins
2007-10-07 23:23 ` Rik van Riel
2007-10-06 20:46 ` [PATCH 5/7] shmem_getpage return page locked Hugh Dickins
` (2 subsequent siblings)
6 siblings, 1 reply; 27+ messages in thread
From: Hugh Dickins @ 2007-10-06 20:45 UTC (permalink / raw)
To: Andrew Morton; +Cc: Nick Piggin, linux-mm
Remove SGP_QUICK from the sgp_type enum: it was for shmem_populate and
has no users now. Remove SGP_FAULT from the enum: SGP_CACHE does just
as well (and shmem_getpage is about to return with page always locked).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
mm/shmem.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
--- patch3/mm/shmem.c 2007-10-04 19:24:36.000000000 +0100
+++ patch4/mm/shmem.c 2007-10-04 19:24:39.000000000 +0100
@@ -78,11 +78,9 @@
/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
enum sgp_type {
- SGP_QUICK, /* don't try more than file page cache lookup */
SGP_READ, /* don't exceed i_size, don't allocate page */
SGP_CACHE, /* don't exceed i_size, may allocate page */
SGP_WRITE, /* may exceed i_size, may allocate page */
- SGP_FAULT, /* same as SGP_CACHE, return with page locked */
};
static int shmem_getpage(struct inode *inode, unsigned long idx,
@@ -1101,8 +1099,6 @@ repeat:
if (filepage && PageUptodate(filepage))
goto done;
error = 0;
- if (sgp == SGP_QUICK)
- goto failed;
gfp = mapping_gfp_mask(mapping);
spin_lock(&info->lock);
@@ -1276,7 +1272,7 @@ repeat:
done:
if (*pagep != filepage) {
*pagep = filepage;
- if (sgp != SGP_FAULT)
+ if (sgp != SGP_CACHE)
unlock_page(filepage);
}
@@ -1299,7 +1295,7 @@ static int shmem_fault(struct vm_area_st
if (((loff_t)vmf->pgoff << PAGE_CACHE_SHIFT) >= i_size_read(inode))
return VM_FAULT_SIGBUS;
- error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_FAULT, &ret);
+ error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
if (error)
return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
--
* Re: [PATCH 4/7] shmem: SGP_QUICK and SGP_FAULT redundant
2007-10-06 20:45 ` [PATCH 4/7] shmem: SGP_QUICK and SGP_FAULT redundant Hugh Dickins
@ 2007-10-07 23:23 ` Rik van Riel
0 siblings, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2007-10-07 23:23 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, linux-mm
On Sat, 6 Oct 2007 21:45:12 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:
> Remove SGP_QUICK from the sgp_type enum: it was for shmem_populate and
> has no users now. Remove SGP_FAULT from the enum: SGP_CACHE does just
> as well (and shmem_getpage is about to return with page always
> locked).
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Rik van Riel <riel@redhat.com>
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
--
* [PATCH 5/7] shmem_getpage return page locked
2007-10-06 20:35 [PATCH 0/7] swapin/shmem patches Hugh Dickins
` (3 preceding siblings ...)
2007-10-06 20:45 ` [PATCH 4/7] shmem: SGP_QUICK and SGP_FAULT redundant Hugh Dickins
@ 2007-10-06 20:46 ` Hugh Dickins
2007-10-07 8:01 ` Nick Piggin
2007-10-08 0:44 ` Rik van Riel
2007-10-06 20:47 ` [PATCH 6/7] shmem_file_write is redundant Hugh Dickins
2007-10-06 20:48 ` [PATCH 7/7] swapin: fix valid_swaphandles defect Hugh Dickins
6 siblings, 2 replies; 27+ messages in thread
From: Hugh Dickins @ 2007-10-06 20:46 UTC (permalink / raw)
To: Andrew Morton; +Cc: Nick Piggin, linux-mm
In the new aops, write_begin is supposed to return the page locked:
though I've seen no ill effects, that's been overlooked in the case
of shmem_write_begin, and should be fixed. Then shmem_write_end must
unlock the page: do so _after_ updating i_size, as we found to be
important in other filesystems (though since shmem pages don't go
the usual writeback route, they never suffered from that corruption).
For shmem_write_begin to return the page locked, we need shmem_getpage
to return the page locked in SGP_WRITE case as well as SGP_CACHE case:
let's simplify the interface and return it locked even when SGP_READ.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
mm/shmem.c | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)
--- patch4/mm/shmem.c 2007-10-04 19:24:39.000000000 +0100
+++ patch5/mm/shmem.c 2007-10-04 19:24:41.000000000 +0100
@@ -729,6 +729,8 @@ static int shmem_notify_change(struct de
(void) shmem_getpage(inode,
attr->ia_size>>PAGE_CACHE_SHIFT,
&page, SGP_READ, NULL);
+ if (page)
+ unlock_page(page);
}
/*
* Reset SHMEM_PAGEIN flag so that shmem_truncate can
@@ -1270,12 +1272,7 @@ repeat:
SetPageUptodate(filepage);
}
done:
- if (*pagep != filepage) {
- *pagep = filepage;
- if (sgp != SGP_CACHE)
- unlock_page(filepage);
-
- }
+ *pagep = filepage;
return 0;
failed:
@@ -1453,12 +1450,13 @@ shmem_write_end(struct file *file, struc
{
struct inode *inode = mapping->host;
+ if (pos + copied > inode->i_size)
+ i_size_write(inode, pos + copied);
+
+ unlock_page(page);
set_page_dirty(page);
page_cache_release(page);
- if (pos+copied > inode->i_size)
- i_size_write(inode, pos+copied);
-
return copied;
}
@@ -1513,6 +1511,7 @@ shmem_file_write(struct file *file, cons
if (err)
break;
+ unlock_page(page);
left = bytes;
if (PageHighMem(page)) {
volatile unsigned char dummy;
@@ -1594,6 +1593,8 @@ static void do_shmem_file_read(struct fi
desc->error = 0;
break;
}
+ if (page)
+ unlock_page(page);
/*
* We must evaluate after, since reads (unlike writes)
@@ -1883,6 +1884,7 @@ static int shmem_symlink(struct inode *d
iput(inode);
return error;
}
+ unlock_page(page);
inode->i_op = &shmem_symlink_inode_operations;
kaddr = kmap_atomic(page, KM_USER0);
memcpy(kaddr, symname, len);
@@ -1910,6 +1912,8 @@ static void *shmem_follow_link(struct de
struct page *page = NULL;
int res = shmem_getpage(dentry->d_inode, 0, &page, SGP_READ, NULL);
nd_set_link(nd, res ? ERR_PTR(res) : kmap(page));
+ if (page)
+ unlock_page(page);
return page;
}
--
* Re: [PATCH 5/7] shmem_getpage return page locked
2007-10-06 20:46 ` [PATCH 5/7] shmem_getpage return page locked Hugh Dickins
@ 2007-10-07 8:01 ` Nick Piggin
2007-10-08 12:05 ` Hugh Dickins
2007-10-08 0:44 ` Rik van Riel
1 sibling, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2007-10-07 8:01 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-mm
On Sunday 07 October 2007 06:46, Hugh Dickins wrote:
> In the new aops, write_begin is supposed to return the page locked:
> though I've seen no ill effects, that's been overlooked in the case
> of shmem_write_begin, and should be fixed. Then shmem_write_end must
> unlock the page: do so _after_ updating i_size, as we found to be
> important in other filesystems (though since shmem pages don't go
> the usual writeback route, they never suffered from that corruption).
I guess my thinking on this is that write_begin doesn't actually _have_
to return the page locked, it just has to return the page in a state where
it may be written into.
Generic callers obviously cannot assume that the page *isn't* locked,
but I can't think it would be too helpful for them to be able to assume
the page is locked (they already have a ref, which prevents reclaim;
and i_mutex, which prevents truncate).
However, this does make tmpfs apis a little simpler and in general is more
like other filesystems, so I have absolutely no problems with it.
I think the other patches are pretty fine too, and really like that you were
able to remove shmem_file_write!
--
* Re: [PATCH 5/7] shmem_getpage return page locked
2007-10-07 8:01 ` Nick Piggin
@ 2007-10-08 12:05 ` Hugh Dickins
2007-10-08 7:08 ` Nick Piggin
0 siblings, 1 reply; 27+ messages in thread
From: Hugh Dickins @ 2007-10-08 12:05 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-mm
On Sun, 7 Oct 2007, Nick Piggin wrote:
> On Sunday 07 October 2007 06:46, Hugh Dickins wrote:
> > In the new aops, write_begin is supposed to return the page locked:
> > though I've seen no ill effects, that's been overlooked in the case
> > of shmem_write_begin, and should be fixed. Then shmem_write_end must
> > unlock the page: do so _after_ updating i_size, as we found to be
> > important in other filesystems (though since shmem pages don't go
> > the usual writeback route, they never suffered from that corruption).
>
> I guess my thinking on this is that write_begin doesn't actually _have_
> to return the page locked, it just has to return the page in a state where
> it may be written into.
Ah, I hadn't appreciated that you were being intentionally permissive
on that: I'm not sure whether that's a good idea or not. Were there
any other filesystems than tmpfs in which the write_begin did not
return with the page locked?
> Generic callers obviously cannot assume that the page *isn't* locked,
> but I can't think it would be too helpful for them to be able to assume
> the page is locked (they already have a ref, which prevents reclaim;
> and i_mutex, which prevents truncate).
Well, we found before that __mpage_writepage is liable to erase data
just written at end of file, unless i_size was raised while still
holding the page lock across the writing. tmpfs doesn't go
that way, but most(?) filesystems do.
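
In generic write_end terms, the safe ordering is roughly (illustration
only, not quoted from any particular filesystem):

	if (pos + copied > inode->i_size)
		i_size_write(inode, pos + copied);	/* extend i_size first */
	unlock_page(page);				/* only then drop the lock */
	set_page_dirty(page);
	page_cache_release(page);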
> However, this does make tmpfs apis a little simpler and in general is more
> like other filesystems, so I have absolutely no problems with it.
I do feel more comfortable with tmpfs doing that like the
majority. It's true that it was happy to write without holding
the page lock when it went its own way, but now that it's using
write_begin and write_end with generic code above and between
them, I feel safer doing the common thing.
> I think the other patches are pretty fine too, and really like that you were
> able to remove shmem_file_write!
Thanks, I was pleased with the diffstat. I'm hoping that in due
course you'll find good reason to come up with a replacement for the
readpage aop, one that doesn't demand the struct page be passed in:
then I can remove shmem_file_read too, and the nastiest part of
shmem_getpage, where it sometimes has to memcpy from swapcache
page to the page passed in; maybe more.
Hugh
--
* Re: [PATCH 5/7] shmem_getpage return page locked
2007-10-08 12:05 ` Hugh Dickins
@ 2007-10-08 7:08 ` Nick Piggin
0 siblings, 0 replies; 27+ messages in thread
From: Nick Piggin @ 2007-10-08 7:08 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-mm
On Monday 08 October 2007 22:05, Hugh Dickins wrote:
> On Sun, 7 Oct 2007, Nick Piggin wrote:
> > On Sunday 07 October 2007 06:46, Hugh Dickins wrote:
> > > In the new aops, write_begin is supposed to return the page locked:
> > > though I've seen no ill effects, that's been overlooked in the case
> > > of shmem_write_begin, and should be fixed. Then shmem_write_end must
> > > unlock the page: do so _after_ updating i_size, as we found to be
> > > important in other filesystems (though since shmem pages don't go
> > > the usual writeback route, they never suffered from that corruption).
> >
> > I guess my thinking on this is that write_begin doesn't actually _have_
> > to return the page locked, it just has to return the page in a state
> > where it may be written into.
>
> Ah, I hadn't appreciated that you were being intentionally permissive
> on that: I'm not sure whether that's a good idea or not. Were there
> any other filesystems than tmpfs in which the write_begin did not
> return with the page locked?
I don't believe so... I wasn't trying to be particularly tricky when doing
the conversion, just noticed the comment that you had no need for the
page lock over the copy.
> > Generic callers obviously cannot assume that the page *isn't* locked,
> > but I can't think it would be too helpful for them to be able to assume
> > the page is locked (they already have a ref, which prevents reclaim;
> > and i_mutex, which prevents truncate).
>
> Well, we found before that __mpage_writepage is liable to erase data
> just written at end of file, unless i_size was raised while the page
> lock was still held across the writing. tmpfs doesn't go that way,
> but most(?) filesystems do.
True. When you get into real filesystem territory, there are other things
the page lock is needed for. But as far as the VM goes, there isn't so
much.
> > However, this does make tmpfs apis a little simpler and in general is
> > more like other filesystems, so I have absolutely no problems with it.
>
> I do feel more comfortable with tmpfs doing that like the
> majority. It's true that it was happy to write without holding
> the page lock when it went its own way, but now that it's using
> write_begin and write_end with generic code above and between
> them, I feel safer doing the common thing.
>
> > I think the other patches are pretty fine too, and really like that you
> > were able to remove shmem_file_write!
>
> Thanks, I was pleased with the diffstat. I'm hoping that in due
> course you'll find good reason to come up with a replacement for the
> readpage aop, one that doesn't demand the struct page be passed in:
> then I can remove shmem_file_read too, and the nastiest part of
> shmem_getpage, where it sometimes has to memcpy from swapcache
> page to the page passed in; maybe more.
Yeah, readpage can get replaced in a similar way. I think several other
filesystems would be pretty happy with that too...
* Re: [PATCH 5/7] shmem_getpage return page locked
2007-10-06 20:46 ` [PATCH 5/7] shmem_getpage return page locked Hugh Dickins
2007-10-07 8:01 ` Nick Piggin
@ 2007-10-08 0:44 ` Rik van Riel
1 sibling, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2007-10-08 0:44 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, linux-mm
On Sat, 6 Oct 2007 21:46:33 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Rik van Riel <riel@redhat.com>
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* [PATCH 6/7] shmem_file_write is redundant
2007-10-06 20:35 [PATCH 0/7] swapin/shmem patches Hugh Dickins
` (4 preceding siblings ...)
2007-10-06 20:46 ` [PATCH 5/7] shmem_getpage return page locked Hugh Dickins
@ 2007-10-06 20:47 ` Hugh Dickins
2007-10-08 0:46 ` Rik van Riel
2007-10-06 20:48 ` [PATCH 7/7] swapin: fix valid_swaphandles defect Hugh Dickins
6 siblings, 1 reply; 27+ messages in thread
From: Hugh Dickins @ 2007-10-06 20:47 UTC (permalink / raw)
To: Andrew Morton; +Cc: Nick Piggin, linux-mm
With the old aops, writing to a tmpfs file had to use its own special
method: the generic method would pass in a fresh page to prepare_write
when the right page was there in swapcache - which was inefficient to
handle, even once we'd concocted the code to handle it.
With the new aops, the generic method uses shmem_write_end, which lets
shmem_getpage find the right page: so now abandon shmem_file_write in
favour of the generic method. Yes, that does do several things that
tmpfs hasn't really needed (notably balance_dirty_pages_ratelimited,
which ramfs also calls); but more use of common code is preferable.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
mm/shmem.c | 109 +--------------------------------------------------
1 file changed, 3 insertions(+), 106 deletions(-)
--- patch5/mm/shmem.c 2007-10-04 19:24:41.000000000 +0100
+++ patch6/mm/shmem.c 2007-10-04 19:24:44.000000000 +0100
@@ -1091,7 +1091,7 @@ static int shmem_getpage(struct inode *i
* Normally, filepage is NULL on entry, and either found
* uptodate immediately, or allocated and zeroed, or read
* in under swappage, which is then assigned to filepage.
- * But shmem_readpage and shmem_write_begin pass in a locked
+ * But shmem_readpage (required for splice) passes in a locked
* filepage, which may be found not uptodate by other callers
* too, and may need to be copied from the swappage read in.
*/
@@ -1460,110 +1460,6 @@ shmem_write_end(struct file *file, struc
return copied;
}
-static ssize_t
-shmem_file_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
-{
- struct inode *inode = file->f_path.dentry->d_inode;
- loff_t pos;
- unsigned long written;
- ssize_t err;
-
- if ((ssize_t) count < 0)
- return -EINVAL;
-
- if (!access_ok(VERIFY_READ, buf, count))
- return -EFAULT;
-
- mutex_lock(&inode->i_mutex);
-
- pos = *ppos;
- written = 0;
-
- err = generic_write_checks(file, &pos, &count, 0);
- if (err || !count)
- goto out;
-
- err = remove_suid(file->f_path.dentry);
- if (err)
- goto out;
-
- inode->i_ctime = inode->i_mtime = CURRENT_TIME;
-
- do {
- struct page *page = NULL;
- unsigned long bytes, index, offset;
- char *kaddr;
- int left;
-
- offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
- index = pos >> PAGE_CACHE_SHIFT;
- bytes = PAGE_CACHE_SIZE - offset;
- if (bytes > count)
- bytes = count;
-
- /*
- * We don't hold page lock across copy from user -
- * what would it guard against? - so no deadlock here.
- * But it still may be a good idea to prefault below.
- */
-
- err = shmem_getpage(inode, index, &page, SGP_WRITE, NULL);
- if (err)
- break;
-
- unlock_page(page);
- left = bytes;
- if (PageHighMem(page)) {
- volatile unsigned char dummy;
- __get_user(dummy, buf);
- __get_user(dummy, buf + bytes - 1);
-
- kaddr = kmap_atomic(page, KM_USER0);
- left = __copy_from_user_inatomic(kaddr + offset,
- buf, bytes);
- kunmap_atomic(kaddr, KM_USER0);
- }
- if (left) {
- kaddr = kmap(page);
- left = __copy_from_user(kaddr + offset, buf, bytes);
- kunmap(page);
- }
-
- written += bytes;
- count -= bytes;
- pos += bytes;
- buf += bytes;
- if (pos > inode->i_size)
- i_size_write(inode, pos);
-
- flush_dcache_page(page);
- set_page_dirty(page);
- mark_page_accessed(page);
- page_cache_release(page);
-
- if (left) {
- pos -= left;
- written -= left;
- err = -EFAULT;
- break;
- }
-
- /*
- * Our dirty pages are not counted in nr_dirty,
- * and we do not attempt to balance dirty pages.
- */
-
- cond_resched();
- } while (count);
-
- *ppos = pos;
- if (written)
- err = written;
-out:
- mutex_unlock(&inode->i_mutex);
- return err;
-}
-
static void do_shmem_file_read(struct file *filp, loff_t *ppos, read_descriptor_t *desc, read_actor_t actor)
{
struct inode *inode = filp->f_path.dentry->d_inode;
@@ -2338,7 +2234,8 @@ static const struct file_operations shme
#ifdef CONFIG_TMPFS
.llseek = generic_file_llseek,
.read = shmem_file_read,
- .write = shmem_file_write,
+ .write = do_sync_write,
+ .aio_write = generic_file_aio_write,
.fsync = simple_sync_file,
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
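For reference, the write_begin side that makes the old prepare_write
dance unnecessary is roughly this (a sketch of the idea, not quoted
from the tree):

static int
shmem_write_begin(struct file *file, struct address_space *mapping,
                loff_t pos, unsigned len, unsigned flags,
                struct page **pagep, void **fsdata)
{
        struct inode *inode = mapping->host;
        pgoff_t index = pos >> PAGE_CACHE_SHIFT;

        *pagep = NULL;
        /* shmem_getpage finds the right page, even if it's in swapcache */
        return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
}

Since shmem_getpage hands back whichever page is appropriate for that
index, the generic write path never needs the fresh-page-over-swapcache
handling that shmem_file_write existed to avoid.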
* Re: [PATCH 6/7] shmem_file_write is redundant
2007-10-06 20:47 ` [PATCH 6/7] shmem_file_write is redundant Hugh Dickins
@ 2007-10-08 0:46 ` Rik van Riel
0 siblings, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2007-10-08 0:46 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, linux-mm
On Sat, 6 Oct 2007 21:47:48 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:
> With the new aops, the generic method uses shmem_write_end, which lets
> shmem_getpage find the right page: so now abandon shmem_file_write in
> favour of the generic method. Yes, that does do several things that
> tmpfs hasn't really needed (notably balance_dirty_pages_ratelimited,
> which ramfs also calls); but more use of common code is preferable.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Rik van Riel <riel@redhat.com>
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* [PATCH 7/7] swapin: fix valid_swaphandles defect
2007-10-06 20:35 [PATCH 0/7] swapin/shmem patches Hugh Dickins
` (5 preceding siblings ...)
2007-10-06 20:47 ` [PATCH 6/7] shmem_file_write is redundant Hugh Dickins
@ 2007-10-06 20:48 ` Hugh Dickins
2007-10-08 1:14 ` Rik van Riel
6 siblings, 1 reply; 27+ messages in thread
From: Hugh Dickins @ 2007-10-06 20:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
valid_swaphandles is supposed to do a quick pass over the swap map
entries neighbouring the entry which swapin_readahead is targeting,
to determine for it a range worth reading all together. But since
it always starts its search from the beginning of the swap "cluster",
a reject (free entry) there immediately curtails the readaround, and
every swapin_readahead from that cluster is for just a single page.
Instead scan forwards and backwards around the target entry.
Use better names for some variables: a swap_info pointer is usually
called "si" not "swapdev". And at the end, if only the target page
should be read, return count of 0 to disable readaround, to avoid
the unnecessarily repeated call to read_swap_cache_async.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
mm/swapfile.c | 49 ++++++++++++++++++++++++++++++++----------------
1 file changed, 33 insertions(+), 16 deletions(-)
--- patch6/mm/swapfile.c 2007-10-04 19:24:36.000000000 +0100
+++ patch7/mm/swapfile.c 2007-10-04 19:24:46.000000000 +0100
@@ -1776,31 +1776,48 @@ get_swap_info_struct(unsigned type)
*/
int valid_swaphandles(swp_entry_t entry, unsigned long *offset)
{
+ struct swap_info_struct *si;
int our_page_cluster = page_cluster;
- int ret = 0, i = 1 << our_page_cluster;
- unsigned long toff;
- struct swap_info_struct *swapdev = swp_type(entry) + swap_info;
+ pgoff_t target, toff;
+ pgoff_t base, end;
+ int nr_pages = 0;
if (!our_page_cluster) /* no readahead */
return 0;
- toff = (swp_offset(entry) >> our_page_cluster) << our_page_cluster;
- if (!toff) /* first page is swap header */
- toff++, i--;
- *offset = toff;
+
+ si = &swap_info[swp_type(entry)];
+ target = swp_offset(entry);
+ base = (target >> our_page_cluster) << our_page_cluster;
+ end = base + (1 << our_page_cluster);
+ if (!base) /* first page is swap header */
+ base++;
spin_lock(&swap_lock);
- do {
- /* Don't read-ahead past the end of the swap area */
- if (toff >= swapdev->max)
+ if (end > si->max) /* don't go beyond end of map */
+ end = si->max;
+
+ /* Count contiguous allocated slots above our target */
+ for (toff = target; ++toff < end; nr_pages++) {
+ /* Don't read in free or bad pages */
+ if (!si->swap_map[toff])
break;
+ if (si->swap_map[toff] == SWAP_MAP_BAD)
+ break;
+ }
+ /* Count contiguous allocated slots below our target */
+ for (toff = target; --toff >= base; nr_pages++) {
/* Don't read in free or bad pages */
- if (!swapdev->swap_map[toff])
+ if (!si->swap_map[toff])
break;
- if (swapdev->swap_map[toff] == SWAP_MAP_BAD)
+ if (si->swap_map[toff] == SWAP_MAP_BAD)
break;
- toff++;
- ret++;
- } while (--i);
+ }
spin_unlock(&swap_lock);
- return ret;
+
+ /*
+ * Indicate starting offset, and return number of pages to get:
+ * if only 1, say 0, since there's then no readahead to be done.
+ */
+ *offset = ++toff;
+ return nr_pages? ++nr_pages: 0;
}
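To see the difference this makes, here's a small user-space toy model
(hypothetical swap_map contents and simplified bounds checks, not
kernel code): one free slot at the cluster base used to curtail the
whole cluster's readaround, whereas scanning out from the target finds
all the contiguous allocated slots around it.

#include <stdio.h>

#define CLUSTER_SHIFT   3               /* page_cluster = 3: clusters of 8 slots */
#define MAP_SIZE        32              /* toy swap area: 32 slots */

static unsigned char swap_map[MAP_SIZE];        /* 0 = free, 1 = allocated */

/* Old behaviour: start at the cluster base, stop at the first free slot */
static int old_readaround(unsigned long target, unsigned long *offset)
{
        unsigned long toff = (target >> CLUSTER_SHIFT) << CLUSTER_SHIFT;
        int ret = 0, i = 1 << CLUSTER_SHIFT;

        if (!toff)                      /* slot 0 is the swap header */
                toff++, i--;
        *offset = toff;
        do {
                if (toff >= MAP_SIZE)
                        break;
                if (!swap_map[toff])
                        break;
                toff++;
                ret++;
        } while (--i);
        return ret;
}

/* New behaviour: scan forwards and backwards from the target itself */
static int new_readaround(unsigned long target, unsigned long *offset)
{
        unsigned long base = (target >> CLUSTER_SHIFT) << CLUSTER_SHIFT;
        unsigned long end = base + (1 << CLUSTER_SHIFT);
        unsigned long toff;
        int nr_pages = 0;

        if (!base)
                base++;
        if (end > MAP_SIZE)
                end = MAP_SIZE;
        for (toff = target; ++toff < end; nr_pages++)
                if (!swap_map[toff])
                        break;
        for (toff = target; --toff >= base; nr_pages++)
                if (!swap_map[toff])
                        break;
        *offset = ++toff;
        return nr_pages ? ++nr_pages : 0;
}

int main(void)
{
        unsigned long off, i;

        /* cluster 8..15: slot 8 free, 9..15 allocated, target is slot 12 */
        for (i = 9; i < 16; i++)
                swap_map[i] = 1;

        printf("old: %d pages from slot %lu\n", old_readaround(12, &off), off);
        printf("new: %d pages from slot %lu\n", new_readaround(12, &off), off);
        return 0;
}

On that map the old scan reports 0 pages of readaround from slot 8,
while the new scan reports 7 pages from slot 9, covering the target.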
* Re: [PATCH 7/7] swapin: fix valid_swaphandles defect
2007-10-06 20:48 ` [PATCH 7/7] swapin: fix valid_swaphandles defect Hugh Dickins
@ 2007-10-08 1:14 ` Rik van Riel
0 siblings, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2007-10-08 1:14 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-mm
On Sat, 6 Oct 2007 21:48:41 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Rik van Riel <riel@surriel.com>
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan