* VM subsystem bug in 2.4.0 ? @ 2001-01-08 8:46 Sergey E. Volkov

From: Sergey E. Volkov @ 2001-01-08 8:46 UTC
To: linux-kernel

Hi all!

I have a problem with 2.4.0.

I'm testing the Informix IIF-2000 database server running on a dual Intel
Pentium II - 233. When I run 'make -j30 bzImage' in the kernel source tree,
my Linux box hangs without any messages. This happens only while Informix is
running: when I stopped Informix and tried the same thing, everything passed
OK.

I think this is a bug in the kernel (VM subsystem) code. Informix allocates
about 50% of memory as LOCKED shared memory segments, and I suspect that is
the reason: the kernel wants to swap out the locked shm segments, but cannot.

Thank you.

Sergey E. Volkov
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
* Re: VM subsystem bug in 2.4.0 ?

From: Rik van Riel @ 2001-01-08 18:00 UTC
To: Sergey E. Volkov; +Cc: linux-kernel, Christoph Rohland, Linus Torvalds

On Mon, 8 Jan 2001, Sergey E. Volkov wrote:

> I have a problem with 2.4.0
>
> I'm testing Informix IIF-2000 database server running on dual
> Intel Pentium II - 233. When I run 'make -j30 bzImage' in the
> kernel source, my Linux box hangs without any messages.

> Informix allocate about to 50% of memory as LOCKED shared memory
> segments. I'm thinking the reason in this. Kernel wants, but
> can't to swap out locked shm's segments.

You are right. I have seen this bug before, with the kernel moving
unswappable pages from the active list to the inactive_dirty list and back.

We need a check in deactivate_page() to prevent the kernel from moving pages
from locked shared memory segments to the inactive_dirty list.

Christoph? Linus?

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://www.conectiva.com/
http://distro.conectiva.com.br/
* Re: VM subsystem bug in 2.4.0 ?

From: Linus Torvalds @ 2001-01-08 18:10 UTC
To: Rik van Riel; +Cc: Sergey E. Volkov, linux-kernel, Christoph Rohland

On Mon, 8 Jan 2001, Rik van Riel wrote:
>
> We need a check in deactivate_page() to prevent the kernel
> from moving pages from locked shared memory segments to the
> inactive_dirty list.
>
> Christoph? Linus?

The only solution I see is something like an "active_immobile" list, adding
entries to that list whenever "writepage()" returns 1, instead of just moving
them to the active list.

Seems to be a simple enough change. The main worry would be getting the pages
_off_ such a list: anything that unlocks a shared memory segment (can you
even do that? If the only way to unlock is to remove, we have no problem)
would need a special function to move all pages from the immobile list back
to the active list (and then they'd get moved back again if they were for
another segment that is still locked).

		Linus
* Re: VM subsystem bug in 2.4.0 ?

From: Rik van Riel @ 2001-01-08 18:30 UTC
To: Linus Torvalds; +Cc: Sergey E. Volkov, linux-kernel, Christoph Rohland

On Mon, 8 Jan 2001, Linus Torvalds wrote:
> On Mon, 8 Jan 2001, Rik van Riel wrote:
> >
> > We need a check in deactivate_page() to prevent the kernel
> > from moving pages from locked shared memory segments to the
> > inactive_dirty list.
> >
> > Christoph? Linus?
>
> The only solution I see is something like a "active_immobile"
> list, and add entries to that list whenever "writepage()"
> returns 1 - instead of just moving them to the active list.
>
> Seems to be a simple enough change. The main worry would be
> getting the pages _off_ such a list:

Just marking them with a special "do not deactivate me" bit seems to work
well enough. When this special bit is set, we simply move the page to the
back of the active list instead of deactivating it.

And when the bit changes again, the page can be evicted from memory just
fine. In the meantime, the locked pages will also have undergone normal page
aging, so at unlock time we know whether to swap the page out or not.

I agree that this scheme has a higher overhead than your idea, but it also
seems to be a bit more flexible and simple. Alternatively, we could just scan
the wired_list once a minute and move the unwired pages to the active list.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://www.conectiva.com/
http://distro.conectiva.com.br/
* Re: VM subsystem bug in 2.4.0 ?

From: Christoph Rohland @ 2001-01-09 7:52 UTC
To: Rik van Riel; +Cc: Linus Torvalds, Sergey E. Volkov, linux-kernel

Hi Rik,

On Mon, 8 Jan 2001, Rik van Riel wrote:
> And when the bit changes again, the page can be evicted
> from memory just fine. In the mean time, the locked pages
> will also have undergone normal page aging and at unlock
> time we know whether to swap out the page or not.
>
> I agree that this scheme has a higher overhead than your
> idea, but it also seems to be a bit more flexible and
> simple. Alternatively, we could just scan the wired_list
> once a minute and move the unwired pages to the active
> list.

At IPC_UNLOCK time there is no reference available to the pages locked by
this segment. We could perhaps move the whole locked list to the active list
whenever we unlock any segment.

Second point: how do we handle running out of swap? I do not think we should
lock these pages, but rather keep them in the active list.

Greetings
		Christoph
* Re: VM subsystem bug in 2.4.0 ?

From: Stephen C. Tweedie @ 2001-01-09 14:09 UTC
To: Rik van Riel
Cc: Linus Torvalds, Sergey E. Volkov, linux-kernel, Christoph Rohland, Stephen Tweedie

Hi,

On Mon, Jan 08, 2001 at 04:30:10PM -0200, Rik van Riel wrote:
> On Mon, 8 Jan 2001, Linus Torvalds wrote:
> >
> > The only solution I see is something like a "active_immobile"
> > list, and add entries to that list whenever "writepage()"
> > returns 1 - instead of just moving them to the active list.
>
> Just marking them with a special "do not deactivate me"
> bit seems to work fine enough. When this special bit is
> set, we simply move the page to the back of the active
> list instead of deactivating.

But again, how do you clear the bit? Locking is a per-vma property, not
per-page. I can mmap a file twice and mlock just one of the mappings. If you
get a munlock(), how are you to know how many other locked mappings still
exist?

--Stephen
* Re: VM subsystem bug in 2.4.0 ?

From: Christoph Rohland @ 2001-01-09 14:53 UTC
To: Stephen C. Tweedie
Cc: Rik van Riel, Linus Torvalds, Sergey E. Volkov, linux-kernel

Hi Stephen,

On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> But again, how do you clear the bit? Locking is a per-vma property,
> not per-page. I can mmap a file twice and mlock just one of the
> mappings. If you get a munlock(), how are you to know how many
> other locked mappings still exist?

It's worse: the issue we are talking about is SYSV IPC_LOCK. This is a
per-segment thing. A user can (un)lock a segment at any time, but we have no
references to the vmas attached to the segments, or to the pages allocated.

Greetings
		Christoph
* Re: VM subsystem bug in 2.4.0 ?

From: Stephen C. Tweedie @ 2001-01-09 15:31 UTC
To: Christoph Rohland
Cc: Stephen C. Tweedie, Rik van Riel, Linus Torvalds, Sergey E. Volkov, linux-kernel

Hi,

On Tue, Jan 09, 2001 at 03:53:55PM +0100, Christoph Rohland wrote:
>
> On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> > But again, how do you clear the bit? Locking is a per-vma property,
> > not per-page. I can mmap a file twice and mlock just one of the
> > mappings. If you get a munlock(), how are you to know how many
> > other locked mappings still exist?
>
> It's worse: The issue we are talking about is SYSV IPC_LOCK.

The issue is locked VA pages. SysV is just one of the ways in which it can
happen: the solution has got to address both that and mlock()/mlockall().

> This is a
> per segment thing. A user can (un)lock a segment at any time. But we
> do not have the references to the vmas attached to the segemnts

Why not? Won't the address space mmap* lists give you this?

--Stephen
* Re: VM subsystem bug in 2.4.0 ?

From: Christoph Rohland @ 2001-01-09 15:45 UTC
To: Stephen C. Tweedie
Cc: Rik van Riel, Linus Torvalds, Sergey E. Volkov, linux-kernel

Hi Stephen,

On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> On Tue, Jan 09, 2001 at 03:53:55PM +0100, Christoph Rohland wrote:
>> It's worse: The issue we are talking about is SYSV IPC_LOCK.
>
> The issue is locked VA pages. SysV is just one of the ways in which
> it can happen: the solution has got to address both that and
> mlock()/mlockall().

AFAIU mlock'ed pages would never get deactivated, since the ptes do not get
dropped.

>> This is a per segment thing. A user can (un)lock a segment at any
>> time. But we do not have the references to the vmas attached to the
>> segemnts
>
> Why not? Won't the address space mmap* lists give you this?

OK. We could go from shmid_kernel->file->dentry->inode->mapping. We would
have to scan all mappings for pages in the page tables and in the page
cache. Doesn't look really nice :-(

Greetings
		Christoph
* Re: VM subsystem bug in 2.4.0 ?

From: Stephen C. Tweedie @ 2001-01-09 16:05 UTC
To: Christoph Rohland
Cc: Stephen C. Tweedie, Rik van Riel, Linus Torvalds, Sergey E. Volkov, linux-kernel

Hi,

On Tue, Jan 09, 2001 at 04:45:10PM +0100, Christoph Rohland wrote:
>
> AFAIU mlock'ed pages would never get deactivated since the ptes do not
> get dropped.

D'oh, right --- so can't you lock a segment just by bumping page_count on
its pages?

--Stephen
* Re: VM subsystem bug in 2.4.0 ?

From: Christoph Rohland @ 2001-01-09 16:17 UTC
To: Stephen C. Tweedie
Cc: Rik van Riel, Linus Torvalds, Sergey E. Volkov, linux-kernel

Hi Stephen,

On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> D'oh, right --- so can't you lock a segment just by bumping
> page_count on its pages?

Looks like a good idea.

Oh, and my last posting was partly bogus: I can directly get the pages with
page cache lookups on the file.

Greetings
		Christoph
* Re: VM subsystem bug in 2.4.0 ?

From: Linus Torvalds @ 2001-01-09 18:37 UTC
To: Christoph Rohland
Cc: Stephen C. Tweedie, Rik van Riel, Sergey E. Volkov, linux-kernel

On 9 Jan 2001, Christoph Rohland wrote:
> Hi Stephen,
>
> On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> > D'oh, right --- so can't you lock a segment just by bumping
> > page_count on its pages?
>
> Looks like a good idea.
>
> Oh, and my last posting was partly bogus: I can directly get the pages
> with page cache lookups on the file.

Even more appropriately, you have the inode->i_mapping lists that you can
use directly (no need to do lookups, just walk the list).

Note that bumping the counts is _NOT_ as easy as you seem to think. The
problem: vmtruncate() and friends. It's much easier to just have a flag that
gets cleared on truncate.

		Linus
* Re: VM subsystem bug in 2.4.0 ?

From: Daniel Phillips @ 2001-01-09 16:45 UTC
To: Stephen C. Tweedie, Christoph Rohland, linux-kernel

"Stephen C. Tweedie" wrote:
> On Tue, Jan 09, 2001 at 04:45:10PM +0100, Christoph Rohland wrote:
> >
> > AFAIU mlock'ed pages would never get deactivated since the ptes do not
> > get dropped.
>
> D'oh, right --- so can't you lock a segment just by bumping page_count
> on its pages?

Putting this together with an idea from Linus:

Linus Torvalds wrote:
> On Mon, 8 Jan 2001, Rik van Riel wrote:
> >
> > We need a check in deactivate_page() to prevent the kernel
> > from moving pages from locked shared memory segments to the
> > inactive_dirty list.
> >
> > Christoph? Linus?
>
> The only solution I see is something like a "active_immobile" list, and
> add entries to that list whenever "writepage()" returns 1 - instead of
> just moving them to the active list.

Call it 'pinned'... the pinned list would have pages with use count = 2 or
more. A page gets off the pinned list when its use count goes to 1 in
put_page.

--
Daniel
* Re: VM subsystem bug in 2.4.0 ?

From: Rik van Riel @ 2001-01-17 8:33 UTC
To: Daniel Phillips; +Cc: Stephen C. Tweedie, Christoph Rohland, linux-kernel

On Tue, 9 Jan 2001, Daniel Phillips wrote:

> Call it 'pinned'... the pinned list would have pages with use
> count = 2 or more. A page gets off the pinned list when its use
> count goes to 1 in put_page.

I don't even want to start thinking about how this would screw up the
(already fragile) page aging balance...

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://www.conectiva.com/
http://distro.conectiva.com.br/
* Re: VM subsystem bug in 2.4.0 ?

From: Christoph Rohland @ 2001-01-18 8:23 UTC
To: Rik van Riel; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel

Hi Rik,

On Wed, 17 Jan 2001, Rik van Riel wrote:
> I don't even want to start thinking about how this would
> screw up the (already fragile) page aging balance...

As of 2.4.1-pre we pin the pages by increasing the page count for locked
segments. No special list needed.

Greetings
		Christoph
* Re: VM subsystem bug in 2.4.0 ?

From: Daniel Phillips @ 2001-01-25 22:47 UTC
To: Christoph Rohland, linux-kernel

Christoph Rohland wrote:
> As of 2.4.1-pre we pin the pages by increasing the page count for
> locked segments. No special list needed.

Sure, no special list is needed. But without a special list to park those
pages on, they will just circulate on the active/inactive lists, wasting CPU
cycles and trashing cache. A special list would be an improvement, but it is
no burning issue.

--
Daniel
* Re: VM subsystem bug in 2.4.0 ?

From: Linus Torvalds @ 2001-01-09 18:36 UTC
To: Stephen C. Tweedie
Cc: Christoph Rohland, Rik van Riel, Sergey E. Volkov, linux-kernel

On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> >
> > It's worse: The issue we are talking about is SYSV IPC_LOCK.
>
> The issue is locked VA pages. SysV is just one of the ways in which
> it can happen: the solution has got to address both that and
> mlock()/mlockall().

No, mlock() and mlockall() already work. Exactly because mlocked pages will
never be removed from the VM, the VM layer knows how to deal with them (or
"not deal with them", as the case is more properly stated). They won't ever
get on the inactive list, and because refill_inactive_scan() won't be able
to handle them (the count is elevated by the VM mappings), the VM will
correctly and gracefully fall back on scanning the page tables to find some
_other_ pages.

So mlock works fine. The reason shm locked segments do _not_ work fine is
exactly that they are not locked down in the VM, and for that reason they
can end up being detached from everything and thus moved to the inactive
list. That counts as "progress" as far as the VM is concerned, so we get
into this endless loop where we move the page to the inactive list, then try
to write it out, fail, and move it back to the active list again. The VM
_thinks_ it is making progress, but obviously isn't. End result: lockup.

Marking the pages with a magic flag would solve it. Then we could just make
refill_inactive_scan() ignore such pages. Something like "PG_Reallydirty".

		Linus
* Re: VM subsystem bug in 2.4.0 ?

From: Linus Torvalds @ 2001-01-09 18:23 UTC
To: Stephen C. Tweedie
Cc: Rik van Riel, Sergey E. Volkov, linux-kernel, Christoph Rohland

On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
>
> But again, how do you clear the bit? Locking is a per-vma property,
> not per-page. I can mmap a file twice and mlock just one of the
> mappings. If you get a munlock(), how are you to know how many other
> locked mappings still exist?

Note that this would be solved very cleanly if the SHM code would use the
"VM_LOCKED" flag, and actually lock the pages in the VM, instead of trying
to lock them down for writepage().

That would mean that such a segment would still get swapped out when it is
not mapped anywhere, but I wonder if that semantic difference really
matters.

If the vma is marked VM_LOCKED, the VM subsystem will do the right thing
(the page will never get removed from the page tables, so it won't ever make
it into that back-and-forth bounce between the active and the inactive
lists).

		Linus
* Re: VM subsystem bug in 2.4.0 ?

From: Christoph Rohland @ 2001-01-09 22:20 UTC
To: Linus Torvalds
Cc: Stephen C. Tweedie, Rik van Riel, Sergey E. Volkov, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> >
> > But again, how do you clear the bit? Locking is a per-vma property,
> > not per-page. I can mmap a file twice and mlock just one of the
> > mappings. If you get a munlock(), how are you to know how many other
> > locked mappings still exist?
>
> Note that this would be solved very cleanly if the SHM code would use the
> "VM_LOCKED" flag, and actually lock the pages in the VM, instead of trying
> to lock them down for writepage().

here comes the patch. (lightly tested)

Greetings
		Christoph

diff -uNr 2.4.0/include/linux/shmem_fs.h c/include/linux/shmem_fs.h
--- 2.4.0/include/linux/shmem_fs.h	Tue Jan  2 21:58:11 2001
+++ c/include/linux/shmem_fs.h	Tue Jan  9 22:01:48 2001
@@ -22,7 +22,6 @@
 	swp_entry_t	i_direct[SHMEM_NR_DIRECT]; /* for the first blocks */
 	swp_entry_t   **i_indirect; /* doubly indirect blocks */
 	unsigned long	swapped;
-	int		locked;     /* into memory */
 	struct list_head	list;
 };
diff -uNr 2.4.0/ipc/shm.c c/ipc/shm.c
--- 2.4.0/ipc/shm.c	Tue Jan  2 21:58:11 2001
+++ c/ipc/shm.c	Tue Jan  9 22:39:18 2001
@@ -91,9 +91,10 @@
 	return ipc_addid(&shm_ids, &shp->shm_perm, shm_ctlmni+1);
 }
 
-
-
-static inline void shm_inc (int id) {
+/* This is called by fork, once for every shm attach. */
+static void shm_open (struct vm_area_struct *shmd)
+{
+	int id = shmd->vm_file->f_dentry->d_inode->i_ino;
 	struct shmid_kernel *shp;
 
 	if(!(shp = shm_lock(id)))
@@ -104,12 +105,6 @@
 	shm_unlock(id);
 }
 
-/* This is called by fork, once for every shm attach. */
-static void shm_open (struct vm_area_struct *shmd)
-{
-	shm_inc (shmd->vm_file->f_dentry->d_inode->i_ino);
-}
-
 /*
  * shm_destroy - free the struct shmid_kernel
  *
@@ -154,9 +149,20 @@
 
 static int shm_mmap(struct file * file, struct vm_area_struct * vma)
 {
-	UPDATE_ATIME(file->f_dentry->d_inode);
+	struct shmid_kernel *shp;
+	struct inode * inode = file->f_dentry->d_inode;
+
+	UPDATE_ATIME(inode);
 	vma->vm_ops = &shm_vm_ops;
-	shm_inc(file->f_dentry->d_inode->i_ino);
+
+	if(!(shp = shm_lock(inode->i_ino)))
+		BUG();
+	shp->shm_atim = CURRENT_TIME;
+	shp->shm_lprid = current->pid;
+	shp->shm_nattch++;
+	if (shp->shm_flags & SHM_LOCKED)
+		vma->vm_flags |= VM_LOCKED;
+	shm_unlock(inode->i_ino);
 	return 0;
 }
 
@@ -365,6 +371,29 @@
 	}
 }
 
+static void shm_lockseg (struct shmid_kernel * shp, int cmd)
+{
+	struct address_space *mapping = shp->shm_file->f_dentry->d_inode->i_mapping;
+	struct vm_area_struct *mpnt;
+
+	spin_lock(&mapping->i_shared_lock);
+	if(cmd==SHM_LOCK) {
+		shp->shm_flags |= SHM_LOCKED;
+		for (mpnt = mapping->i_mmap; mpnt; mpnt = mpnt->vm_next_share)
+			mpnt->vm_flags |= VM_LOCKED;
+		for (mpnt = mapping->i_mmap_shared; mpnt; mpnt = mpnt->vm_next_share)
+			mpnt->vm_flags |= VM_LOCKED;
+	} else {
+		shp->shm_flags &= ~SHM_LOCKED;
+		for (mpnt = mapping->i_mmap; mpnt; mpnt = mpnt->vm_next_share)
+			mpnt->vm_flags &= ~VM_LOCKED;
+		for (mpnt = mapping->i_mmap_shared; mpnt; mpnt = mpnt->vm_next_share)
+			mpnt->vm_flags &= ~VM_LOCKED;
+	}
+	spin_unlock(&mapping->i_shared_lock);
+
+}
+
 asmlinkage long sys_shmctl (int shmid, int cmd, struct shmid_ds *buf)
 {
 	struct shm_setbuf setbuf;
@@ -466,13 +495,7 @@
 	err = shm_checkid(shp,shmid);
 	if(err)
 		goto out_unlock;
-	if(cmd==SHM_LOCK) {
-		shp->shm_file->f_dentry->d_inode->u.shmem_i.locked = 1;
-		shp->shm_flags |= SHM_LOCKED;
-	} else {
-		shp->shm_file->f_dentry->d_inode->u.shmem_i.locked = 0;
-		shp->shm_flags &= ~SHM_LOCKED;
-	}
+	shm_lockseg(shp, cmd);
 	shm_unlock(shmid);
 	return err;
 }
diff -uNr 2.4.0/mm/shmem.c c/mm/shmem.c
--- 2.4.0/mm/shmem.c	Tue Jan  2 21:58:11 2001
+++ c/mm/shmem.c	Tue Jan  9 22:02:18 2001
@@ -201,8 +201,6 @@
 	swp_entry_t *entry, swap;
 
 	info = &page->mapping->host->u.shmem_i;
-	if (info->locked)
-		return 1;
 	swap = __get_swap_page(2);
 	if (!swap.val)
 		return 1;
* Re: VM subsystem bug in 2.4.0 ?

From: Linus Torvalds @ 2001-01-09 22:59 UTC
To: Christoph Rohland
Cc: Stephen C. Tweedie, Rik van Riel, Sergey E. Volkov, linux-kernel

On 9 Jan 2001, Christoph Rohland wrote:
> Linus Torvalds <torvalds@transmeta.com> writes:
> >
> > Note that this would be solved very cleanly if the SHM code would use the
> > "VM_LOCKED" flag, and actually lock the pages in the VM, instead of trying
> > to lock them down for writepage().
>
> here comes the patch. (lightly tested)

I'd really like an opinion on whether this is truly legal or not. After all,
it does change the behaviour to mean "pages are locked only if they have
been mapped into virtual memory", which is not what it used to mean.

Arguably the new semantics are perfectly valid semantics on their own, but
I'm not sure they are acceptable.

In contrast, the PG_realdirty approach would give the old behaviour of truly
locked-down shm segments, with not significantly different complexity.

What do other UNIXes do for shm_lock()?

The Linux man-page explicitly states for SHM_LOCK that

	The user must fault in any pages that are required to be present
	after locking is enabled.

which kind of implies to me that the VM_LOCKED implementation is ok.

HOWEVER, looking at the HP-UX man-page, for example, certainly implies that
the PG_realdirty approach is the correct one.

The IRIX man-pages in contrast say

	Locking occurs per address space; multiple processes or sprocs
	mapping the area at different addresses each need to issue the
	lock (this is primarily an issue with the per-process page tables).

which again implies that they've done something akin to a VM_LOCKED
implementation.

Does anybody have any better pointers, ideas, or opinions?

		Linus
* Re: VM subsystem bug in 2.4.0 ? 2001-01-09 22:59 ` Linus Torvalds @ 2001-01-10 7:33 ` Christoph Rohland 2001-01-10 15:50 ` Tim Wright 1 sibling, 0 replies; 22+ messages in thread
From: Christoph Rohland @ 2001-01-10 7:33 UTC (permalink / raw)
To: Linus Torvalds
Cc: Stephen C. Tweedie, Rik van Riel, Sergey E. Volkov, linux-kernel

Hi Linus,

Linus Torvalds <torvalds@transmeta.com> writes:
> I'd really like an opinion on whether this is truly legal or not? After
> all, it does change the behaviour to mean "pages are locked only if they
> have been mapped into virtual memory". Which is not what it used to mean.
>
> Arguably the new semantics are perfectly valid semantics on their
> own, but I'm not sure they are acceptable.

I just checked SuS and they do not list SHM_LOCK as a command at all.

> In contrast, the PG_realdirty approach would give the old behaviour of
> truly locked-down shm segments, with not significantly different
> complexity behaviour.
>
> What do other UNIXes do for shm_lock()?
>
> The Linux man-page explicitly states for SHM_LOCK that
>
>     The user must fault in any pages that are required to be present
>     after locking is enabled.
>
> which kind of implies to me that the VM_LOCKED implementation is ok.

Yes.

> HOWEVER, looking at the HP-UX man-page, for example, certainly implies
> that the PG_realdirty approach is the correct one.

Yes.

> The IRIX man-pages in contrast say
>
>     Locking occurs per address space;
>     multiple processes or sprocs mapping the area at different
>     addresses each need to issue the lock (this is primarily an
>     issue with the per-process page tables).
>
> which again implies that they've done something akin to a VM_LOCKED
> implementation.

So Irix does something quite different. For Irix SHM_LOCK is a special
version of mlock...

> Does anybody have any better pointers, ideas, or opinions?

I think the VM_LOCKED approach is the best:

- SuS does not specify anything, and the different vendors do different
  things. So people using SHM_LOCK have to be aware that the details
  differ.

- Technically this is the fastest approach for attached segments: We do
  not scan the relevant vmas at all and by doing so we keep the
  overhead lowest. And I do not see a reason to use SHM_LOCK besides
  performance.

BTW I also have a patch appended which bumps the page count. Works
also, is also small, but we will have a higher soft fault rate with
that.

Greetings
        Christoph

diff -uNr 2.4.0/ipc/shm.c c/ipc/shm.c
--- 2.4.0/ipc/shm.c	Mon Jan  8 11:24:39 2001
+++ c/ipc/shm.c	Tue Jan  9 17:48:55 2001
@@ -121,6 +121,7 @@
 {
 	shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	shm_rmid (shp->id);
+	shmem_lock(shp->shm_file, 0);
 	fput (shp->shm_file);
 	kfree (shp);
 }
@@ -467,10 +468,10 @@
 	if(err)
 		goto out_unlock;
 	if(cmd==SHM_LOCK) {
-		shp->shm_file->f_dentry->d_inode->u.shmem_i.locked = 1;
+		shmem_lock(shp->shm_file, 1);
 		shp->shm_flags |= SHM_LOCKED;
 	} else {
-		shp->shm_file->f_dentry->d_inode->u.shmem_i.locked = 0;
+		shmem_lock(shp->shm_file, 0);
 		shp->shm_flags &= ~SHM_LOCKED;
 	}
 	shm_unlock(shmid);
diff -uNr 2.4.0/mm/shmem.c c/mm/shmem.c
--- 2.4.0/mm/shmem.c	Mon Jan  8 11:24:39 2001
+++ c/mm/shmem.c	Tue Jan  9 18:04:16 2001
@@ -310,6 +310,8 @@
 	}
 	/* We have the page */
 	SetPageUptodate (page);
+	if (info->locked)
+		page_cache_get(page);
 
 cached_page:
 	UnlockPage (page);
@@ -399,6 +401,32 @@
 	spin_unlock (&sb->u.shmem_sb.stat_lock);
 	buf->f_namelen = 255;
 	return 0;
+}
+
+void shmem_lock(struct file * file, int lock)
+{
+	struct inode * inode = file->f_dentry->d_inode;
+	struct shmem_inode_info * info = &inode->u.shmem_i;
+	struct page * page;
+	unsigned long idx, size;
+
+	if (info->locked == lock)
+		return;
+	down(&inode->i_sem);
+	info->locked = lock;
+	size = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	for (idx = 0; idx < size; idx++) {
+		page = find_lock_page(inode->i_mapping, idx);
+		if (!page)
+			continue;
+		if (!lock) {
+			/* release the extra count and our reference */
+			page_cache_release(page);
+			page_cache_release(page);
+		}
+		UnlockPage(page);
+	}
+	up(&inode->i_sem);
 }
 
 /*
* Re: VM subsystem bug in 2.4.0 ? 2001-01-09 22:59 ` Linus Torvalds 2001-01-10 7:33 ` Christoph Rohland @ 2001-01-10 15:50 ` Tim Wright 1 sibling, 0 replies; 22+ messages in thread
From: Tim Wright @ 2001-01-10 15:50 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Rohland, Stephen C. Tweedie, Rik van Riel, Sergey E. Volkov, linux-kernel

Hi Linus,

On Tue, Jan 09, 2001 at 02:59:07PM -0800, Linus Torvalds wrote:
>
> Arguably the new semantics are perfectly valid semantics on their own, but
> I'm not sure they are acceptable.
>
> In contrast, the PG_realdirty approach would give the old behaviour of
> truly locked-down shm segments, with not significantly different
> complexity behaviour.
>
> What do other UNIXes do for shm_lock()?
>

It appears that the fine-detail semantics vary across the board.
DYNIX/ptx supports two forms of SysV shm locking: soft and hard.
Soft-locking (the default) merely makes the pages sticky, so if you
fault them in, they stay in your resident set, but don't count against
it. If, however, the process swaps, they're all evicted, and when the
process is swapped back in, you get to fault them back in all over
again. Hard-locking pins the segment into physical memory until such
time as it's destroyed. It stays there even if there are currently no
attaches. Again, such pages are not counted against the process RSS.
SVR4 only supports one form. It faults all the pages in and locks them
into memory, but doesn't treat them specially wrt rss/paging, which
seems none too clever - if they're locked into memory, you might as
well use them :-)

[Details of the differing approaches omitted]

>
> Does anybody have any better pointers, ideas, or opinions?
>
> Linus
>

I don't know if there are any arguments in favour of making both
approaches available. Gut feel says that's overkill. We ended up with
two by historical accident. The soft-locking was always there (although
semantically different to SVR4), and the hard-locking stuff was added
to boost performance with a certain six-letter RDBMS that attaches an
SGA to each process. They all get to attach it "for free", and since it
doesn't count towards the RSS, it allowed tuning a fairly small RSS
across the system without having the RDBMS processes spend all their
time (soft) faulting SGA pages in and out of their RSS.

Tim

--
Tim Wright - timw@splhi.com or timw@aracnet.com or twright@us.ibm.com
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
end of thread, other threads:[~2001-01-25 22:50 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-01-08  8:46 VM subsystem bug in 2.4.0 ? Sergey E. Volkov
2001-01-08 18:00 ` Rik van Riel
2001-01-08 18:10   ` Linus Torvalds
2001-01-08 18:30     ` Rik van Riel
2001-01-09  7:52       ` Christoph Rohland
2001-01-09 14:09         ` Stephen C. Tweedie
2001-01-09 14:53           ` Christoph Rohland
2001-01-09 15:31             ` Stephen C. Tweedie
2001-01-09 15:45               ` Christoph Rohland
2001-01-09 16:05                 ` Stephen C. Tweedie
2001-01-09 16:17                   ` Christoph Rohland
2001-01-09 18:37                     ` Linus Torvalds
2001-01-09 16:45                   ` Daniel Phillips
2001-01-17  8:33                     ` Rik van Riel
2001-01-18  8:23                       ` Christoph Rohland
2001-01-25 22:47                         ` Daniel Phillips
2001-01-09 18:36                   ` Linus Torvalds
2001-01-09 18:23               ` Linus Torvalds
2001-01-09 22:20                 ` Christoph Rohland
2001-01-09 22:59                   ` Linus Torvalds
2001-01-10  7:33                     ` Christoph Rohland
2001-01-10 15:50                     ` Tim Wright