* VM subsystem bug in 2.4.0 ? @ 2001-01-08 8:46 Sergey E. Volkov

From: Sergey E. Volkov @ 2001-01-08 8:46 UTC
To: linux-kernel

Hi all!

I have a problem with 2.4.0.

I'm testing the Informix IIF-2000 database server running on a dual Intel
Pentium II - 233. When I run 'make -j30 bzImage' in the kernel source tree,
my Linux box hangs without any messages. This happens only while Informix is
running: when I stopped Informix and tried the same thing, everything passed
OK.

I think this is a bug in the kernel (VM subsystem) code. Informix allocates
about 50% of memory as LOCKED shared memory segments, and I suspect that is
the reason: the kernel wants to swap out the locked shm segments, but cannot.

Thank you.

Sergey E. Volkov
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
* Re: VM subsystem bug in 2.4.0 ?

From: Rik van Riel @ 2001-01-08 18:00 UTC
To: Sergey E. Volkov; +Cc: linux-kernel, Christoph Rohland, Linus Torvalds

On Mon, 8 Jan 2001, Sergey E. Volkov wrote:

> I have a problem with 2.4.0
>
> I'm testing Informix IIF-2000 database server running on dual
> Intel Pentium II - 233. When I run 'make -j30 bzImage' in the
> kernel source, my Linux box hangs without any messages.

> Informix allocate about to 50% of memory as LOCKED shared memory
> segments. I'm thinking the reason in this. Kernel wants, but
> can't to swap out locked shm's segments.

You are right. I have seen this bug before, with the kernel moving
unswappable pages from the active list to the inactive_dirty list and back.

We need a check in deactivate_page() to prevent the kernel from moving pages
from locked shared memory segments to the inactive_dirty list.

Christoph? Linus?

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://www.conectiva.com/
http://distro.conectiva.com.br/
* Re: VM subsystem bug in 2.4.0 ?

From: Linus Torvalds @ 2001-01-08 18:10 UTC
To: Rik van Riel; +Cc: Sergey E. Volkov, linux-kernel, Christoph Rohland

On Mon, 8 Jan 2001, Rik van Riel wrote:
>
> We need a check in deactivate_page() to prevent the kernel
> from moving pages from locked shared memory segments to the
> inactive_dirty list.
>
> Christoph? Linus?

The only solution I see is something like an "active_immobile" list, adding
entries to that list whenever "writepage()" returns 1, instead of just moving
them to the active list.

Seems to be a simple enough change. The main worry would be getting the pages
_off_ such a list: anything that unlocks a shared memory segment (can you
even do that? If the only way to unlock is to remove, we have no problem)
would need a special function to move all pages from the immobile list back
to the active list (and then they'd get moved back again if they were for
another segment that is still locked).

		Linus
* Re: VM subsystem bug in 2.4.0 ?

From: Rik van Riel @ 2001-01-08 18:30 UTC
To: Linus Torvalds; +Cc: Sergey E. Volkov, linux-kernel, Christoph Rohland

On Mon, 8 Jan 2001, Linus Torvalds wrote:
> On Mon, 8 Jan 2001, Rik van Riel wrote:
> >
> > We need a check in deactivate_page() to prevent the kernel
> > from moving pages from locked shared memory segments to the
> > inactive_dirty list.
> >
> > Christoph? Linus?
>
> The only solution I see is something like a "active_immobile"
> list, and add entries to that list whenever "writepage()"
> returns 1 - instead of just moving them to the active list.
>
> Seems to be a simple enough change. The main worry would be
> getting the pages _off_ such a list:

Just marking them with a special "do not deactivate me" bit seems to work
well enough. When this special bit is set, we simply move the page to the
back of the active list instead of deactivating it.

And when the bit changes again, the page can be evicted from memory just
fine. In the meantime, the locked pages will also have undergone normal page
aging, so at unlock time we know whether to swap the page out or not.

I agree that this scheme has a higher overhead than your idea, but it also
seems to be a bit more flexible and simple. Alternatively, we could just scan
the wired_list once a minute and move the unwired pages to the active list.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://www.conectiva.com/
http://distro.conectiva.com.br/
* Re: VM subsystem bug in 2.4.0 ?

From: Christoph Rohland @ 2001-01-09 7:52 UTC
To: Rik van Riel; +Cc: Linus Torvalds, Sergey E. Volkov, linux-kernel

Hi Rik,

On Mon, 8 Jan 2001, Rik van Riel wrote:
> And when the bit changes again, the page can be evicted
> from memory just fine. In the mean time, the locked pages
> will also have undergone normal page aging and at unlock
> time we know whether to swap out the page or not.
>
> I agree that this scheme has a higher overhead than your
> idea, but it also seems to be a bit more flexible and
> simple. Alternatively, we could just scan the wired_list
> once a minute and move the unwired pages to the active
> list.

At IPC_UNLOCK time there is no reference available to the pages locked by
this segment. We could perhaps move the whole locked list to the active list
whenever we unlock any segment.

Second point: how do we handle running out of swap? I do not think we should
lock these pages, but rather keep them in the active list.

Greetings
		Christoph
* Re: VM subsystem bug in 2.4.0 ?

From: Stephen C. Tweedie @ 2001-01-09 14:09 UTC
To: Rik van Riel
Cc: Linus Torvalds, Sergey E. Volkov, linux-kernel, Christoph Rohland, Stephen Tweedie

Hi,

On Mon, Jan 08, 2001 at 04:30:10PM -0200, Rik van Riel wrote:
> On Mon, 8 Jan 2001, Linus Torvalds wrote:
> >
> > The only solution I see is something like a "active_immobile"
> > list, and add entries to that list whenever "writepage()"
> > returns 1 - instead of just moving them to the active list.
>
> Just marking them with a special "do not deactivate me"
> bit seems to work fine enough. When this special bit is
> set, we simply move the page to the back of the active
> list instead of deactivating.

But again, how do you clear the bit? Locking is a per-vma property, not
per-page. I can mmap a file twice and mlock just one of the mappings. If you
get a munlock(), how are you to know how many other locked mappings still
exist?

--Stephen
* Re: VM subsystem bug in 2.4.0 ?

From: Christoph Rohland @ 2001-01-09 14:53 UTC
To: Stephen C. Tweedie
Cc: Rik van Riel, Linus Torvalds, Sergey E. Volkov, linux-kernel

Hi Stephen,

On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> But again, how do you clear the bit? Locking is a per-vma property,
> not per-page. I can mmap a file twice and mlock just one of the
> mappings. If you get a munlock(), how are you to know how many
> other locked mappings still exist?

It's worse: the issue we are talking about is SYSV IPC_LOCK. This is a
per-segment thing. A user can (un)lock a segment at any time, but we have no
references to the vmas attached to the segments, or to the pages allocated.

Greetings
		Christoph
* Re: VM subsystem bug in 2.4.0 ?

From: Stephen C. Tweedie @ 2001-01-09 15:31 UTC
To: Christoph Rohland
Cc: Stephen C. Tweedie, Rik van Riel, Linus Torvalds, Sergey E. Volkov, linux-kernel

Hi,

On Tue, Jan 09, 2001 at 03:53:55PM +0100, Christoph Rohland wrote:
>
> On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> > But again, how do you clear the bit? Locking is a per-vma property,
> > not per-page. I can mmap a file twice and mlock just one of the
> > mappings. If you get a munlock(), how are you to know how many
> > other locked mappings still exist?
>
> It's worse: The issue we are talking about is SYSV IPC_LOCK.

The issue is locked VA pages. SysV is just one of the ways in which it can
happen: the solution has got to address both that and mlock()/mlockall().

> This is a
> per segment thing. A user can (un)lock a segment at any time. But we
> do not have the references to the vmas attached to the segemnts

Why not? Won't the address space mmap* lists give you this?

--Stephen
* Re: VM subsystem bug in 2.4.0 ?

From: Christoph Rohland @ 2001-01-09 15:45 UTC
To: Stephen C. Tweedie
Cc: Rik van Riel, Linus Torvalds, Sergey E. Volkov, linux-kernel

Hi Stephen,

On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> On Tue, Jan 09, 2001 at 03:53:55PM +0100, Christoph Rohland wrote:
>> It's worse: The issue we are talking about is SYSV IPC_LOCK.
>
> The issue is locked VA pages. SysV is just one of the ways in which
> it can happen: the solution has got to address both that and
> mlock()/mlockall().

AFAIU mlock'ed pages would never get deactivated, since the ptes do not get
dropped.

>> This is a per segment thing. A user can (un)lock a segment at any
>> time. But we do not have the references to the vmas attached to the
>> segemnts
>
> Why not? Won't the address space mmap* lists give you this?

OK. We could go from shmid_kernel->file->dentry->inode->mapping. We would
have to scan all mappings for pages in the page tables and in the page
cache. Doesn't look really nice :-(

Greetings
		Christoph
* Re: VM subsystem bug in 2.4.0 ?

From: Stephen C. Tweedie @ 2001-01-09 16:05 UTC
To: Christoph Rohland
Cc: Stephen C. Tweedie, Rik van Riel, Linus Torvalds, Sergey E. Volkov, linux-kernel

Hi,

On Tue, Jan 09, 2001 at 04:45:10PM +0100, Christoph Rohland wrote:
>
> AFAIU mlock'ed pages would never get deactivated since the ptes do not
> get dropped.

D'oh, right --- so can't you lock a segment just by bumping page_count on
its pages?

--Stephen
* Re: VM subsystem bug in 2.4.0 ?

From: Christoph Rohland @ 2001-01-09 16:17 UTC
To: Stephen C. Tweedie
Cc: Rik van Riel, Linus Torvalds, Sergey E. Volkov, linux-kernel

Hi Stephen,

On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> D'oh, right --- so can't you lock a segment just by bumping
> page_count on its pages?

Looks like a good idea.

Oh, and my last posting was partly bogus: I can directly get the pages with
page cache lookups on the file.

Greetings
		Christoph
* Re: VM subsystem bug in 2.4.0 ?

From: Linus Torvalds @ 2001-01-09 18:37 UTC
To: Christoph Rohland
Cc: Stephen C. Tweedie, Rik van Riel, Sergey E. Volkov, linux-kernel

On 9 Jan 2001, Christoph Rohland wrote:
> Hi Stephen,
>
> On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> > D'oh, right --- so can't you lock a segment just by bumping
> > page_count on its pages?
>
> Looks like a good idea.
>
> Oh, and my last posting was partly bogus: I can directly get the pages
> with page cache lookups on the file.

Even more appropriately, you have the inode->i_mapping lists that you can
use directly (no need to do lookups, just walk the list).

Note that bumping the counts is _NOT_ as easy as you seem to think. The
problem: vmtruncate() and friends. It's much easier to just have a flag that
gets cleared on truncate.

		Linus
* Re: VM subsystem bug in 2.4.0 ?

From: Daniel Phillips @ 2001-01-09 16:45 UTC
To: Stephen C. Tweedie, Christoph Rohland, linux-kernel

"Stephen C. Tweedie" wrote:
> On Tue, Jan 09, 2001 at 04:45:10PM +0100, Christoph Rohland wrote:
> >
> > AFAIU mlock'ed pages would never get deactivated since the ptes do not
> > get dropped.
>
> D'oh, right --- so can't you lock a segment just by bumping page_count
> on its pages?

Putting this together with an idea from Linus:

Linus Torvalds wrote:
> On Mon, 8 Jan 2001, Rik van Riel wrote:
> >
> > We need a check in deactivate_page() to prevent the kernel
> > from moving pages from locked shared memory segments to the
> > inactive_dirty list.
> >
> > Christoph? Linus?
>
> The only solution I see is something like a "active_immobile" list, and
> add entries to that list whenever "writepage()" returns 1 - instead of
> just moving them to the active list.

Call it 'pinned'... the pinned list would have pages with use count = 2 or
more. A page gets off the pinned list when its use count goes to 1 in
put_page.

--
Daniel
* Re: VM subsystem bug in 2.4.0 ?

From: Rik van Riel @ 2001-01-17 8:33 UTC
To: Daniel Phillips; +Cc: Stephen C. Tweedie, Christoph Rohland, linux-kernel

On Tue, 9 Jan 2001, Daniel Phillips wrote:

> Call it 'pinned'... the pinned list would have pages with use
> count = 2 or more. A page gets off the pinned list when its use
> count goes to 1 in put_page.

I don't even want to start thinking about how this would screw up the
(already fragile) page aging balance...

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://www.conectiva.com/
http://distro.conectiva.com.br/
* Re: VM subsystem bug in 2.4.0 ?

From: Christoph Rohland @ 2001-01-18 8:23 UTC
To: Rik van Riel; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel

Hi Rik,

On Wed, 17 Jan 2001, Rik van Riel wrote:
> I don't even want to start thinking about how this would
> screw up the (already fragile) page aging balance...

As of 2.4.1-pre we pin the pages by increasing the page count for locked
segments. No special list needed.

Greetings
		Christoph
* Re: VM subsystem bug in 2.4.0 ?

From: Daniel Phillips @ 2001-01-25 22:47 UTC
To: Christoph Rohland, linux-kernel

Christoph Rohland wrote:
> As of 2.4.1-pre we pin the pages by increasing the page count for
> locked segments. No special list needed.

Sure, no special list is needed. But without a special list to park those
pages on, they will just circulate on the active/inactive lists, wasting CPU
cycles and trashing cache. A special list would be an improvement, but it is
no burning issue.

--
Daniel
* Re: VM subsystem bug in 2.4.0 ?

From: Linus Torvalds @ 2001-01-09 18:36 UTC
To: Stephen C. Tweedie
Cc: Christoph Rohland, Rik van Riel, Sergey E. Volkov, linux-kernel

On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> >
> > It's worse: The issue we are talking about is SYSV IPC_LOCK.
>
> The issue is locked VA pages. SysV is just one of the ways in which
> it can happen: the solution has got to address both that and
> mlock()/mlockall().

No, mlock() and mlockall() already work. Exactly because mlocked pages will
never be removed from the VM, the VM layer knows how to deal with them (or
"not deal with them", as the case is more properly stated). They won't ever
get on the inactive list, and because refill_inactive_scan() won't be able
to handle them (the count is elevated by the VM mappings), the VM will
correctly and gracefully fall back on scanning the page tables to find some
_other_ pages.

So mlock works fine. The reason shm locked segments do _not_ work fine is
exactly that they are not locked down in the VM, and for that reason they
can end up being detached from everything and thus moved to the inactive
list. That counts as "progress" as far as the VM is concerned, so we get
into this endless loop where we move the page to the inactive list, then try
to write it out, fail, and move it back to the active list again. The VM
_thinks_ it is making progress, but obviously isn't. End result: lockup.

Marking the pages with a magic flag would solve it. Then we could just make
refill_inactive_scan() ignore such pages. Something like "PG_Reallydirty".

		Linus
* Re: VM subsystem bug in 2.4.0 ?

From: Linus Torvalds @ 2001-01-09 18:23 UTC
To: Stephen C. Tweedie
Cc: Rik van Riel, Sergey E. Volkov, linux-kernel, Christoph Rohland

On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
>
> But again, how do you clear the bit? Locking is a per-vma property,
> not per-page. I can mmap a file twice and mlock just one of the
> mappings. If you get a munlock(), how are you to know how many other
> locked mappings still exist?

Note that this would be solved very cleanly if the SHM code would use the
"VM_LOCKED" flag, and actually lock the pages in the VM, instead of trying
to lock them down for writepage().

That would mean that such a segment would still get swapped out when it is
not mapped anywhere, but I wonder if that semantic difference really
matters.

If the vma is marked VM_LOCKED, the VM subsystem will do the right thing
(the page will never get removed from the page tables, so it won't ever make
it into that back-and-forth bounce between the active and the inactive
lists).

		Linus
* Re: VM subsystem bug in 2.4.0 ?

From: Christoph Rohland @ 2001-01-09 22:20 UTC
To: Linus Torvalds
Cc: Stephen C. Tweedie, Rik van Riel, Sergey E. Volkov, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> >
> > But again, how do you clear the bit? Locking is a per-vma property,
> > not per-page. I can mmap a file twice and mlock just one of the
> > mappings. If you get a munlock(), how are you to know how many other
> > locked mappings still exist?
>
> Note that this would be solved very cleanly if the SHM code would use the
> "VM_LOCKED" flag, and actually lock the pages in the VM, instead of trying
> to lock them down for writepage().

here comes the patch. (lightly tested)

Greetings
		Christoph

diff -uNr 2.4.0/include/linux/shmem_fs.h c/include/linux/shmem_fs.h
--- 2.4.0/include/linux/shmem_fs.h	Tue Jan  2 21:58:11 2001
+++ c/include/linux/shmem_fs.h	Tue Jan  9 22:01:48 2001
@@ -22,7 +22,6 @@
 	swp_entry_t	i_direct[SHMEM_NR_DIRECT]; /* for the first blocks */
 	swp_entry_t   **i_indirect; /* doubly indirect blocks */
 	unsigned long	swapped;
-	int		locked;     /* into memory */
 	struct list_head	list;
 };
diff -uNr 2.4.0/ipc/shm.c c/ipc/shm.c
--- 2.4.0/ipc/shm.c	Tue Jan  2 21:58:11 2001
+++ c/ipc/shm.c	Tue Jan  9 22:39:18 2001
@@ -91,9 +91,10 @@
 	return ipc_addid(&shm_ids, &shp->shm_perm, shm_ctlmni+1);
 }
 
-
-
-static inline void shm_inc (int id) {
+/* This is called by fork, once for every shm attach. */
+static void shm_open (struct vm_area_struct *shmd)
+{
+	int id = shmd->vm_file->f_dentry->d_inode->i_ino;
 	struct shmid_kernel *shp;
 
 	if(!(shp = shm_lock(id)))
@@ -104,12 +105,6 @@
 	shm_unlock(id);
 }
 
-/* This is called by fork, once for every shm attach. */
-static void shm_open (struct vm_area_struct *shmd)
-{
-	shm_inc (shmd->vm_file->f_dentry->d_inode->i_ino);
-}
-
 /*
  * shm_destroy - free the struct shmid_kernel
  *
@@ -154,9 +149,20 @@
 
 static int shm_mmap(struct file * file, struct vm_area_struct * vma)
 {
-	UPDATE_ATIME(file->f_dentry->d_inode);
+	struct shmid_kernel *shp;
+	struct inode * inode = file->f_dentry->d_inode;
+
+	UPDATE_ATIME(inode);
 	vma->vm_ops = &shm_vm_ops;
-	shm_inc(file->f_dentry->d_inode->i_ino);
+
+	if(!(shp = shm_lock(inode->i_ino)))
+		BUG();
+	shp->shm_atim = CURRENT_TIME;
+	shp->shm_lprid = current->pid;
+	shp->shm_nattch++;
+	if (shp->shm_flags & SHM_LOCKED)
+		vma->vm_flags |= VM_LOCKED;
+	shm_unlock(inode->i_ino);
 	return 0;
 }
 
@@ -365,6 +371,29 @@
 	}
 }
 
+static void shm_lockseg (struct shmid_kernel * shp, int cmd)
+{
+	struct address_space *mapping = shp->shm_file->f_dentry->d_inode->i_mapping;
+	struct vm_area_struct *mpnt;
+
+	spin_lock(&mapping->i_shared_lock);
+	if(cmd==SHM_LOCK) {
+		shp->shm_flags |= SHM_LOCKED;
+		for (mpnt = mapping->i_mmap; mpnt; mpnt = mpnt->vm_next_share)
+			mpnt->vm_flags |= VM_LOCKED;
+		for (mpnt = mapping->i_mmap_shared; mpnt; mpnt = mpnt->vm_next_share)
+			mpnt->vm_flags |= VM_LOCKED;
+	} else {
+		shp->shm_flags &= ~SHM_LOCKED;
+		for (mpnt = mapping->i_mmap; mpnt; mpnt = mpnt->vm_next_share)
+			mpnt->vm_flags &= ~VM_LOCKED;
+		for (mpnt = mapping->i_mmap_shared; mpnt; mpnt = mpnt->vm_next_share)
+			mpnt->vm_flags &= ~VM_LOCKED;
+	}
+	spin_unlock(&mapping->i_shared_lock);
+
+}
+
 asmlinkage long sys_shmctl (int shmid, int cmd, struct shmid_ds *buf)
 {
 	struct shm_setbuf setbuf;
@@ -466,13 +495,7 @@
 	err = shm_checkid(shp,shmid);
 	if(err)
 		goto out_unlock;
-	if(cmd==SHM_LOCK) {
-		shp->shm_file->f_dentry->d_inode->u.shmem_i.locked = 1;
-		shp->shm_flags |= SHM_LOCKED;
-	} else {
-		shp->shm_file->f_dentry->d_inode->u.shmem_i.locked = 0;
-		shp->shm_flags &= ~SHM_LOCKED;
-	}
+	shm_lockseg(shp, cmd);
 	shm_unlock(shmid);
 	return err;
 }
diff -uNr 2.4.0/mm/shmem.c c/mm/shmem.c
--- 2.4.0/mm/shmem.c	Tue Jan  2 21:58:11 2001
+++ c/mm/shmem.c	Tue Jan  9 22:02:18 2001
@@ -201,8 +201,6 @@
 	swp_entry_t *entry, swap;
 
 	info = &page->mapping->host->u.shmem_i;
-	if (info->locked)
-		return 1;
 	swap = __get_swap_page(2);
 	if (!swap.val)
 		return 1;
* Re: VM subsystem bug in 2.4.0 ?

From: Linus Torvalds @ 2001-01-09 22:59 UTC
To: Christoph Rohland
Cc: Stephen C. Tweedie, Rik van Riel, Sergey E. Volkov, linux-kernel

On 9 Jan 2001, Christoph Rohland wrote:
> Linus Torvalds <torvalds@transmeta.com> writes:
> >
> > Note that this would be solved very cleanly if the SHM code would use the
> > "VM_LOCKED" flag, and actually lock the pages in the VM, instead of trying
> > to lock them down for writepage().
>
> here comes the patch. (lightly tested)

I'd really like an opinion on whether this is truly legal or not. After all,
it does change the behaviour to mean "pages are locked only if they have
been mapped into virtual memory", which is not what it used to mean.

Arguably the new semantics are perfectly valid semantics on their own, but
I'm not sure they are acceptable.

In contrast, the PG_realdirty approach would give the old behaviour of truly
locked-down shm segments, with not significantly different complexity.

What do other UNIXes do for shm_lock()?

The Linux man-page explicitly states for SHM_LOCK that

	The user must fault in any pages that are required to be present
	after locking is enabled.

which kind of implies to me that the VM_LOCKED implementation is ok.

HOWEVER, looking at the HP-UX man-page, for example, certainly implies that
the PG_realdirty approach is the correct one.

The IRIX man-pages in contrast say

	Locking occurs per address space; multiple processes or sprocs
	mapping the area at different addresses each need to issue the
	lock (this is primarily an issue with the per-process page tables).

which again implies that they've done something akin to a VM_LOCKED
implementation.

Does anybody have any better pointers, ideas, or opinions?

		Linus
* Re: VM subsystem bug in 2.4.0 ? 2001-01-09 22:59 ` Linus Torvalds @ 2001-01-10 7:33 ` Christoph Rohland 2001-01-10 15:50 ` Tim Wright 1 sibling, 0 replies; 22+ messages in thread
From: Christoph Rohland @ 2001-01-10 7:33 UTC (permalink / raw)
To: Linus Torvalds
Cc: Stephen C. Tweedie, Rik van Riel, Sergey E. Volkov, linux-kernel

Hi Linus,

Linus Torvalds <torvalds@transmeta.com> writes:
> I'd really like an opinion on whether this is truly legal or not? After
> all, it does change the behaviour to mean "pages are locked only if they
> have been mapped into virtual memory". Which is not what it used to mean.
>
> Arguably the new semantics are perfectly valid semantics on their
> own, but I'm not sure they are acceptable.

I just checked SuS and they do not list SHM_LOCK as a command at all.

> In contrast, the PG_realdirty approach would give the old behaviour of
> truly locked-down shm segments, with not significantly different
> complexity behaviour.
>
> What do other UNIXes do for shm_lock()?
>
> The Linux man-page explicitly states for SHM_LOCK that
>
>     The user must fault in any pages that are required to be present
>     after locking is enabled.
>
> which kind of implies to me that the VM_LOCKED implementation is ok.

Yes.

> HOWEVER, looking at the HP-UX man-page, for example, certainly implies
> that the PG_realdirty approach is the correct one.

Yes.

> The IRIX man-pages in contrast say
>
>     Locking occurs per address space;
>     multiple processes or sprocs mapping the area at different
>     addresses each need to issue the lock (this is primarily an
>     issue with the per-process page tables).
>
> which again implies that they've done something akin to a VM_LOCKED
> implementation.

So Irix does something quite different. For Irix SHM_LOCK is a special
version of mlock...

> Does anybody have any better pointers, ideas, or opinions?

I think the VM_LOCKED approach is the best:

- SuS does not specify anything, and the different vendors do different
  things. So people using SHM_LOCK have to be aware that the details
  differ.

- Technically this is the fastest approach for attached segments: We do
  not scan the relevant vmas at all and by doing so we keep the
  overhead lowest. And I do not see a reason to use SHM_LOCK besides
  performance.

BTW I also have a patch appended which bumps the page count. Works
also, is also small, but we will have a higher soft fault rate with
that.

Greetings
        Christoph

diff -uNr 2.4.0/ipc/shm.c c/ipc/shm.c
--- 2.4.0/ipc/shm.c	Mon Jan  8 11:24:39 2001
+++ c/ipc/shm.c	Tue Jan  9 17:48:55 2001
@@ -121,6 +121,7 @@
 {
 	shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	shm_rmid (shp->id);
+	shmem_lock(shp->shm_file, 0);
 	fput (shp->shm_file);
 	kfree (shp);
 }
@@ -467,10 +468,10 @@
 	if(err)
 		goto out_unlock;
 	if(cmd==SHM_LOCK) {
-		shp->shm_file->f_dentry->d_inode->u.shmem_i.locked = 1;
+		shmem_lock(shp->shm_file, 1);
 		shp->shm_flags |= SHM_LOCKED;
 	} else {
-		shp->shm_file->f_dentry->d_inode->u.shmem_i.locked = 0;
+		shmem_lock(shp->shm_file, 0);
 		shp->shm_flags &= ~SHM_LOCKED;
 	}
 	shm_unlock(shmid);
diff -uNr 2.4.0/mm/shmem.c c/mm/shmem.c
--- 2.4.0/mm/shmem.c	Mon Jan  8 11:24:39 2001
+++ c/mm/shmem.c	Tue Jan  9 18:04:16 2001
@@ -310,6 +310,8 @@
 	}
 	/* We have the page */
 	SetPageUptodate (page);
+	if (info->locked)
+		page_cache_get(page);
 
 cached_page:
 	UnlockPage (page);
@@ -399,6 +401,32 @@
 	spin_unlock (&sb->u.shmem_sb.stat_lock);
 	buf->f_namelen = 255;
 	return 0;
+}
+
+void shmem_lock(struct file * file, int lock)
+{
+	struct inode * inode = file->f_dentry->d_inode;
+	struct shmem_inode_info * info = &inode->u.shmem_i;
+	struct page * page;
+	unsigned long idx, size;
+
+	if (info->locked == lock)
+		return;
+	down(&inode->i_sem);
+	info->locked = lock;
+	size = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	for (idx = 0; idx < size; idx++) {
+		page = find_lock_page(inode->i_mapping, idx);
+		if (!page)
+			continue;
+		if (!lock) {
+			/* release the extra count and our reference */
+			page_cache_release(page);
+			page_cache_release(page);
+		}
+		UnlockPage(page);
+	}
+	up(&inode->i_sem);
 }
 
 /*
* Re: VM subsystem bug in 2.4.0 ? 2001-01-09 22:59 ` Linus Torvalds 2001-01-10 7:33 ` Christoph Rohland @ 2001-01-10 15:50 ` Tim Wright 1 sibling, 0 replies; 22+ messages in thread
From: Tim Wright @ 2001-01-10 15:50 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Rohland, Stephen C. Tweedie, Rik van Riel, Sergey E. Volkov, linux-kernel

Hi Linus,

On Tue, Jan 09, 2001 at 02:59:07PM -0800, Linus Torvalds wrote:
>
> Arguably the new semantics are perfectly valid semantics on their own, but
> I'm not sure they are acceptable.
>
> In contrast, the PG_realdirty approach would give the old behaviour of
> truly locked-down shm segments, with not significantly different
> complexity behaviour.
>
> What do other UNIXes do for shm_lock()?
>

It appears that the fine-detail semantics vary across the board.
DYNIX/ptx supports two forms of SysV shm locking: soft and hard.
Soft-locking (the default) merely makes the pages sticky, so if you
fault them in, they stay in your resident set, but don't count against
it. If, however, the process swaps, they're all evicted, and when the
process is swapped back in, you get to fault them back in all over
again. Hard-locking pins the segment into physical memory until such
time as it's destroyed. It stays there even if there are currently no
attaches. Again, such pages are not counted against the process RSS.
SVR4 only supports one form. It faults all the pages in and locks them
into memory, but doesn't treat them specially wrt rss/paging, which
seems none too clever - if they're locked into memory, you might as
well use them :-)

[Details of the differing approaches omitted]

>
> Does anybody have any better pointers, ideas, or opinions?
>
> Linus
>

I don't know if there are any arguments in favour of making both
approaches available. Gut feel says that's overkill. We ended up with
two by historical accident. The soft-locking was always there (although
semantically different to SVR4), and the hard-locking stuff was added
to boost performance with a certain six-letter RDBMS that attaches an
SGA to each process. They all get to attach it "for free", and since it
doesn't count towards the RSS, it allowed tuning a fairly small RSS
across the system without having the RDBMS processes spend all their
time (soft) faulting SGA pages in and out of their RSS.

Tim

--
Tim Wright - timw@splhi.com or timw@aracnet.com or twright@us.ibm.com
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
end of thread, other threads:[~2001-01-25 22:50 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-01-08  8:46 VM subsystem bug in 2.4.0 ? Sergey E. Volkov
2001-01-08 18:00 ` Rik van Riel
2001-01-08 18:10   ` Linus Torvalds
2001-01-08 18:30     ` Rik van Riel
2001-01-09  7:52       ` Christoph Rohland
2001-01-09 14:09         ` Stephen C. Tweedie
2001-01-09 14:53           ` Christoph Rohland
2001-01-09 15:31             ` Stephen C. Tweedie
2001-01-09 15:45               ` Christoph Rohland
2001-01-09 16:05                 ` Stephen C. Tweedie
2001-01-09 16:17                   ` Christoph Rohland
2001-01-09 18:37                     ` Linus Torvalds
2001-01-09 16:45                   ` Daniel Phillips
2001-01-17  8:33                     ` Rik van Riel
2001-01-18  8:23                       ` Christoph Rohland
2001-01-25 22:47                         ` Daniel Phillips
2001-01-09 18:36                   ` Linus Torvalds
2001-01-09 18:23               ` Linus Torvalds
2001-01-09 22:20                 ` Christoph Rohland
2001-01-09 22:59                   ` Linus Torvalds
2001-01-10  7:33                     ` Christoph Rohland
2001-01-10 15:50                     ` Tim Wright