Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent
	VM_LOCKED file segments
From: Lee Schermerhorn
To: Hugh Dickins
Cc: Linus Torvalds, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will@crowder-design.com,
	Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Miklos Szeredi
In-Reply-To: 
References: <1233259410.2315.75.camel@lts-notebook>
	 <20090130055639.GA30950@suse.de>
	 <1233345190.908.36.camel@lts-notebook>
	 <1233351412.908.69.camel@lts-notebook>
	 <1233677610.15321.129.camel@lts-notebook>
Organization: HP/OSLO
Date: Tue, 03 Feb 2009 16:50:24 -0500
Message-Id: <1233697824.15321.231.camel@lts-notebook>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 2009-02-03 at 17:10 +0000, Hugh Dickins wrote:
> On Tue, 3 Feb 2009, Lee Schermerhorn wrote:
> > On Sat, 2009-01-31 at 12:35 +0000, Hugh Dickins wrote:
> > > We need a way to communicate not-MAP_NORESERVE to shmem.c, and we don't
> > > just need it in the explicit shmem_zero_setup() case, we also need it
> > > for the (probably rare nowadays) case when mmap() is working on file
> >           ^^^^^^^^^^^^^^^^^^^^^^^^
> > > /dev/zero (drivers/char/mem.c mmap_zero()), rather than using MAP_ANON.
> > 
> > This reminded me of something I'd seen recently looking
> > at /proc/<pid>/[numa]_maps for <application> on Linux/x86_64:
> >...
> > 2adadf711000-2adadf721000 rwxp 00000000 00:0e 4072      /dev/zero
> > 2adadf721000-2adadf731000 rwxp 00000000 00:0e 4072      /dev/zero
> > 2adadf731000-2adadf741000 rwxp 00000000 00:0e 4072      /dev/zero
> > 
> > 7fffcdd36000-7fffcdd4e000 rwxp 7fffcdd36000 00:00 0     [stack]
> > ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso]
> > 
> > For portability between Linux and various Unix-like systems that don't
> > support MAP_ANON*, perhaps?
> > 
> > Anyway, from the addresses and permissions, these all look potentially
> > mergeable.  The offset is preventing merging, right?  I guess that's one
> > of the downsides of mapping /dev/zero rather than using MAP_ANONYMOUS?
> > 
> > Makes one wonder whether it would be worthwhile [not to mention
> > possible] to rework mmap_zero() to mimic MAP_ANONYMOUS...
> 
> That's certainly an interesting observation, and thank you for sharing
> it with us (hmm, I sound like a self-help group leader or something).
> 
> I don't really have anything to add to what Linus said (and hadn't
> got around to realizing the significance of the "p" there before I
> saw his reply).
> 
> Mmm, it's interesting, but I fear to add more hacks in there just
> for this - I guess we could, but I'd rather not, unless it becomes
> a serious issue.
> 
> Let's just tuck away the knowledge of this case for now.

Right.  And a bit more info to tuck away...

I routinely grab the proc maps and numa_maps from our largish servers
running various "industry standard benchmarks".  Prompted by Linus'
comment that "if it's just a hundred segments, nobody really cares",
I went back and looked a bit further at the maps for a recent run.
Below are some segment counts for that run.

The benchmark involved 32 "instances" of the application--a technique
used to reduce contention on application internal resources as the
user count increases--along with its database task[s].
Each instance spawns a few processes [5-6 average, up to ~14, for this
run] that share a few instance-specific SYSV segments between them.  In
each instance, one of those shmem segments exhibits a similar pattern
to the /dev/zero segments from the prior mail.  Many, altho' not all,
of the individual vmas are adjacent with the same permissions: 'r--s'.
E.g., a small snippet:

2ac0e3cf0000-2ac0e40f5000 r--s 00d26000 00:08 15695938  /SYSV0000277a (deleted)
2ac0e40f5000-2ac0e4101000 r--s 0112b000 00:08 15695938  /SYSV0000277a (deleted)
2ac0e4101000-2ac0e4102000 r--s 01137000 00:08 15695938  /SYSV0000277a (deleted)
2ac0e4102000-2ac0e4113000 r--s 01138000 00:08 15695938  /SYSV0000277a (deleted)
2ac0e4113000-2ac0e4114000 r--s 01149000 00:08 15695938  /SYSV0000277a (deleted)
2ac0e4114000-2ac0e4115000 r--s 0114a000 00:08 15695938  /SYSV0000277a (deleted)
2ac0e4115000-2ac0e4116000 r--s 0114b000 00:08 15695938  /SYSV0000277a (deleted)
2ac0e4116000-2ac0e4117000 r--s 0114c000 00:08 15695938  /SYSV0000277a (deleted)

I counted 2000-3600+ of these for a couple of tasks.  How they got like
this--one vma per page?--I'm not sure.  Perhaps a sequence of
mprotect() calls or such after attaching the segment.  [I'll try to get
an strace sometime.]

Then I counted the occurrences of the pattern '^2.*r--s.*/SYSV' in each
of the instances as, again, each instance uses a different shmem
segment among its tasks.
For good measure, I counted the '/dev/zero' segments as well:

              SYSV shm   /dev/zero
instance 00     5771        217
instance 01     6025        183
instance 02     5738        176
instance 03     5798        177
instance 04     5709        182
instance 05     5423        915
instance 06     5513        929
instance 07     5915        180
instance 08     5802        182
instance 09     5690        177
instance 10     5643        177
instance 11     5647        180
instance 12     5656        182
instance 13     5672        181
instance 14     5522        180
instance 15     5497        180
instance 16     5594        179
instance 17     4922        906
instance 18     6956        935
instance 19     5769        181
instance 20     5771        180
instance 21     5712        180
instance 22     5711        184
instance 23     5631        179
instance 24     5586        180
instance 25     5640        180
instance 26     5614        176
instance 27     5523        176
instance 28     5600        179
instance 29     5473        177
instance 30     5581        180
instance 31     5470        180

A total of ~180K shmem segments, not counting the /dev/zero mappings.
Good thing we have a lot of memory :).

A couple of those segments per instance are different shmem
segments--just 2 or 3 out of 5k-6k in the cases that I looked at.

The benchmark seems to run fairly well, so I'm not saying we have a
problem here--with the Linux kernel, anyway.  Just some raw data from
a pseudo-real-world application load.  ['pseudo' because I'm told no
real user would ever set up the app quite this way :)]

Also, this is on a vintage 2.6.16+ kernel [not my choice].  Soon I'll
have data from a much more recent release.

Lee