* Maybe a VM bug in 2.4.18-18 from RH 8.0?
@ 2002-12-06 0:13 Norman Gaywood
2002-12-06 1:00 ` Andrew Morton
2002-12-06 1:08 ` Andrea Arcangeli
0 siblings, 2 replies; 49+ messages in thread
From: Norman Gaywood @ 2002-12-06 0:13 UTC (permalink / raw)
To: linux-kernel
I think I have a trigger for a VM bug in the RH kernel-bigmem-2.4.18-18
The system is a 4 processor, 16GB memory Dell PE6600 running RH8.0 +
errata. More details at the end of this message.
By doing a large copy I can trigger this problem in about 30-40 minutes. At
the end of that time, kswapd will start to get a larger % of CPU and
the system load will be around 2-3. The system will feel sluggish at an
interactive shell and it will take several seconds before a command like
top would start to display. If I let it go for another 30 minutes the
system is unusable, where it can take 10 minutes or more to do simple
commands. If I let it go for several hours after that, the following
messages can appear on the console depending on the type of copy:
ENOMEM in journal_get_undo_access_Rsmp_df5dec49, retrying.
or
ENOMEM in do_get_write_access, retrying.
The problem can be triggered by almost any type of copy command. In
particular, this command can trigger it:
tar cf /dev/tape .
for "." large enough. Unfortunately this was how I was intending to back up
the system.
"Large enough" is several gigabytes. It also seems to depend on how much
memory is used. In particular, how much memory is used by cache. Also in
the equation is the number of files. Copying one big file does not seem
to trigger the problem. I initially discovered the problem when doing an
rsync copy over a network of the user home directories.
Can it be stopped? Yes. On the linux-poweredge@dell.com mailing list,
Stephan Wonczak suggested that I should put the system under some memory
pressure while doing the copy. The program he supplied did nothing except
use about 750 megabytes of memory. I tried running it at 10-second
intervals while doing a copy but it did not help. Since the system has
16 Gig of memory, I tried to give it some real memory pressure and ran
7 processes that used 1.8G each like this:
#!/bin/sh
SLEEP=600
COUNT=20
while [ "$COUNT" -gt 0 ]
do
COUNT=`expr $COUNT - 1`   # count down so the loop stops after 20 passes
date
# 2000 by 1_000_000 seems to be a 1.8G process
perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }'
sleep $SLEEP
done
This brought the cache down to about 3-4 Gig used after it ran. With this
running the system performed the copy with no problems! No doubt there
is a happy medium between these two extremes.
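For anyone who wants to generate that kind of memory pressure without perl,
a memory hog is only a few lines of C. This is a rough sketch of the sort of
program Stephan presumably supplied (the size and hold time are made-up
parameters, nothing here is from his actual code):

/* memhog.c -- allocate and touch a given number of megabytes, hold them
 * for a while, then exit.  Build with: gcc -O2 -o memhog memhog.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long mb   = argc > 1 ? atol(argv[1]) : 750;	/* megabytes to take */
	long secs = argc > 2 ? atol(argv[2]) : 600;	/* seconds to hold them */
	size_t size = (size_t)mb * 1024 * 1024;
	char *buf = malloc(size);

	if (buf == NULL) {
		perror("malloc");
		return 1;
	}
	memset(buf, 'x', size);	/* touch every page so the memory is really used */
	sleep(secs);
	free(buf);
	return 0;
}

Running something like "./memhog 1800 600 &" seven times in parallel would be
roughly equivalent to the perl one-liners above.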
There is a suggestion that I may not see this problem when the system is
under real load. Since I am only setting up the system at the moment there
are no users giving the system something to do. The copy is the only real
work during these tests. I find it difficult to say "she'll be right"
(as we do in Aus) and throw the system into production hoping that it
will just work.
So what do I do now? I have what I believe is a trigger for a VM problem
in a widely used version of Linux. Does anyone have some patches for me to
try that won't take me too far from the RH 8.0 base system?
Here are the system details:
PE6600 running RH 8.0 with latest errata. Note that I have upgraded to
kernel 2.4.18-19.7.tg3.120bigmem which I understand to be the latest
RH8 errata kernel + patches to stop the tg3 hanging problem. This came
from http://people.redhat.com/jgarzik/tg3/. I have also tried the latest
RH errata kernel using the bcm5700 driver and it has the same problem.
HW includes:
Adaptec AIC-7892 SCSI BIOS v25704
3 Adaptec SCSI Card 39160 BIOS v2.57.2S2
8 HITACHI DK32DJ-72MC 160 drives
2 Quantum ATLAS10K3-73-SCA 160 drives
uname -a
Linux alan.une.edu.au 2.4.18-19.7.tg3.120bigmem #1 SMP Mon Nov 25 15:15:29 EST 2002 i686 i686 i386 GNU/Linux
cat /proc/meminfo
total: used: free: shared: buffers: cached:
Mem: 16671522816 444915712 16226607104 0 136830976 56520704
Swap: 34365202432 0 34365202432
MemTotal: 16280784 kB
MemFree: 15846296 kB
MemShared: 0 kB
Buffers: 133624 kB
Cached: 55196 kB
SwapCached: 0 kB
Active: 249984 kB
Inact_dirty: 18088 kB
Inact_clean: 480 kB
Inact_target: 53708 kB
HighTotal: 15597504 kB
HighFree: 15434932 kB
LowTotal: 683280 kB
LowFree: 411364 kB
SwapTotal: 33559768 kB
SwapFree: 33559768 kB
Committed_AS: 177044 kB
df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/md2 8254136 2825112 5009736 37% /
/dev/md0 101018 25627 70175 27% /boot
/dev/md6 211671024 88323536 112595200 44% /home
/dev/md1 16515968 1785024 13891956 12% /opt
none 8140392 0 8140392 0% /dev/shm
/dev/md4 4126976 149944 3767392 4% /tmp
/dev/md3 16515968 168172 15508808 2% /var
/dev/md5 8522932 1596520 6493468 20% /var/spool/mail
/dev/sdh1 70557052 32832 66940124 1% /.automount/alan/disks/alan/h1
/dev/sdi1 70557052 22856784 44116172 35% /.automount/alan/disks/alan/i1
/dev/sdj1 70557052 13619440 53353516 21% /.automount/alan/disks/alan/j1
df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/md2 1048576 167838 880738 17% /
/dev/md0 26104 59 26045 1% /boot
/dev/md6 26886144 1941926 24944218 8% /home
/dev/md1 2101152 49285 2051867 3% /opt
none 2035098 1 2035097 1% /dev/shm
/dev/md4 524288 26 524262 1% /tmp
/dev/md3 2101152 4877 2096275 1% /var
/dev/md5 1082720 2535 1080185 1% /var/spool/mail
/dev/sdh1 8962048 12 8962036 1% /.automount/alan/disks/alan/h1
/dev/sdi1 8962048 712400 8249648 8% /.automount/alan/disks/alan/i1
/dev/sdj1 8962048 10497 8951551 1% /.automount/alan/disks/alan/j1
--
Norman Gaywood -- School of Mathematical and Computer Sciences
University of New England, Armidale, NSW 2351, Australia
norm@turing.une.edu.au http://turing.une.edu.au/~norm
Phone: +61 2 6773 2412 Fax: +61 2 6773 3312
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 0:13 Maybe a VM bug in 2.4.18-18 from RH 8.0? Norman Gaywood @ 2002-12-06 1:00 ` Andrew Morton 2002-12-06 1:17 ` Andrea Arcangeli 2002-12-06 1:08 ` Andrea Arcangeli 1 sibling, 1 reply; 49+ messages in thread From: Andrew Morton @ 2002-12-06 1:00 UTC (permalink / raw) To: Norman Gaywood; +Cc: linux-kernel Norman Gaywood wrote: > > I think I have a trigger for a VM bug in the RH kernel-bigmem-2.4.18-18 > > 16GB > ... > tar cf /dev/tape . > This machine will die due to buffer_heads which are attached to highmem pagecache, and due to inodes which are pinned by highmem pagecache. > ... > while [ `expr $COUNT - 1` != 0 ] > do > date > # 2000 by 1_000_000 seems to be a 1.8G process > perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' & > ... This will evict the highmem pagecache. That frees the buffer_heads and unpins the inodes. > So what do I do now? I guess talk to Red Hat. These are well-known problems and there should be fixes for them in a "bigmem" kernel. Otherwise, the -aa kernels have patches to address these problems. One option would be to roll your own kernel, based on a kernel.org kernel and a matching patch from http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/ > ... > Anyone have some patches for me to > try that won't take me too far from the RH 8.0 base system. Hard. The relevant patches are: http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/05_vm_16_active_free_zone_bhs-1 and http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/10_inode-highmem-2 The first one will not come vaguely close to applying to an RH 2.4.18 kernel. The second one may well apply, and will probably fix the problem. ^ permalink raw reply [flat|nested] 49+ messages in thread
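One way to watch the failure mode described above while the copy runs is to
poll a few lines of /proc/meminfo and /proc/slabinfo. A small sketch (not
from the thread; it only assumes the 2.4 slab cache names buffer_head,
inode_cache and dentry_cache, and the LowFree/LowTotal/Cached fields shown
in the meminfo dump earlier):

/* lowmem-watch.c -- print the /proc lines relevant to this failure mode
 * every ten seconds.  Build with: gcc -O2 -o lowmem-watch lowmem-watch.c
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void grep_file(const char *path, const char *keys[])
{
	FILE *f = fopen(path, "r");
	char line[512];
	int i;

	if (f == NULL)
		return;
	while (fgets(line, sizeof(line), f))
		for (i = 0; keys[i] != NULL; i++)
			if (strncmp(line, keys[i], strlen(keys[i])) == 0)
				fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	static const char *mem_keys[]  = { "LowTotal", "LowFree", "Cached", NULL };
	static const char *slab_keys[] = { "buffer_head", "inode_cache",
					   "dentry_cache", NULL };

	for (;;) {
		grep_file("/proc/meminfo", mem_keys);
		grep_file("/proc/slabinfo", slab_keys);
		printf("----\n");
		fflush(stdout);
		sleep(10);
	}
	return 0;
}

LowFree shrinking while buffer_head and inode_cache keep growing, with
gigabytes still free in highmem, would be consistent with the buffer_head
and pinned-inode behaviour described above.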
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 1:00 ` Andrew Morton
@ 2002-12-06 1:17 ` Andrea Arcangeli
  2002-12-06 1:34   ` Andrew Morton
  0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06 1:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: Norman Gaywood, linux-kernel

On Thu, Dec 05, 2002 at 05:00:15PM -0800, Andrew Morton wrote:
> Hard. The relevant patches are:
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/05_vm_16_active_free_zone_bhs-1
> and
> http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/10_inode-highmem-2

Yep, those are the two I had in mind when I said they're pending for
2.4.21pre inclusion.

He may still suffer other known problems besides the above two critical
highmem fixes (for example if lower_zone_reserve_ratio is not applied;
there's no other fix around it IMHO, it's a generic OS problem, not only
for Linux, and that was my only sensible solution to fix it, the
approach in mainline is way too weak to make a real difference), though
any other problem would probably need something more complicated than a
tar to reproduce.

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 1:17 ` Andrea Arcangeli @ 2002-12-06 1:34 ` Andrew Morton 2002-12-06 1:44 ` Andrea Arcangeli 0 siblings, 1 reply; 49+ messages in thread From: Andrew Morton @ 2002-12-06 1:34 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Norman Gaywood, linux-kernel Andrea Arcangeli wrote: > > ... > He may still suffer other known problems besides > the above two critical highmem fixes (for example if > lower_zone_reserve_ratio is not applied and there's no other fix around > it IMHO, that's generic OS problem not only for linux, and that was my > only sensible solution to fix it, the approch in mainline is way too > weak to make a real difference) argh. I hate that one ;) Giving away 100 megabytes of memory hurts. I've never been able to find the workload which makes this necessary. Can you please describe an "exploit" against 2.4.20 which demonstrates the need for this? Thanks. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 1:34 ` Andrew Morton @ 2002-12-06 1:44 ` Andrea Arcangeli 2002-12-06 2:15 ` William Lee Irwin III 0 siblings, 1 reply; 49+ messages in thread From: Andrea Arcangeli @ 2002-12-06 1:44 UTC (permalink / raw) To: Andrew Morton; +Cc: Norman Gaywood, linux-kernel On Thu, Dec 05, 2002 at 05:34:34PM -0800, Andrew Morton wrote: > Andrea Arcangeli wrote: > > > > ... > > He may still suffer other known problems besides > > the above two critical highmem fixes (for example if > > lower_zone_reserve_ratio is not applied and there's no other fix around > > it IMHO, that's generic OS problem not only for linux, and that was my > > only sensible solution to fix it, the approch in mainline is way too > > weak to make a real difference) > > argh. I hate that one ;) Giving away 100 megabytes of memory > hurts. 100M hurts on a 4G box? No-way ;) it hurts when such 100M of normal zone are mlocked by an highmem-capable users and you can't allocate one more inode but you have still 3G free of highmem (google is doing this, they even drop a check so they can mlock > half of the ram). Or it hurts when you can't allocate an inode because such 100M are in pagetables on a 64G box and you still have 60G free of highmem. > I've never been able to find the workload which makes this > necessary. Can you please describe an "exploit" against ask google... > 2.4.20 which demonstrates the need for this? even simpler, swapoff -a and malloc and have fun! ;) (again ask google, they run w/o swap for obvious good reasons) Or if you have enough time, wait those 100M to be filled by pagetables on a 64G box. Andrea ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 1:44 ` Andrea Arcangeli @ 2002-12-06 2:15 ` William Lee Irwin III 2002-12-06 2:28 ` Andrea Arcangeli 2002-12-06 10:36 ` Arjan van de Ven 0 siblings, 2 replies; 49+ messages in thread From: William Lee Irwin III @ 2002-12-06 2:15 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Norman Gaywood, linux-kernel On Fri, Dec 06, 2002 at 02:44:29AM +0100, Andrea Arcangeli wrote: > Or it hurts when you can't allocate an inode because such 100M are in > pagetables on a 64G box and you still have 60G free of highmem. This is the zone vs. zone watermark stuff that penalizes/fails allocations made with a given GFP mask from being satisfied by fallback. This is largely old news wrt. various kinds of inability to pressure those ZONE_NORMAL (maybe also ZONE_DMA) consumers. Admission control for fallback is valuable, sure. I suspect the question akpm raised is about memory utilization. My own issues are centered around allocations targeted directly at ZONE_NORMAL, which fallback prevention does not address, so the watermark patch is not something I'm personally very concerned about. 64GB isn't getting any testing that I know of; I'd hold off until someone's actually stood up and confessed to attempting to boot Linux on such a beast. Or until I get some more RAM. =) Bill ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 2:15 ` William Lee Irwin III
@ 2002-12-06 2:28 ` Andrea Arcangeli
  2002-12-06 2:41   ` William Lee Irwin III
  2002-12-06 10:36  ` Arjan van de Ven
  1 sibling, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06 2:28 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton, Norman Gaywood, linux-kernel

On Thu, Dec 05, 2002 at 06:15:59PM -0800, William Lee Irwin III wrote:
> On Fri, Dec 06, 2002 at 02:44:29AM +0100, Andrea Arcangeli wrote:
> > Or it hurts when you can't allocate an inode because such 100M are in
> > pagetables on a 64G box and you still have 60G free of highmem.
>
> This is the zone vs. zone watermark stuff that penalizes/fails
> allocations made with a given GFP mask from being satisfied by
> fallback. This is largely old news wrt. various kinds of inability
> to pressure those ZONE_NORMAL (maybe also ZONE_DMA) consumers.
>
> Admission control for fallback is valuable, sure. I suspect the
> question akpm raised is about memory utilization. My own issues are
> centered around allocations targeted directly at ZONE_NORMAL,
> which fallback prevention does not address, so the watermark patch
> is not something I'm personally very concerned about.

You must be very concerned about it too.
If you don't have the fallback prevention, all your efforts around the
allocations targeted directly at ZONE_NORMAL will be completely
worthless. Either that, or you want to drop ZONE_NORMAL entirely,
because it means nothing uses zone-normal dynamically anymore
(ZONE_NORMAL seen as a place that is directly mapped, not necessarily
always 32bit dma capable).

> 64GB isn't getting any testing that I know of; I'd hold off until
> someone's actually stood up and confessed to attempting to boot
> Linux on such a beast. Or until I get some more RAM. =)

64GB is an example, a good example for this thing, but a 16G machine or
a 4G machine can run into the very same issues. As said, just swapoff -a
and malloc(1G), and such 1G is all ZONE_NORMAL before you could allocate
enough inodes for your workload. Or alloc 1G of pagetables by setting
everything protnone, and such 1G of pagetables goes in zone-normal
because the highmem is filled by cache. Choose whatever is your
preferred example of a real-life bug fixed by the lowmem-reservation
patch; it is absolutely necessary to run stably on a big box with a
normal zone and highmem (not only a 64G box).

The only place where you need not be concerned about these fixes is the
64bit archs.

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 2:28 ` Andrea Arcangeli @ 2002-12-06 2:41 ` William Lee Irwin III 2002-12-06 5:25 ` Andrew Morton 2002-12-06 22:28 ` Andrea Arcangeli 0 siblings, 2 replies; 49+ messages in thread From: William Lee Irwin III @ 2002-12-06 2:41 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Norman Gaywood, linux-kernel On Thu, Dec 05, 2002 at 06:15:59PM -0800, William Lee Irwin III wrote: >> Admission control for fallback is valuable, sure. I suspect the >> question akpm raised is about memory utilization. My own issues are >> centered around allocations targeted directly at ZONE_NORMAL, >> which fallback prevention does not address, so the watermark patch >> is not something I'm personally very concerned about. On Fri, Dec 06, 2002 at 03:28:53AM +0100, Andrea Arcangeli wrote: > you must be very concerned about it too. > If you don't have the fallback prevention all your efforts around the > allocations targeted directoy zone normal will be completely worthless. > Either that or you want to drop ZONE_NORMAL enterely because it means > nothing uses zone-normal dynamically anymore (ZONE_NORMAL seen as a > place that is directly mapped, not necessairly always 32bit dma > capable). Yes, it's necessary; no, I've never directly encountered the issue it fixes. Sorry about the miscommunication there. On Thu, Dec 05, 2002 at 06:15:59PM -0800, William Lee Irwin III wrote: >> 64GB isn't getting any testing that I know of; I'd hold off until >> someone's actually stood up and confessed to attempting to boot >> Linux on such a beast. Or until I get some more RAM. =) On Fri, Dec 06, 2002 at 03:28:53AM +0100, Andrea Arcangeli wrote: > 64GB is an example, a good example for this thing, but a 16G machine or > a 4G machine can run in the very same issues. As said just swapoff -a > and malloc(1G) and such 1G is all ZONE_NORMAL before you could allocate > enough inodes for your workload. Or alloc 1G of pagetables by setting > everything protnone, and sugh 1G of pagetables goes in zone-normal > because the highmem is filled by cache. Choose whatever is your > preferred example of real life bug fixed by the lowmem-reservation patch > that is absolutely necessary to run stable on a big box with normal zone > and highmem (not only a 64G box). > The only place where you must not be concerned about these fixes are the > 64bit archs. 64GB on 32-bit is in the territory where it's dead, either literally, performance-wise, or by virtue of dropping hardware on the floor (as it's basically no longer 64GB) due to deeper design limitations. No idea why there's not more support behind or interest in page clustering. It's an optimization (not required) for 64-bit/saner arches. Bill ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 2:41 ` William Lee Irwin III @ 2002-12-06 5:25 ` Andrew Morton 2002-12-06 5:48 ` Andrea Arcangeli 2002-12-06 6:00 ` William Lee Irwin III 2002-12-06 22:28 ` Andrea Arcangeli 1 sibling, 2 replies; 49+ messages in thread From: Andrew Morton @ 2002-12-06 5:25 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Andrea Arcangeli, Norman Gaywood, linux-kernel William Lee Irwin III wrote: > > Yes, it's necessary; no, I've never directly encountered the issue it > fixes. Sorry about the miscommunication there. The google thing. The basic problem is in allowing allocations which _could_ use highmem to use the normal zone as anon memory or pagecache. Because the app could mlock that memory. So for a simple demonstration: - mem=2G - read a 1.2G file - malloc 800M, now mlock it. Those 800M will be in ZONE_NORMAL, simply because that was where the free memory was. And you're dead, even though you've only mlocked 800M. The same thing happens if you have lots of anon memory in the normal zone and there is no swapspace available. Linus's approach was to raise the ZONE_NORMAL pages_min limit for allocations which _could_ use highmem. So a GFP_HIGHUSER allocation has a pages_min limit of (say) 4M when considering the normal zone, but a GFP_KERNEL allocation has a limit of 2M. Andrea's patch does the same thing, via a separate table. He has set the threshold much higher (100M on a 4G box). AFAICT, the algorithms are identical - I was planning on just adding a multiplier to set Linus's ratio - it is currently hardwired to "1". Search for "mysterious" in mm/page_alloc.c ;) It's not clear to me why -aa defaults to 100 megs when the problem only occurs with no swap or when the app is using mlock. The default multiplier (of variable local_min) should be zero. Swapless machines or heavy mlock users can crank it up. But mlocking 700M on a 4G box would kill it as well. The google application, IIRC, mlocks 1G on a 2G machine. Daniel put them onto the 2G+2G split and all was well. Anyway, thanks. I'll take another look at Andrea's implementation. Now, regarding mlock(mmap(open(/dev/hda1))) ;) ^ permalink raw reply [flat|nested] 49+ messages in thread
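The malloc-plus-mlock half of that demonstration is easy to script; a sketch
follows (not code from the thread; it needs root or a raised RLIMIT_MEMLOCK
for mlock() to succeed, and the 1.2G file read can be done beforehand with
cat or dd):

/* mlock-hog.c -- the malloc+mlock step: allocate ~800M of anonymous
 * memory, touch it and pin it.  Build with: gcc -O2 -o mlock-hog mlock-hog.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t size = 800UL * 1024 * 1024;
	char *buf = malloc(size);

	if (buf == NULL) {
		perror("malloc");
		return 1;
	}
	memset(buf, 0, size);		/* instantiate the pages first */
	if (mlock(buf, size) != 0) {	/* then pin them wherever they landed */
		perror("mlock");
		return 1;
	}
	printf("800M locked; now watch LowFree in /proc/meminfo\n");
	pause();			/* hold the locked memory until killed */
	return 0;
}

If the pagecache filled highmem first, those 800M of anonymous pages land in
ZONE_NORMAL and the mlock() pins them there, which is the scenario described
above.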
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 5:25 ` Andrew Morton
@ 2002-12-06 5:48 ` Andrea Arcangeli
  2002-12-06 6:14   ` William Lee Irwin III
  2002-12-06 6:55   ` Andrew Morton
  1 sibling, 2 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06 5:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, Norman Gaywood, linux-kernel

On Thu, Dec 05, 2002 at 09:25:15PM -0800, Andrew Morton wrote:
> William Lee Irwin III wrote:
> >
> > Yes, it's necessary; no, I've never directly encountered the issue it
> > fixes. Sorry about the miscommunication there.
>
> The google thing.
>
> The basic problem is in allowing allocations which _could_ use
> highmem to use the normal zone as anon memory or pagecache.
>
> Because the app could mlock that memory. So for a simple
> demonstration:
>
> - mem=2G
> - read a 1.2G file
> - malloc 800M, now mlock it.
>
> Those 800M will be in ZONE_NORMAL, simply because that was where the
> free memory was. And you're dead, even though you've only mlocked
> 800M. The same thing happens if you have lots of anon memory in the
> normal zone and there is no swapspace available.
>
> Linus's approach was to raise the ZONE_NORMAL pages_min limit for
> allocations which _could_ use highmem. So a GFP_HIGHUSER allocation
> has a pages_min limit of (say) 4M when considering the normal zone,
> but a GFP_KERNEL allocation has a limit of 2M.
>
> Andrea's patch does the same thing, via a separate table. He has
> set the threshold much higher (100M on a 4G box). AFAICT, the
> algorithms are identical - I was planning on just adding a multiplier
> to set Linus's ratio - it is currently hardwired to "1". Search for
> "mysterious" in mm/page_alloc.c ;)
>
> It's not clear to me why -aa defaults to 100 megs when the problem
> only occurs with no swap or when the app is using mlock. The default
> multiplier (of variable local_min) should be zero. Swapless machines
> or heavy mlock users can crank it up.
>
> But mlocking 700M on a 4G box would kill it as well. The google
> application, IIRC, mlocks 1G on a 2G machine. Daniel put them
> onto the 2G+2G split and all was well.
>
> Anyway, thanks. I'll take another look at Andrea's implementation.

You should, because it seems you didn't realize how my code works. The
algorithm is autotuned at boot and depends on the zone sizes, and it
applies to the dma zone too with respect to the normal zone; the highmem
case is just one of the cases that the fix for the general problem
resolves. And you're totally wrong saying that mlocking 700M on a 4G box
could kill it. I call it the per-classzone point-of-view watermark. If
you are capable of highmem (mlock users are) you must leave 100M or 10M
or 10G free in the normal zone (depending on the watermark setting tuned
at boot, which is calculated as a function of the zone sizes), etc., so
it doesn't matter if you mlock 700M or 700G, it can't kill it. The split
doesn't matter at all. 2.5 misses this important fix too, btw.

If you ignore this bugfix people will notice, and there's no other way
to fix it completely (unless you want to drop zone-normal and zone-dma
entirely; actually zone-dma matters much less because even if it exists
basically nobody uses it).

>
> Now, regarding mlock(mmap(open(/dev/hda1))) ;)

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 5:48 ` Andrea Arcangeli @ 2002-12-06 6:14 ` William Lee Irwin III 2002-12-06 6:55 ` Andrew Morton 1 sibling, 0 replies; 49+ messages in thread From: William Lee Irwin III @ 2002-12-06 6:14 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Norman Gaywood, linux-kernel On Fri, Dec 06, 2002 at 06:48:04AM +0100, Andrea Arcangeli wrote: > you should because it seems you didn't realize how my code works. the > algorithm is autotuned at boot and depends on the zone sizes, and it > applies to the dma zone too with respect to the normal zone, the highmem > case is just one of the cases that the fix for the general problem > resolves, and you're totally wrong saying that mlocking 700m on a 4G box > could kill it. I call it the per-claszone point of view watermark. If > you are capable of highmem (mlock users are) you must left 100M or 10M > or 10G free on the normal zone (depends on the watermark setting tuned > at boot that is calculated in function of the zone sizes) etc... so it > doesn't matter if you mlock 700M or 700G, it can't kill it. The split > doesn't matter at all. 2.5 misses this important fix too btw. > If you ignore this bugfix people will notice and there's no other way > to fix it completely (unless you want to drop the zone-normal and > zone-dma enterely, actually zone-dma matters much less because even if > it exists basically nobody uses it). This problem is not universal; pure GFP_KERNEL allocations are the main problem here. The fix is necessary for anti-google bits but not a panacea for all workloads. The issue here is basically forkbombs (i.e. databases) with potentially high cross-process sharing. Bill ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 5:48 ` Andrea Arcangeli 2002-12-06 6:14 ` William Lee Irwin III @ 2002-12-06 6:55 ` Andrew Morton 2002-12-06 7:14 ` GrandMasterLee 2002-12-06 14:57 ` Andrea Arcangeli 1 sibling, 2 replies; 49+ messages in thread From: Andrew Morton @ 2002-12-06 6:55 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: William Lee Irwin III, Norman Gaywood, linux-kernel Andrea Arcangeli wrote: > > the > algorithm is autotuned at boot and depends on the zone sizes, and it > applies to the dma zone too with respect to the normal zone, the highmem > case is just one of the cases that the fix for the general problem > resolves, Linus's incremental min will protect ZONE_DMA in the same manner. > and you're totally wrong saying that mlocking 700m on a 4G box > could kill it. It is possible to mlock 700M of the normal zone on a 4G -aa kernel. I can't immediately think of anything apart from vma's which will make it fall over, but it will run like crap. > 2.5 misses this important fix too btw. It does not appear to be an important fix at all. There have been zero reports of it on any mailing list which I read since the google days. Yes, it needs to be addressed. But it is not worth taking 100 megabytes of pagecache away from everyone. That is just a matter of choosing the default value. 2.5 has much bigger problems than this - radix_tree nodes and pte_chains in particular. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 6:55 ` Andrew Morton @ 2002-12-06 7:14 ` GrandMasterLee 2002-12-06 7:25 ` Andrew Morton 2002-12-06 14:57 ` Andrea Arcangeli 1 sibling, 1 reply; 49+ messages in thread From: GrandMasterLee @ 2002-12-06 7:14 UTC (permalink / raw) To: Andrew Morton Cc: Andrea Arcangeli, William Lee Irwin III, Norman Gaywood, linux-kernel On Fri, 2002-12-06 at 00:55, Andrew Morton wrote: > Andrea Arcangeli wrote: [...] > > and you're totally wrong saying that mlocking 700m on a 4G box > > could kill it. > > It is possible to mlock 700M of the normal zone on a 4G -aa kernel. > I can't immediately think of anything apart from vma's which will > make it fall over, but it will run like crap. Just curious, but how long would it take a system with 8GB RAM, using 4G or 64G kernel to fall over? One thing I've noticed, is that 2.4.19aa2 runs great on a box with 8GB when I don't allocate all that much, but seems to run into issues after a large DB has been running on it for several days. (i.e. the system get's generally a little slower, less responsive, and in some cases crashes after 7 days). Yes, I know, sounds like a memory leak in something, but aside from patching Oracle from 8.1.7.4(dba's can't find any new patches ATM), I've tried everything except changing my kernel. Could this be similar behaviour? --The GrandMaster ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 7:14 ` GrandMasterLee @ 2002-12-06 7:25 ` Andrew Morton 2002-12-06 7:34 ` GrandMasterLee 0 siblings, 1 reply; 49+ messages in thread From: Andrew Morton @ 2002-12-06 7:25 UTC (permalink / raw) To: GrandMasterLee Cc: Andrea Arcangeli, William Lee Irwin III, Norman Gaywood, linux-kernel GrandMasterLee wrote: > > On Fri, 2002-12-06 at 00:55, Andrew Morton wrote: > > Andrea Arcangeli wrote: > [...] > > > and you're totally wrong saying that mlocking 700m on a 4G box > > > could kill it. > > > > It is possible to mlock 700M of the normal zone on a 4G -aa kernel. > > I can't immediately think of anything apart from vma's which will > > make it fall over, but it will run like crap. > > Just curious, but how long would it take a system with 8GB RAM, using 4G > or 64G kernel to fall over? A few seconds if you ran the wrong thing. Never if you ran something else. > One thing I've noticed, is that 2.4.19aa2 > runs great on a box with 8GB when I don't allocate all that much, but > seems to run into issues after a large DB has been running on it for > several days. (i.e. the system get's generally a little slower, less > responsive, and in some cases crashes after 7 days). "crashes"? kernel, or application? What additional info is available? > Yes, I know, sounds like a memory leak in something, but aside from > patching Oracle from 8.1.7.4(dba's can't find any new patches ATM), I've > tried everything except changing my kernel. > > Could this be similar behaviour? No, it's something else. Possibly a leak, possibly vma structures. You should wait until the machine is sluggish, then capture the output of: vmstat 1 cat /proc/meminfo cat /proc/slabinfo ps aux ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 7:25 ` Andrew Morton @ 2002-12-06 7:34 ` GrandMasterLee 2002-12-06 7:51 ` Andrew Morton 0 siblings, 1 reply; 49+ messages in thread From: GrandMasterLee @ 2002-12-06 7:34 UTC (permalink / raw) To: Andrew Morton Cc: Andrea Arcangeli, William Lee Irwin III, Norman Gaywood, linux-kernel On Fri, 2002-12-06 at 01:25, Andrew Morton wrote: > GrandMasterLee wrote: > > [...] > > Just curious, but how long would it take a system with 8GB RAM, using 4G > > or 64G kernel to fall over? > > A few seconds if you ran the wrong thing. Never if you ran something > else. > > > One thing I've noticed, is that 2.4.19aa2 > > runs great on a box with 8GB when I don't allocate all that much, but > > seems to run into issues after a large DB has been running on it for > > several days. (i.e. the system get's generally a little slower, less > > responsive, and in some cases crashes after 7 days). > > "crashes"? kernel, or application? What additional info is > available? Machine will panic. I've actually captured some and sent them to this list, but I've been told that my stack was corrupt. Problem is, ATM, I can't find a memory problem. Memtest86 locks up on test 4(as in, machine needs hard booting), no matter if it's 8GB or 4GB RAM installed. An no matter if *known good* ram is being tested as well. So I don't think it's that per se. > > Yes, I know, sounds like a memory leak in something, but aside from > > patching Oracle from 8.1.7.4(dba's can't find any new patches ATM), I've > > tried everything except changing my kernel. > > > > Could this be similar behaviour? > > No, it's something else. Possibly a leak, possibly vma structures. Could that yield a corrupt stack? > You should wait until the machine is sluggish, then capture > the output of: > > vmstat 1 > cat /proc/meminfo > cat /proc/slabinfo > ps aux I shall gather the information sometime 12/06/2002. TIA --The GrandMaster ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 7:34 ` GrandMasterLee @ 2002-12-06 7:51 ` Andrew Morton 2002-12-06 11:37 ` Christoph Hellwig 2002-12-06 16:19 ` GrandMasterLee 0 siblings, 2 replies; 49+ messages in thread From: Andrew Morton @ 2002-12-06 7:51 UTC (permalink / raw) To: GrandMasterLee Cc: Andrea Arcangeli, William Lee Irwin III, Norman Gaywood, linux-kernel GrandMasterLee wrote: > > ... > > "crashes"? kernel, or application? What additional info is > > available? > > Machine will panic. I've actually captured some and sent them to this > list, but I've been told that my stack was corrupt. OK. In your second oops trace the `swapper' process had used 5k of its 8k kernel stack processing an XFS IO completion interrupt. And I don't think `swapper' uses much stack of its own. If some other process happens to be using 3k of stack when the same interrupt hits it, it's game over. So at a guess, I'd say you're being hit by excessive stack use in the XFS filesystem. I think the XFS team have done some work on that recently so an upgrade may help. Or it may be something completely different ;) ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 7:51 ` Andrew Morton @ 2002-12-06 11:37 ` Christoph Hellwig 2002-12-06 16:19 ` GrandMasterLee 1 sibling, 0 replies; 49+ messages in thread From: Christoph Hellwig @ 2002-12-06 11:37 UTC (permalink / raw) To: Andrew Morton Cc: GrandMasterLee, Andrea Arcangeli, William Lee Irwin III, Norman Gaywood, linux-kernel On Thu, Dec 05, 2002 at 11:51:10PM -0800, Andrew Morton wrote: > So at a guess, I'd say you're being hit by excessive stack use in > the XFS filesystem. I think the XFS team have done some work on that > recently so an upgrade may help. Yes, XFS 1.1 used a lot of stack. XFS 1.2pre (and the stuff in 2.5) uses much less. He's also using the qla2xxx drivers that aren't exactly stack-friendly either. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 7:51 ` Andrew Morton 2002-12-06 11:37 ` Christoph Hellwig @ 2002-12-06 16:19 ` GrandMasterLee 1 sibling, 0 replies; 49+ messages in thread From: GrandMasterLee @ 2002-12-06 16:19 UTC (permalink / raw) To: Andrew Morton Cc: Andrea Arcangeli, William Lee Irwin III, Norman Gaywood, linux-kernel On Fri, 2002-12-06 at 01:51, Andrew Morton wrote: > GrandMasterLee wrote: > > > > ... > > > "crashes"? kernel, or application? What additional info is > > > available? > > > > Machine will panic. I've actually captured some and sent them to this > > list, but I've been told that my stack was corrupt. > > OK. In your second oops trace the `swapper' process had used 5k of its > 8k kernel stack processing an XFS IO completion interrupt. And I don't > think `swapper' uses much stack of its own. The second Oops is the *best* one IMO. I got it just over 7 days. (like 7 days 6 hours or something. I've still been testing the crud out of this kernel on like hardware, and can't reproduce it. I'd love to know a method for reproducing this for my beta environment. > If some other process happens to be using 3k of stack when the same > interrupt hits it, it's game over. > > So at a guess, I'd say you're being hit by excessive stack use in > the XFS filesystem. I think the XFS team have done some work on that > recently so an upgrade may help. Since we run ~1TB dbs on the systems, and a LOT of IO, and Qlogic drivers, I think that's the culprit. Will swapper use less stack in more recent kernels?(XFS will be updated as part of a plan for the new year I'm putting together. Till then, it's reboot every 7 days) > Or it may be something completely different ;) I hope not. :) --The GrandMaster ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 6:55 ` Andrew Morton
  2002-12-06 7:14   ` GrandMasterLee
@ 2002-12-06 14:57   ` Andrea Arcangeli
  2002-12-06 15:12     ` William Lee Irwin III
  1 sibling, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06 14:57 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, Norman Gaywood, linux-kernel

On Thu, Dec 05, 2002 at 10:55:53PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > the
> > algorithm is autotuned at boot and depends on the zone sizes, and it
> > applies to the dma zone too with respect to the normal zone, the highmem
> > case is just one of the cases that the fix for the general problem
> > resolves,
>
> Linus's incremental min will protect ZONE_DMA in the same manner.

Of how many bytes?

>
> > and you're totally wrong saying that mlocking 700m on a 4G box
> > could kill it.
>
> It is possible to mlock 700M of the normal zone on a 4G -aa kernel.
> I can't immediately think of anything apart from vma's which will
> make it fall over, but it will run like crap.

You're missing the whole point. The vmas are zone-normal users. You're
saying that you can run out of ZONE_NORMAL if you run
alloc_page(GFP_KERNEL) some hundred thousand times. Yeah, that's not big
news. I'm saying you *can't* run out of zone-normal due to highmem
allocations, i.e. if you run alloc_pages(GFP_HIGHMEM), period. That's a
completely different thing. I thought you understood what the problem
is; I'm not sure why you say you can run out of zone-normal by running
alloc_page(GFP_KERNEL) 100000 times, that has *nothing* to do with the
bug we're discussing here. If you don't want to run out of zone-normal
after 100000 GFP_KERNEL page allocations you can only drop zone-normal.

The bug we're discussing here is that w/o my fix you will run out of
zone-normal despite not having started allocating zone-normal yet, and
despite still having 60G free in the highmem zone. This is what the
patch prevents, nothing more, nothing less. And it's not so much
specific to google, they were just unlucky in triggering it; as said,
just allocate plenty of pagetables (they are highmem capable in my tree
and 2.5) or swapoff -a, and you'll run into the very same scenario that
needs my fix in any normal workload that allocates more than a few
hundred megabytes of ram.

And this is definitely a generic problem, not even specific to Linux;
it's an OS-wide design problem in dealing with the balancing of
different zones that have overlapping but not equivalent capabilities.
It even applies to zone-dma with respect to zone-normal and
zone-highmem, and there's no other fix around it at the moment. Mainline
fixes it in a very weak way, it reserves a few megs only; that's not
nearly enough if you need to allocate more than one more inode etc...
The lowmem reservation must allow the machine to do interesting
workloads for the whole uptime, not just defer the failure by a few
seconds. A few megs aren't nearly enough. If interesting workloads need
a huge zone-normal, just reserve more of it at boot and they will work.
If all the zone-normal isn't enough you fall into a totally different
problem, that is the zone-normal existence in the first place, and it
has nothing to do with this bug; you can fix that other problem only by
dropping zone-normal (of course if you do that you will in turn fix this
problem too, but the problems are different).

The only alternate fix is to be able to migrate pagetables (1st level
only, pte) and all the other highmem capable allocations at runtime
(pagecache, shared memory etc..). Which is clearly not possible in 2.5
and 2.4. Once that is possible/implemented my fix can go away and you
can simply migrate the highmem capable allocations from zone-normal to
highmem. That would be the only alternate, and also dynamic/superior,
fix, but it's not feasible at the moment, at the very least not in 2.4.
It would also have some performance implications; I'm sure lots of
people prefer to throw away 500M of ram in a 32G machine rather than
risking spending the cpu time in memcopies, so it would not be *that*
superior, it would be inferior in some ways. Reserving 500M of ram on a
32G machine doesn't really matter at all, so the current fix is
certainly the best thing we can do for 2.4, and for 2.5 too, unless you
want to implement highmem migration for all highmem capable kernel
objects (which would work fine too).

Also, your possible multiplier via sysctl remains much inferior to my
fix, which is able to cleanly enforce classzone-point-of-view watermarks
(not fixed watermarks); you would need to change the multiplier
depending on the zone size and depending on the zone to make it
equivalent. So yes, you could implement it equivalently, but it would be
much less clean and readable than my current code (and harder to tune
with a kernel parameter at boot like my current fix is).

> > 2.5 misses this important fix too btw.
>
> It does not appear to be an important fix at all. There have been

Well, if you ignore it people can use my tree; I personally need that
fix for myself on big boxes, so I'm going to retain it in one form or
another (the form in mainline is too weak as said, and just adding a
multiplier would not be equivalent, as said above).

> 2.5 has much bigger problems than this - radix_tree nodes and pte_chains
> in particular.

I'm not saying there aren't bigger problems in 2.5, but I don't classify
this one as a minor one; in fact it was a showstopper for a long time in
2.4 (one of the last ones) until I fixed it, and it still is a problem
because the 2.4 fix is way too weak (a few megs aren't enough to
guarantee big workloads will succeed).

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread
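To make the "classzone point of view" watermark idea concrete, here is a toy
model; the names, the reserve sizing and the /32 ratio are invented for
illustration (the ratio is only chosen so that the numbers land near the
100M-on-4G figure discussed above), and none of this is code from either
tree:

/* classzone-reserve.c -- a toy model of the lowmem-reservation idea.
 * Build with: gcc -O2 -o classzone-reserve classzone-reserve.c
 */
#include <stdio.h>

#define NR_ZONES 3
enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

struct zone {
	long free_pages;		/* 4K pages currently free */
	long reserve[NR_ZONES];		/* pages kept back, per requesting class */
};

/* Size the reserves at "boot": zone i keeps back a slice for every request
 * class j above it, growing with the amount of memory sitting above it. */
static void setup_reserves(struct zone *z)
{
	int i, j, k;

	for (i = 0; i < NR_ZONES; i++)
		for (j = i + 1; j < NR_ZONES; j++) {
			long above = 0;

			for (k = i + 1; k <= j; k++)
				above += z[k].free_pages;
			z[i].reserve[j] = above / 32;	/* invented ratio */
		}
}

/* May a request whose highest usable zone is `class` take a page from zone
 * i?  Highmem-capable requests see a bigger floor in the lower zones, so
 * they can never exhaust them. */
static int may_allocate(const struct zone *z, int i, int class)
{
	return z[i].free_pages > z[i].reserve[class];
}

int main(void)
{
	/* roughly a 4G box: 16M DMA, ~880M normal, ~3.2G highmem */
	struct zone z[NR_ZONES] = {
		{ 16L * 256, { 0 } },
		{ 880L * 256, { 0 } },
		{ 3200L * 256, { 0 } },
	};
	static const char *name[NR_ZONES] = { "DMA", "Normal", "HighMem" };
	int i;

	setup_reserves(z);

	/* later: highmem has been eaten, only ~20M left in the normal zone */
	z[ZONE_HIGHMEM].free_pages = 0;
	z[ZONE_NORMAL].free_pages = 20L * 256;

	for (i = ZONE_HIGHMEM; i >= 0; i--)
		printf("GFP_HIGHUSER may take from %-7s: %s\n", name[i],
		       may_allocate(z, i, ZONE_HIGHMEM) ? "yes" : "no");
	for (i = ZONE_NORMAL; i >= 0; i--)
		printf("GFP_KERNEL   may take from %-7s: %s\n", name[i],
		       may_allocate(z, i, ZONE_NORMAL) ? "yes" : "no");
	return 0;
}

With the normal zone down to ~20M free, the highmem-capable request is
refused normal-zone (and DMA) pages while the GFP_KERNEL request can still
take from the normal zone, which is the behaviour being argued for above.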
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 14:57 ` Andrea Arcangeli @ 2002-12-06 15:12 ` William Lee Irwin III 2002-12-06 23:32 ` Andrea Arcangeli 0 siblings, 1 reply; 49+ messages in thread From: William Lee Irwin III @ 2002-12-06 15:12 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Norman Gaywood, linux-kernel On Fri, Dec 06, 2002 at 03:57:19PM +0100, Andrea Arcangeli wrote: > The only alternate fix is to be able to migrate pagetables (1st level > only, pte) and all the other highmem capable allocations at runtime > (pagecache, shared memory etc..). Which is clearly not possible in 2.5 > and 2.4. Actually it should not be difficult for 2.5, though it's not done now. Shared pagetables would complicate the implementation slightly. I've gotten 100% backlash from my proposals in this area, so I'm not touching it at all out of aggravation or whatever. Bill ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 15:12 ` William Lee Irwin III @ 2002-12-06 23:32 ` Andrea Arcangeli 2002-12-06 23:45 ` William Lee Irwin III 0 siblings, 1 reply; 49+ messages in thread From: Andrea Arcangeli @ 2002-12-06 23:32 UTC (permalink / raw) To: William Lee Irwin III, Andrew Morton, Norman Gaywood, linux-kernel On Fri, Dec 06, 2002 at 07:12:20AM -0800, William Lee Irwin III wrote: > On Fri, Dec 06, 2002 at 03:57:19PM +0100, Andrea Arcangeli wrote: > > The only alternate fix is to be able to migrate pagetables (1st level > > only, pte) and all the other highmem capable allocations at runtime > > (pagecache, shared memory etc..). Which is clearly not possible in 2.5 > > and 2.4. > > Actually it should not be difficult for 2.5, though it's not done now. "difficult" is a relative word, nothing is difficult but everything is difficult, depends the way you feel about it. but note that even with rmap you don't know the pmd that points to the pte that you want to relocate and for the anon pages you miss information about mm and virtual address where those pages are allocated, so basically rmap is useless for doing it, you need to do the pagetable walking ala swap_out, in turn it's not easier at all in 2.5 than it could been in 2.4 (but of course this is a 2.5 thing only, I just want to say that if it's not difficult in 2.5 it wasn't difficult in 2.4 either). Andrea ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 23:32 ` Andrea Arcangeli
@ 2002-12-06 23:45 ` William Lee Irwin III
  2002-12-06 23:57   ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: William Lee Irwin III @ 2002-12-06 23:45 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, Norman Gaywood, linux-kernel

On Sat, Dec 07, 2002 at 12:32:43AM +0100, Andrea Arcangeli wrote:
> but note that even with rmap you don't know the pmd that points to the
> pte that you want to relocate and for the anon pages you miss
> information about mm and virtual address where those pages are
> allocated, so basically rmap is useless for doing it, you need to do the
> pagetable walking ala swap_out, in turn it's not easier at all in 2.5
> than it could been in 2.4 (but of course this is a 2.5 thing only, I
> just want to say that if it's not difficult in 2.5 it wasn't difficult
> in 2.4 either).

Actually, we do. From include/asm-generic/rmap.h:

static inline void pgtable_add_rmap(struct page * page, struct mm_struct * mm, unsigned long address)
{
#ifdef BROKEN_PPC_PTE_ALLOC_ONE
	/* OK, so PPC calls pte_alloc() before mem_map[] is setup ... ;( */
	extern int mem_init_done;

	if (!mem_init_done)
		return;
#endif
	page->mapping = (void *)mm;
	page->index = address & ~((PTRS_PER_PTE * PAGE_SIZE) - 1);
	inc_page_state(nr_page_table_pages);
}

So pagetable pages are tagged with the right information, and in
principle could even be tagged here with the pmd in page->private.
These fields are actually required for use by try_to_unmap_one(),
and something similar could be done for a try_to_move_one(). This
information remains intact with shared pagetables, and is generalized
so that the PTE page is tagged with a list of mm's (the mm_chain),
and in that case no unique pmd could be directly stored in the page,
but it could just as easily be derived from the mm's in the mm_chain.

But there's no denying it would involve a substantial amount of work.

Bill

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 23:45 ` William Lee Irwin III @ 2002-12-06 23:57 ` Andrea Arcangeli 0 siblings, 0 replies; 49+ messages in thread From: Andrea Arcangeli @ 2002-12-06 23:57 UTC (permalink / raw) To: William Lee Irwin III, Andrew Morton, Norman Gaywood, linux-kernel On Fri, Dec 06, 2002 at 03:45:24PM -0800, William Lee Irwin III wrote: > On Sat, Dec 07, 2002 at 12:32:43AM +0100, Andrea Arcangeli wrote: > > but note that even with rmap you don't know the pmd that points to the > > pte that you want to relocate and for the anon pages you miss > > information about mm and virtual address where those pages are > > allocated, so basically rmap is useless for doing it, you need to do the > > pagetable walking ala swap_out, in turn it's not easier at all in 2.5 > > than it could been in 2.4 (but of course this is a 2.5 thing only, I > > just want to say that if it's not difficult in 2.5 it wasn't difficult > > in 2.4 either). > > Actually, we do. From include/asm-generic/rmap.h: > > static inline void pgtable_add_rmap(struct page * page, struct mm_struct * mm, unsigned long address) > { > #ifdef BROKEN_PPC_PTE_ALLOC_ONE > /* OK, so PPC calls pte_alloc() before mem_map[] is setup ... ;( */ > extern int mem_init_done; > > if (!mem_init_done) > return; > #endif > page->mapping = (void *)mm; > page->index = address & ~((PTRS_PER_PTE * PAGE_SIZE) - 1); > inc_page_state(nr_page_table_pages); > } > > So pagetable pages are tagged with the right information, and in > principle could even be tagged here with the pmd in page->private. sorry I didn't noticed the overlap of page->mapping to store the mm. But yes, I should have realized that you had do because otherwise you wouldn't know how to flush the tlb ;) so without the mm and address rmap would be useless. So via the address and mapping you can walk the pagetables and reach it with lower complexity than w/o rmap. Still doing the pagetable walk wouldn't be an huge increase in complexity but it would increase the "computational" complexity of the algorithm. > These fields are actually required for use by try_to_unmap_one(), > and something similar could be done for a try_to_move_one(). This > information remains intact with shared pagetables, and is generalized > so that the PTE page is tagged with a list of mm's (the mm_chain), > and in that case no unique pmd could be directly stored in the page, > but it could just as easily be derived from the mm's in the mm_chain. > > But there's no denying it would involve a substantial amount of work. > > > Bill Andrea ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 5:25 ` Andrew Morton 2002-12-06 5:48 ` Andrea Arcangeli @ 2002-12-06 6:00 ` William Lee Irwin III 1 sibling, 0 replies; 49+ messages in thread From: William Lee Irwin III @ 2002-12-06 6:00 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, Norman Gaywood, linux-kernel William Lee Irwin III wrote: >> Yes, it's necessary; no, I've never directly encountered the issue it >> fixes. Sorry about the miscommunication there. On Thu, Dec 05, 2002 at 09:25:15PM -0800, Andrew Morton wrote: > Linus's approach was to raise the ZONE_NORMAL pages_min limit for > allocations which _could_ use highmem. So a GFP_HIGHUSER allocation > has a pages_min limit of (say) 4M when considering the normal zone, > but a GFP_KERNEL allocation has a limit of 2M. > Andrea's patch does the same thing, via a separate table. He has > set the threshold much higher (100M on a 4G box). AFAICT, the > algorithms are identical - I was planning on just adding a multiplier > to set Linus's ratio - it is currently hardwired to "1". Search for > "mysterious" in mm/page_alloc.c ;) There's no mystery here aside from a couple of magic numbers and a not-very-well-explained admission control policy. Tweaking magic numbers a la 2.4.x-aa until more infrastructure is available (2.7) sounds good to me. Thanks, Bill ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 2:41 ` William Lee Irwin III
  2002-12-06 5:25   ` Andrew Morton
@ 2002-12-06 22:28   ` Andrea Arcangeli
  2002-12-06 23:21     ` William Lee Irwin III
  1 sibling, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06 22:28 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton, Norman Gaywood, linux-kernel

On Thu, Dec 05, 2002 at 06:41:40PM -0800, William Lee Irwin III wrote:
> No idea why there's not more support behind or interest in page
> clustering. It's an optimization (not required) for 64-bit/saner arches.

softpagesize sounds like a good idea to try for archs with a page size
< 8k indeed, modulo a few places where the 4k pagesize is part of the
userspace abi. For that reason on x86-64 Andi recently suggested
changing the abi to assume a bigger page size, and I suggested assuming
it to be 2M and not something smaller as originally suggested; that way
we waste some more virtual space (not an issue on 64bit) and some cache
colors (not a big deal either, those caches are multiway associative
even if not fully associative), so eventually in theory we could even
switch the page size to 2M ;)

However, don't mistake softpagesize for PAGE_CACHE_SIZE (the latter I
think was completed some time ago by Hugh). I think PAGE_CACHE_SIZE is a
broken idea (I'm talking about PAGE_CACHE_SIZE at large, not about the
implementation, which may even be fine with Hugh's patch applied).
PAGE_CACHE_SIZE will never work well due to the fragmentation problems
it introduces. So I definitely vote for dropping PAGE_CACHE_SIZE and
experimenting with a soft PAGE_SIZE, a multiple of the hardware
PAGE_SIZE. That means the allocator's minimal granularity will return
8k. On x86 that breaks the ABI a bit. On x86-64 the softpagesize would
break only the 32bit compatibility mode abi a little, so it would be
even less severe. And I think the softpagesize should be a config option
so it can be experimented with without breaking the default config even
on x86.

The soft PAGE_SIZE will also decrease the page fault rate by an order of
magnitude; the number of ptes will be the same, but we'll cluster the
pte refills, all served from the same I/O anyways (readahead usually
loads the next pages too anyways). So it's a kind of quite obvious
design optimization to experiment with (maybe for 2.7?).

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 22:28 ` Andrea Arcangeli @ 2002-12-06 23:21 ` William Lee Irwin III 2002-12-06 23:50 ` Andrea Arcangeli 2002-12-07 0:01 ` Andrew Morton 0 siblings, 2 replies; 49+ messages in thread From: William Lee Irwin III @ 2002-12-06 23:21 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Norman Gaywood, linux-kernel On Thu, Dec 05, 2002 at 06:41:40PM -0800, William Lee Irwin III wrote: >> No idea why there's not more support behind or interest in page >> clustering. It's an optimization (not required) for 64-bit/saner arches. On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote: > softpagesize sounds a good idea to try for archs with a page size < 8k > indeed, modulo a few places where the 4k pagesize is part of the > userspace abi, for that reason on x86-64 Andi recently suggested to > changed the abi to assume a bigger page size and I suggested to assume > it to be 2M and not a smaller thing as originally suggested, that way we > waste some more virtual space (not an issue on 64bit) and some cache > color (not a big deal either, those caches are multiway associative even > if not fully associative), so eventually in theory we could even switch > the page size to 2M ;) The patch I'm talking about introduces a distinction between the size of an area mapped by a PTE or TLB entry (MMUPAGE_SIZE) and the kernel's internal allocation unit (PAGE_SIZE), and does (AFAICT) properly vectored PTE operations in the VM to support the system's native page size, and does a whole kernel audit of drivers/ and fs/ PAGE_SIZE usage so that the distinction between PAGE_SIZE and MMUPAGE_SIZE is understood. On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote: > however don't mistake softpagesize with the PAGE_CACHE_SIZE (the latter > I think was completed some time ago by Hugh). I think PAGE_CACHE_SIZE > is a broken idea (i'm talking about the PAGE_CACHE_SIZE at large, not > about the implementation that may even be fine with Hugh's patch > applied). PAGE_CACHE_SIZE is mostly an fs thing, so there's not much danger of confusion, at least not in my mind. On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote: > PAGE_CACHE_SIZE will never work well due the fragmentation problems it > introduces. So I definitely vote for dropping PAGE_CACHE_SIZE and to > experiment with a soft PAGE_SIZE, multiple of the hardware PAGE_SIZE. > That means the allocator minimal granularity will return 8k. on x86 that > breaks a bit the ABI. on x86-64 the softpagesize would breaks only the 32bit > compatibilty mode abi a little so it would be even less severe. And I > think the softpagesize should be a config option so it can be > experimented without breaking the default config even on x86. Hmm, from the appearances of the patch (my ability to test the patch is severely hampered by its age) it should actually maintain hardware pagesize mmap() granularity, ABI compatibility, etc. On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote: > the soft PAGE_SIZE will also decrease of an order of magnitude the page > fault rate, the number of pte will be the same but we'll cluster the pte > refills all served from the same I/O anyways (readhaead usually loads > the next pages too anyways). So it's a kind of quite obvious design > optimization to experiment with (maybe for 2.7?). Sounds like the right timing for me. A 16KB or 64KB kernel allocation unit would then annihilate sizeof(mem_map) concerns on 3/1 splits. 720MB -> 180MB or 45MB. 
Or on my home machine (768MB PC) 6MB -> 1.5MB or 384KB, which is a substantial reduction in cache footprint and outright memory footprint. I think this is a perfect example of how the increased awareness of space consumption highmem gives us helps us optimize all boxen. Thanks, Bill ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 23:21 ` William Lee Irwin III @ 2002-12-06 23:50 ` Andrea Arcangeli 2002-12-07 0:30 ` William Lee Irwin III 2002-12-07 0:01 ` Andrew Morton 1 sibling, 1 reply; 49+ messages in thread From: Andrea Arcangeli @ 2002-12-06 23:50 UTC (permalink / raw) To: William Lee Irwin III, Andrew Morton, Norman Gaywood, linux-kernel On Fri, Dec 06, 2002 at 03:21:25PM -0800, William Lee Irwin III wrote: > On Thu, Dec 05, 2002 at 06:41:40PM -0800, William Lee Irwin III wrote: > >> No idea why there's not more support behind or interest in page > >> clustering. It's an optimization (not required) for 64-bit/saner arches. > > On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote: > > softpagesize sounds a good idea to try for archs with a page size < 8k > > indeed, modulo a few places where the 4k pagesize is part of the > > userspace abi, for that reason on x86-64 Andi recently suggested to > > changed the abi to assume a bigger page size and I suggested to assume > > it to be 2M and not a smaller thing as originally suggested, that way we > > waste some more virtual space (not an issue on 64bit) and some cache > > color (not a big deal either, those caches are multiway associative even > > if not fully associative), so eventually in theory we could even switch > > the page size to 2M ;) > > The patch I'm talking about introduces a distinction between the size > of an area mapped by a PTE or TLB entry (MMUPAGE_SIZE) and the kernel's > internal allocation unit (PAGE_SIZE), and does (AFAICT) properly > vectored PTE operations in the VM to support the system's native page > size, and does a whole kernel audit of drivers/ and fs/ PAGE_SIZE usage > so that the distinction between PAGE_SIZE and MMUPAGE_SIZE is understood. My point is that making any distinction will lead to inevitable fragmentation of memory. Going to an higher kernel wide PAGE_SIZE and avoiding the distinction will even fix the 8k fragmentation issue with the kernel stack ;) Not to tell allowing more workloads to be able to use all ram of the 32bit 64G boxes. > On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote: > > however don't mistake softpagesize with the PAGE_CACHE_SIZE (the latter > > I think was completed some time ago by Hugh). I think PAGE_CACHE_SIZE > > is a broken idea (i'm talking about the PAGE_CACHE_SIZE at large, not > > about the implementation that may even be fine with Hugh's patch > > applied). > > PAGE_CACHE_SIZE is mostly an fs thing, so there's not much danger of > confusion, at least not in my mind. ok, I thought MMUPAGE_SIZE and PAGE_CACHE_SIZE were related, but of course they don't need to. > On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote: > > PAGE_CACHE_SIZE will never work well due the fragmentation problems it > > introduces. So I definitely vote for dropping PAGE_CACHE_SIZE and to > > experiment with a soft PAGE_SIZE, multiple of the hardware PAGE_SIZE. > > That means the allocator minimal granularity will return 8k. on x86 that > > breaks a bit the ABI. on x86-64 the softpagesize would breaks only the 32bit > > compatibilty mode abi a little so it would be even less severe. And I > > think the softpagesize should be a config option so it can be > > experimented without breaking the default config even on x86. > > Hmm, from the appearances of the patch (my ability to test the patch > is severely hampered by its age) it should actually maintain hardware > pagesize mmap() granularity, ABI compatibility, etc. 
If it only implements the MMUPAGE_SIZE, yes, it can. You break the ABI as soon as you change the kernel wide PAGE_SIZE. it is allowed only on 64bit binaries running on a x86-64 kernel. The 32bit binaries running in compatibility mode as said would suffer a bit, but most things should run and we can make hacks like using anon mappings if the files are small just for the sake of running some app 32bit (like we use anon mappings for a.out binaries needing 1k offsets today). Said that even the MMUPAGE_SIZE alone would be useful, but I'd prefer if the kernel wide PAGE_SIZE would be increased (with the disavantage of breaking the ABI, but it would be a config option, even the 2G/3.5G/1G split has the chance of breaking some app despite I wouldn't classify it as an ABI violation for the reason explained in one of the last emails). > On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote: > > the soft PAGE_SIZE will also decrease of an order of magnitude the page > > fault rate, the number of pte will be the same but we'll cluster the pte > > refills all served from the same I/O anyways (readhaead usually loads > > the next pages too anyways). So it's a kind of quite obvious design > > optimization to experiment with (maybe for 2.7?). > > Sounds like the right timing for me. > > A 16KB or 64KB kernel allocation unit would then annihilate > sizeof(mem_map) concerns on 3/1 splits. 720MB -> 180MB or 45MB. > > Or on my home machine (768MB PC) 6MB -> 1.5MB or 384KB, which > is a substantial reduction in cache footprint and outright > memory footprint. Yep. > > I think this is a perfect example of how the increased awareness of > space consumption highmem gives us helps us optimize all boxen. In this case funnily it has a chance to help some 64bit boxes too ;). Andrea ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 23:50 ` Andrea Arcangeli @ 2002-12-07 0:30 ` William Lee Irwin III 0 siblings, 0 replies; 49+ messages in thread From: William Lee Irwin III @ 2002-12-07 0:30 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Norman Gaywood, linux-kernel At some point in the past, I wrote: > My point is that making any distinction will lead to inevitable > fragmentation of memory. It's mostly userspace; the kernel is usually (hello drivers/ !) cautious and uses slab.c's anti-internal fragmentation techniques for most structs. At some point in the past, I wrote: >> Hmm, from the appearances of the patch (my ability to test the patch >> is severely hampered by its age) it should actually maintain hardware >> pagesize mmap() granularity, ABI compatibility, etc. On Sat, Dec 07, 2002 at 12:50:32AM +0100, Andrea Arcangeli wrote: > If it only implements the MMUPAGE_SIZE, yes, it can. > You break the ABI as soon as you change the kernel wide PAGE_SIZE. it is > allowed only on 64bit binaries running on a x86-64 kernel. The 32bit > binaries running in compatibility mode as said would suffer a bit, but > most things should run and we can make hacks like using anon mappings if > the files are small just for the sake of running some app 32bit (like we > use anon mappings for a.out binaries needing 1k offsets today). I'm not sure what to make of this. The distinction and PTE vectoring API (AFAICT) allows PTE's to map sub-PAGE_SIZE-sized (MMUPAGE_SIZE to be exact) regions. Someone start screaming if I misread the patch. On Sat, Dec 07, 2002 at 12:50:32AM +0100, Andrea Arcangeli wrote: > Said that even the MMUPAGE_SIZE alone would be useful, but I'd prefer if > the kernel wide PAGE_SIZE would be increased (with the disavantage of > breaking the ABI, but it would be a config option, even the 2G/3.5G/1G > split has the chance of breaking some app despite I wouldn't classify it > as an ABI violation for the reason explained in one of the last emails). Userspace is required to have >= 3GB of virtualspace, according to the SVR4 i386 ABI spec. But we don't follow that strictly anyway. At some point in the past, I wrote: >> I think this is a perfect example of how the increased awareness of >> space consumption highmem gives us helps us optimize all boxen. On Sat, Dec 07, 2002 at 12:50:32AM +0100, Andrea Arcangeli wrote: > In this case funnily it has a chance to help some 64bit boxes too ;). I've heard the sizeof(mem_map) footprint is worse on 64-bit because while PAGE_SIZE remains the same, but pointers double in size. This would help a bit there, too. Bill ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 23:21 ` William Lee Irwin III 2002-12-06 23:50 ` Andrea Arcangeli @ 2002-12-07 0:01 ` Andrew Morton 2002-12-07 0:21 ` William Lee Irwin III ` (3 more replies) 1 sibling, 4 replies; 49+ messages in thread From: Andrew Morton @ 2002-12-07 0:01 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Andrea Arcangeli, Norman Gaywood, linux-kernel William Lee Irwin III wrote: > > ... > A 16KB or 64KB kernel allocation unit would then annihilate You want to be careful about this: CPU: L1 I cache: 16K, L1 D cache: 16K Because instantiating a 16k page into user pagetables in one hit means that it must all be zeroed. With these large pagesizes that means that the application is likely to get 100% L1 misses against the new page, whereas it currently gets 100% hits. I'd expect this performance dropoff to occur when going from 8k to 16k. By the time you get to 32k it would be quite bad. One way to address this could be to find a way of making the pages present, but still cause a fault on first access. Then have a special-case fastpath in the fault handler to really wipe the page just before it is used. I don't know how though - maybe _PAGE_USER? get_user_pages() would need attention too - you don't want to allow the user to perform O_DIRECT writes of uninitialised pages to their files... ^ permalink raw reply [flat|nested] 49+ messages in thread
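To put rough numbers on the cache effect Andrew describes, a back-of-the-envelope sketch; the 32-byte line size is an assumption, the 16KB L1 figure is the one quoted in his message:

/* Zeroing a freshly instantiated soft page dirties page_size / line_size
 * cache lines.  With 32-byte lines and the 16KB L1 D-cache above:
 *   4KB  page:  128 lines  (a quarter of the L1)
 *   16KB page:  512 lines  (the whole L1 displaced)
 *   64KB page: 2048 lines  (4x the L1 -- everything evicted, several times over) */
static inline unsigned long l1_lines_dirtied(unsigned long page_size,
                                             unsigned long line_size)
{
        return page_size / line_size;
}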
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-07 0:01 ` Andrew Morton @ 2002-12-07 0:21 ` William Lee Irwin III 2002-12-07 0:30 ` Andrew Morton 2002-12-07 2:19 ` Alan Cox 2002-12-07 0:22 ` Andrea Arcangeli ` (2 subsequent siblings) 3 siblings, 2 replies; 49+ messages in thread From: William Lee Irwin III @ 2002-12-07 0:21 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, Norman Gaywood, linux-kernel > William Lee Irwin III wrote: > > > > ... > > A 16KB or 64KB kernel allocation unit would then annihilate > On Fri, Dec 06, 2002 at 04:01:24PM -0800, Andrew Morton wrote: > You want to be careful about this: > CPU: L1 I cache: 16K, L1 D cache: 16K > Because instantiating a 16k page into user pagetables in > one hit means that it must all be zeroed. With these large > pagesizes that means that the application is likely to get > 100% L1 misses against the new page, whereas it currently > gets 100% hits. 16K is reasonable; after that one might as well go all the way. About the only way to cope is amortizing it by cacheing zeroed pages, and that has other downsides. Bill ^ permalink raw reply [flat|nested] 49+ messages in thread
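For what it's worth, a minimal sketch of the "cacheing zeroed pages" amortization Bill mentions; this is purely illustrative, not from any posted patch, assumes 2.4-era helpers like alloc_page() and clear_highpage(), and omits all locking:

#include <linux/mm.h>
#include <linux/highmem.h>

/* Illustrative only: a tiny per-CPU pool of pre-zeroed pages the fault
 * path could pop from, moving the clear-page cost out of the fault itself. */
#define ZERO_POOL_SIZE 16

static struct page *zero_pool[NR_CPUS][ZERO_POOL_SIZE];
static int zero_count[NR_CPUS];

static struct page *get_prezeroed_page(int cpu)
{
        if (zero_count[cpu])                    /* fast path: already zeroed */
                return zero_pool[cpu][--zero_count[cpu]];
        return NULL;                            /* caller zeroes inline, as today */
}

static void refill_zero_pool(int cpu)           /* e.g. run from the idle loop */
{
        while (zero_count[cpu] < ZERO_POOL_SIZE) {
                struct page *page = alloc_page(GFP_HIGHUSER);
                if (!page)
                        break;
                clear_highpage(page);           /* the cost being amortized */
                zero_pool[cpu][zero_count[cpu]++] = page;
        }
}

The obvious catch is that a page zeroed long before use is stone cold in cache by the time it is finally mapped, which may be part of what Andrew is objecting to below.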
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-07 0:21 ` William Lee Irwin III @ 2002-12-07 0:30 ` Andrew Morton 2002-12-07 2:19 ` Alan Cox 1 sibling, 0 replies; 49+ messages in thread From: Andrew Morton @ 2002-12-07 0:30 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Andrea Arcangeli, Norman Gaywood, linux-kernel William Lee Irwin III wrote: > > > William Lee Irwin III wrote: > > > > > > ... > > > A 16KB or 64KB kernel allocation unit would then annihilate > > > On Fri, Dec 06, 2002 at 04:01:24PM -0800, Andrew Morton wrote: > > You want to be careful about this: > > CPU: L1 I cache: 16K, L1 D cache: 16K > > Because instantiating a 16k page into user pagetables in > > one hit means that it must all be zeroed. With these large > > pagesizes that means that the application is likely to get > > 100% L1 misses against the new page, whereas it currently > > gets 100% hits. > > 16K is reasonable; after that one might as well go all the way. 16k will cause serious slowdowns. > About the only way to cope is amortizing it by cacheing zeroed pages, > and that has other downsides. So will that. You've seen the kernbench profiles... You will need to find a way to clear the page just before it is used. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-07 0:21 ` William Lee Irwin III 2002-12-07 0:30 ` Andrew Morton @ 2002-12-07 2:19 ` Alan Cox 2002-12-07 1:46 ` William Lee Irwin III 1 sibling, 1 reply; 49+ messages in thread From: Alan Cox @ 2002-12-07 2:19 UTC (permalink / raw) To: William Lee Irwin III Cc: Andrew Morton, Andrea Arcangeli, Norman Gaywood, Linux Kernel Mailing List On Sat, 2002-12-07 at 00:21, William Lee Irwin III wrote: > 16K is reasonable; after that one might as well go all the way. > About the only way to cope is amortizing it by cacheing zeroed pages, > and that has other downsides. Some of the lower-end CPUs only have about 12-16K of L1. I don't think that's a big problem, since those aren't going to be highmem or large memory users. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-07 2:19 ` Alan Cox @ 2002-12-07 1:46 ` William Lee Irwin III 2002-12-07 1:56 ` Andrea Arcangeli 2002-12-07 2:31 ` Alan Cox 0 siblings, 2 replies; 49+ messages in thread From: William Lee Irwin III @ 2002-12-07 1:46 UTC (permalink / raw) To: Alan Cox Cc: Andrew Morton, Andrea Arcangeli, Norman Gaywood, Linux Kernel Mailing List On Sat, 2002-12-07 at 00:21, William Lee Irwin III wrote: >> 16K is reasonable; after that one might as well go all the way. >> About the only way to cope is amortizing it by cacheing zeroed pages, >> and that has other downsides. On Sat, Dec 07, 2002 at 02:19:49AM +0000, Alan Cox wrote: > Some of the lower end CPU's only have about 12-16K of L1. I don't think > thats a big problem since those aren't going to be highmem or large > memory users It's an arch parameter, so they'd probably just #define MMUPAGE_SIZE PAGE_SIZE Hugh's original patch did that for all non-i386 arches. Bill ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-07 1:46 ` William Lee Irwin III @ 2002-12-07 1:56 ` Andrea Arcangeli 2002-12-07 2:31 ` Alan Cox 1 sibling, 0 replies; 49+ messages in thread From: Andrea Arcangeli @ 2002-12-07 1:56 UTC (permalink / raw) To: William Lee Irwin III, Alan Cox, Andrew Morton, Norman Gaywood, Linux Kernel Mailing List On Fri, Dec 06, 2002 at 05:46:43PM -0800, William Lee Irwin III wrote: > On Sat, 2002-12-07 at 00:21, William Lee Irwin III wrote: > >> 16K is reasonable; after that one might as well go all the way. > >> About the only way to cope is amortizing it by cacheing zeroed pages, > >> and that has other downsides. > > On Sat, Dec 07, 2002 at 02:19:49AM +0000, Alan Cox wrote: > > Some of the lower-end CPUs only have about 12-16K of L1. I don't think > > that's a big problem, since those aren't going to be highmem or large > > memory users. > > It's an arch parameter, so they'd probably just > #define MMUPAGE_SIZE PAGE_SIZE > Hugh's original patch did that for all non-i386 arches. I would say the most important thing to evaluate, before the CPU and cache size, is the amount of RAM in the machine. The major downside of going to 8k is the loss of granularity in the paging, so a small machine may not want to page in the next page too unless it's been explicitly touched by the program, to make the best use of the little RAM available and to have the most fine-grained information possible about the working set in the pagetables. The breakpoint depends on the workload; it would probably make sense to keep all boxes <= 64M at 4k, or something along those lines. Andrea ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-07 1:46 ` William Lee Irwin III 2002-12-07 1:56 ` Andrea Arcangeli @ 2002-12-07 2:31 ` Alan Cox 2002-12-07 2:09 ` William Lee Irwin III 1 sibling, 1 reply; 49+ messages in thread From: Alan Cox @ 2002-12-07 2:31 UTC (permalink / raw) To: William Lee Irwin III Cc: Andrew Morton, Andrea Arcangeli, Norman Gaywood, Linux Kernel Mailing List On Sat, 2002-12-07 at 01:46, William Lee Irwin III wrote: > It's an arch parameter, so they'd probably just > #define MMUPAGE_SIZE PAGE_SIZE > Hugh's original patch did that for all non-i386 arches. These are low-end x86 - but we could do this based on <= i586 / i586 / i686+ ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-07 2:31 ` Alan Cox @ 2002-12-07 2:09 ` William Lee Irwin III 0 siblings, 0 replies; 49+ messages in thread From: William Lee Irwin III @ 2002-12-07 2:09 UTC (permalink / raw) To: Alan Cox Cc: Andrew Morton, Andrea Arcangeli, Norman Gaywood, Linux Kernel Mailing List On Sat, 2002-12-07 at 01:46, William Lee Irwin III wrote: >> It's an arch parameter, so they'd probably just >> #define MMUPAGE_SIZE PAGE_SIZE >> Hugh's original patch did that for all non-i386 arches. On Sat, Dec 07, 2002 at 02:31:37AM +0000, Alan Cox wrote: > These are low end x86 - but we could this based on > <= i586 > i586 > i686+ It's relatively flexible as to the choice of PAGE_SIZE (it's MMUPAGE_SIZE that's defined by hardware); about the only constraints are that jacking it up where PAGE_SIZE spans pmd's breaks the core vectoring API, PAGE_SIZE >= MMUPAGE_SIZE, both are powers of 2, the vectors (which are of size MMUPAGE_COUNT*sizeof(pte_t *)) are stack- allocated, and arch code has to understand small bits of it. It sounds like we could pick sane defaults based on CPU revision. Bill ^ permalink raw reply [flat|nested] 49+ messages in thread
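A sketch of the constraints Bill lists, written as compile-time checks plus the shape of the stack-allocated PTE vector; the macro names follow the patch, the rest is an illustrative assumption:

/* Both sizes must be powers of two, PAGE_SIZE >= MMUPAGE_SIZE, and a
 * kernel page must not span a pmd, or the core vectoring API breaks. */
#if (PAGE_SIZE & (PAGE_SIZE - 1)) || (MMUPAGE_SIZE & (MMUPAGE_SIZE - 1))
#error "PAGE_SIZE and MMUPAGE_SIZE must be powers of two"
#endif
#if PAGE_SIZE < MMUPAGE_SIZE
#error "PAGE_SIZE must be >= MMUPAGE_SIZE"
#endif
#if PAGE_SIZE > PMD_SIZE
#error "a kernel page must not span pmd entries"
#endif

static void fill_pte_vector(struct page *page, unsigned long addr)
{
        /* one kernel page is backed by MMUPAGE_COUNT hardware PTEs,
         * gathered into a MMUPAGE_COUNT * sizeof(pte_t *) vector on
         * the stack and installed in one pass */
        pte_t *ptes[MMUPAGE_COUNT];

        /* ... look up the page table entries covering addr and point
         *     each at successive MMUPAGE_SIZE pieces of page ... */
}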
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-07 0:01 ` Andrew Morton 2002-12-07 0:21 ` William Lee Irwin III @ 2002-12-07 0:22 ` Andrea Arcangeli 2002-12-07 0:35 ` Andrew Morton 2002-12-07 0:46 ` William Lee Irwin III 2002-12-07 10:55 ` Arjan van de Ven 3 siblings, 1 reply; 49+ messages in thread From: Andrea Arcangeli @ 2002-12-07 0:22 UTC (permalink / raw) To: Andrew Morton; +Cc: William Lee Irwin III, Norman Gaywood, linux-kernel On Fri, Dec 06, 2002 at 04:01:24PM -0800, Andrew Morton wrote: > William Lee Irwin III wrote: > > > > ... > > A 16KB or 64KB kernel allocation unit would then annihilate > > You want to be careful about this: > > CPU: L1 I cache: 16K, L1 D cache: 16K > > Because instantiating a 16k page into user pagetables in > one hit means that it must all be zeroed. With these large > pagesizes that means that the application is likely to get > 100% L1 misses against the new page, whereas it currently > gets 100% hits. > > I'd expect this performance dropoff to occur when going from 8k > to 16k. By the time you get to 32k it would be quite bad. > > One way to address this could be to find a way of making the > pages present, but still cause a fault on first access. Then > have a special-case fastpath in the fault handler to really wipe > the page just before it is used. I don't know how though - maybe > _PAGE_USER? I think taking the page fault itself is the biggest overhead that would be nice to avoid on every second virtually consecutive page, if we've to take the page fault on every page we could as well do the rest of the work that should not that big compared to the overhead of entering/exiting kernel and preparing to handle the fault. > > get_user_pages() would need attention too - you don't want to > allow the user to perform O_DIRECT writes of uninitialised > pages to their files... Andrea ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-07 0:22 ` Andrea Arcangeli @ 2002-12-07 0:35 ` Andrew Morton 0 siblings, 0 replies; 49+ messages in thread From: Andrew Morton @ 2002-12-07 0:35 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: William Lee Irwin III, Norman Gaywood, linux-kernel Andrea Arcangeli wrote: > > > One way to address this could be to find a way of making the > > pages present, but still cause a fault on first access. Then > > have a special-case fastpath in the fault handler to really wipe > > the page just before it is used. I don't know how though - maybe > > _PAGE_USER? > > I think taking the page fault itself is the biggest overhead that would > be nice to avoid on every second virtually consecutive page, if we've to > take the page fault on every page we could as well do the rest of the > work that should not that big compared to the overhead of > entering/exiting kernel and preparing to handle the fault. Yes, 8k at a time would probably be OK. Say, L1-size/2. I expect that anything bigger would cause 2x or worse slowdowns of a range of apps. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-07 0:01 ` Andrew Morton 2002-12-07 0:21 ` William Lee Irwin III 2002-12-07 0:22 ` Andrea Arcangeli @ 2002-12-07 0:46 ` William Lee Irwin III 2002-12-07 10:55 ` Arjan van de Ven 3 siblings, 0 replies; 49+ messages in thread From: William Lee Irwin III @ 2002-12-07 0:46 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, Norman Gaywood, linux-kernel On Fri, Dec 06, 2002 at 04:01:24PM -0800, Andrew Morton wrote: > One way to address this could be to find a way of making the > pages present, but still cause a fault on first access. Then > have a special-case fastpath in the fault handler to really wipe > the page just before it is used. I don't know how though - maybe > _PAGE_USER? All of the problems there have to do with accounting which pieces of the page are zeroed. The PTE's map the same size areas (MMUPAGE_SIZE stays 4KB)... So after a partial zero we end up with a struct page pointing at MMUPAGE_COUNT mmupages, and a PTE pointing at the one that's been zeroed and not a whole lot of flag bits left to keep track of which pieces are initialized. How about a single PG_zero flag and map out which bits of the thing are already zeroed in page->private? (basically the swapcache can be considered the owning fs and it then then uses page->private for those shenanigans). On Fri, Dec 06, 2002 at 04:01:24PM -0800, Andrew Morton wrote: > get_user_pages() would need attention too - you don't want to > allow the user to perform O_DIRECT writes of uninitialised > pages to their files... Well, I'm not sure how that would happen. fs io should deal with kernel PAGE_SIZE-sized units so we're dealing with anonymous memory only. O_DIRECT if we perform a write would only find the part of the page mapped by a PTE, which must have been pre-zeroed prior to being mapped. Reads seem to be in equally good shape. Perhaps it's more of "this is yet another things to audit when dealing with it"; I'll admit that the audit needed for this thing is somewhat large. Bill ^ permalink raw reply [flat|nested] 49+ messages in thread
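A sketch of the bookkeeping Bill proposes, with a single PG_zero flag and a per-mmupage bitmap kept in page->private; the flag number and the helper names are assumptions for illustration:

#define PG_zero 20      /* assumed free bit in page->flags */

/* record that mmupage 'idx' (0 .. MMUPAGE_COUNT-1) of this page is zeroed */
static inline void mark_mmupage_zeroed(struct page *page, int idx)
{
        set_bit(PG_zero, &page->flags);
        page->private |= 1UL << idx;
}

static inline int mmupage_is_zeroed(struct page *page, int idx)
{
        return test_bit(PG_zero, &page->flags) &&
               (page->private & (1UL << idx));
}

With a 64KB PAGE_SIZE and 4KB mmupages that is only 16 bits of page->private, so one word is plenty.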
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-07 0:01 ` Andrew Morton ` (2 preceding siblings ...) 2002-12-07 0:46 ` William Lee Irwin III @ 2002-12-07 10:55 ` Arjan van de Ven 3 siblings, 0 replies; 49+ messages in thread From: Arjan van de Ven @ 2002-12-07 10:55 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel On Sat, 2002-12-07 at 01:01, Andrew Morton wrote: > William Lee Irwin III wrote: > > > > ... > > A 16KB or 64KB kernel allocation unit would then annihilate > > You want to be careful about this: > > CPU: L1 I cache: 16K, L1 D cache: 16K > > Because instantiating a 16k page into user pagetables in > one hit means that it must all be zeroed. With these large > pagesizes that means that the application is likely to get > 100% L1 misses against the new page, whereas it currently > gets 100% hits. If you really want you can cheat that 100% statistic into something much lower by zeroing the page from back to front (based on the exact faulting address even, because you know THAT one will get used) and/or zeroing the second half while bypassing the cache. At least it's 50% hits then ;) Still not 100% and I still agree that the 8Kb number is much nicer for 16Kb L1 cache machines.... ^ permalink raw reply [flat|nested] 49+ messages in thread
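A user-space-flavoured sketch of the trick Arjan describes: zero the parts of the page far from the faulting address first (a real implementation might use non-temporal stores there to bypass the cache), and zero the cacheline that is about to be touched last so it is still hot when the fault returns. The 64-byte line size and the function name are assumptions:

#include <string.h>

static void clear_soft_page_for_fault(char *page, unsigned long page_size,
                                      unsigned long fault_offset)
{
        /* the cacheline the application is about to use */
        unsigned long line = fault_offset & ~63UL;

        /* cold parts first; these could be done with stores that bypass
         * the cache so they never displace anything from L1 */
        memset(page, 0, line);
        memset(page + line + 64, 0, page_size - line - 64);

        /* the faulting line goes last, so it is still in L1 on return */
        memset(page + line, 0, 64);
}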
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 2:15 ` William Lee Irwin III 2002-12-06 2:28 ` Andrea Arcangeli @ 2002-12-06 10:36 ` Arjan van de Ven 2002-12-06 14:23 ` William Lee Irwin III 1 sibling, 1 reply; 49+ messages in thread From: Arjan van de Ven @ 2002-12-06 10:36 UTC (permalink / raw) To: William Lee Irwin III Cc: Andrea Arcangeli, Andrew Morton, Norman Gaywood, linux-kernel > 64GB isn't getting any testing that I know of; I'd hold off until > someone's actually stood up and confessed to attempting to boot > Linux on such a beast. Or until I get some more RAM. =) United Linux at least has tested this according to http://www.unitedlinux.com/en/press/pr111902.html Hardware functionality is exploited through advanced features such as large memory support for up to 64 GB of RAM so I'm sure Andrea's VM deals with it gracefully ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 10:36 ` Arjan van de Ven @ 2002-12-06 14:23 ` William Lee Irwin III 2002-12-06 15:12 ` William Lee Irwin III 0 siblings, 1 reply; 49+ messages in thread From: William Lee Irwin III @ 2002-12-06 14:23 UTC (permalink / raw) To: Arjan van de Ven Cc: Andrea Arcangeli, Andrew Morton, Norman Gaywood, linux-kernel At some point in the past, I wrote: >> 64GB isn't getting any testing that I know of; I'd hold off until >> someone's actually stood up and confessed to attempting to boot >> Linux on such a beast. Or until I get some more RAM. =) On Fri, Dec 06, 2002 at 11:36:15AM +0100, Arjan van de Ven wrote: > United Linux at least has tested this according to > http://www.unitedlinux.com/en/press/pr111902.html > Hardware functionality is exploited through advanced features such as > large memory support for up to 64 GB of RAM > so I'm sure Andrea's VM deals with it gracefully I'm not convinced of grace, even if I were to take it from this that it was directly tested, which seems doubtful given the nature of the page. This page sounds more like it's saying CONFIG_HIGHMEM64G is an option. And besides, the report is useless unless it's got actual technical content and descriptions reported by a kernel hacker. Bill ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 14:23 ` William Lee Irwin III @ 2002-12-06 15:12 ` William Lee Irwin III 2002-12-06 22:34 ` Andrea Arcangeli 0 siblings, 1 reply; 49+ messages in thread From: William Lee Irwin III @ 2002-12-06 15:12 UTC (permalink / raw) To: Arjan van de Ven, Andrea Arcangeli, Andrew Morton, Norman Gaywood, linux-kernel On Fri, Dec 06, 2002 at 11:36:15AM +0100, Arjan van de Ven wrote: >> United Linux at least has tested this according to >> http://www.unitedlinux.com/en/press/pr111902.html >> Hardware functionality is exploited through advanced features such as >> large memory support for up to 64 GB of RAM >> so I'm sure Andrea's VM deals with it gracefully On Fri, Dec 06, 2002 at 06:23:02AM -0800, William Lee Irwin III wrote: > I'm not convinced of grace even if I were to take it from this that it > were directly tested, which seems doubtful given the nature of the page. > This page sounds more like CONFIG_HIGHMEM64G is an option. > And besides, the report is useless unless it's got actual technical > content and descriptions reported by an kernel hacker. Well, since I've not seen recent attempts at the Right Way To Do It (TM), there's also a remote possibility of someone changing the user/kernel split just to get a bloated mem_map to fit. Many of the smaller apps, e.g. /bin/sh etc. are indifferent to the ABI violation. Bill ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 15:12 ` William Lee Irwin III @ 2002-12-06 22:34 ` Andrea Arcangeli 2002-12-07 18:27 ` Eric W. Biederman 0 siblings, 1 reply; 49+ messages in thread From: Andrea Arcangeli @ 2002-12-06 22:34 UTC (permalink / raw) To: William Lee Irwin III, Arjan van de Ven, Andrew Morton, Norman Gaywood, linux-kernel On Fri, Dec 06, 2002 at 07:12:38AM -0800, William Lee Irwin III wrote: > split just to get a bloated mem_map to fit. Many of the smaller apps, > e.g. /bin/sh etc. are indifferent to the ABI violation. the problem of the split is that it would reduce the address space available to userspace that is quite critical on big machines (one of the big advantages of 64bit that can't be fixed on 32bit) but I wouldn't classify it as an ABI violation, infact the little I can remember about the 2.0 kernels [I almost never read that code] is that it had shared address space and tlb flush while entering/exiting kernel, so I can bet the user stack in 2.0 was put at 4G, not at 3G. 2.2 had to put it at 3G because then the address space was shared with the obvious performance advantages, so while I didn't read any ABI, I deduce you can't say the ABI got broken if the stack is put at 2G or 1G or 3.5G or 4G again with x86-64 (of course x86-64 can give the full 4G to userspace because the kernel runs in the negative part of the [64bit] address space, as 2.0 could too). Andrea ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 22:34 ` Andrea Arcangeli @ 2002-12-07 18:27 ` Eric W. Biederman 0 siblings, 0 replies; 49+ messages in thread From: Eric W. Biederman @ 2002-12-07 18:27 UTC (permalink / raw) To: Andrea Arcangeli Cc: William Lee Irwin III, Arjan van de Ven, Andrew Morton, Norman Gaywood, linux-kernel Andrea Arcangeli <andrea@suse.de> writes: > On Fri, Dec 06, 2002 at 07:12:38AM -0800, William Lee Irwin III wrote: > > split just to get a bloated mem_map to fit. Many of the smaller apps, > > e.g. /bin/sh etc. are indifferent to the ABI violation. > > the problem of the split is that it would reduce the address space > available to userspace that is quite critical on big machines (one of > the big advantages of 64bit that can't be fixed on 32bit) but I wouldn't > classify it as an ABI violation, infact the little I can remember about > the 2.0 kernels [I almost never read that code] is that it had shared > address space and tlb flush while entering/exiting kernel, so I can bet > the user stack in 2.0 was put at 4G, not at 3G. 2.2 had to put it at 3G > because then the address space was shared with the obvious performance > advantages, so while I didn't read any ABI, I deduce you can't say the > ABI got broken if the stack is put at 2G or 1G or 3.5G or 4G again with > x86-64 (of course x86-64 can give the full 4G to userspace because the > kernel runs in the negative part of the [64bit] address space, as 2.0 > could too). As I remember it, 2.0 used the 3/1 split; the difference was that segments had different base register values, so the kernel thought it was running at 0. %fs, which retained a base address of 0, was used when access to user space was desired. Eric ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 0:13 Maybe a VM bug in 2.4.18-18 from RH 8.0? Norman Gaywood 2002-12-06 1:00 ` Andrew Morton @ 2002-12-06 1:08 ` Andrea Arcangeli 1 sibling, 0 replies; 49+ messages in thread From: Andrea Arcangeli @ 2002-12-06 1:08 UTC (permalink / raw) To: Norman Gaywood; +Cc: linux-kernel On Fri, Dec 06, 2002 at 11:13:26AM +1100, Norman Gaywood wrote: > I think I have a trigger for a VM bug in the RH kernel-bigmem-2.4.18-18 > > The system is a 4 processor, 16GB memory Dell PE6600 running RH8.0 + > errata. More details at the end of this message. Thanks to lots of feedback from users in the last months, I fixed all known VM bugs to date that can be reproduced on those big machines. They're all included in my tree and in the current UL/SuSE releases. Over time I should have posted all of them to the kernel list in one way or another. The most critical ones are now pending for merging in 2.4.21pre. So in the meantime you want to try to reproduce on top of 2.4.20aa1 or the UL kernel and (unless your problem is a tape driver bug ;) I'm pretty sure it will fix the problems on your big machine. http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1.gz http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/ Hope this helps, Andrea ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <mailman.1039133948.27411.linux-kernel2news@redhat.com>]
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? [not found] <mailman.1039133948.27411.linux-kernel2news@redhat.com> @ 2002-12-06 0:35 ` Pete Zaitcev 2002-12-06 1:27 ` Norman Gaywood 0 siblings, 1 reply; 49+ messages in thread From: Pete Zaitcev @ 2002-12-06 0:35 UTC (permalink / raw) To: Norman Gaywood; +Cc: linux-kernel > I think I have a trigger for a VM bug in the RH kernel-bigmem-2.4.18-18 > By doing a large copy I can trigger this problem in about 30-40 minutes. At > the end of that time, kswapd will start to get a larger % of CPU and > the system load will be around 2-3. The system will feel sluggish at an > interactive shell and it will take several seconds before a command like > top would start to display. [...] Check your /proc/slabinfo, just in case, to rule out a leak. > cat /proc/meminfo > total: used: free: shared: buffers: cached: > Mem: 16671522816 444915712 16226607104 0 136830976 56520704 > Swap: 34365202432 0 34365202432 > MemTotal: 16280784 kB > MemFree: 15846296 kB > MemShared: 0 kB > Buffers: 133624 kB > Cached: 55196 kB > SwapCached: 0 kB > Active: 249984 kB > Inact_dirty: 18088 kB > Inact_clean: 480 kB > Inact_target: 53708 kB > HighTotal: 15597504 kB > HighFree: 15434932 kB > LowTotal: 683280 kB > LowFree: 411364 kB > SwapTotal: 33559768 kB > SwapFree: 33559768 kB > Committed_AS: 177044 kB This is not interesting. Get it _after_ the box becomes sluggish. Remember, the 2.4.18 stream in RH does not have its own VM, distinct from Marcelo+Riel. So, you can come to linux-kernel for advice, but first, get it all reproduced with Marcelo's tree with Riel's patches all the same. -- Pete ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 0:35 ` Pete Zaitcev @ 2002-12-06 1:27 ` Norman Gaywood 2002-12-06 12:48 ` Rik van Riel 0 siblings, 1 reply; 49+ messages in thread From: Norman Gaywood @ 2002-12-06 1:27 UTC (permalink / raw) To: Pete Zaitcev; +Cc: linux-kernel On Thu, Dec 05, 2002 at 07:35:49PM -0500, Pete Zaitcev wrote: > > I think I have a trigger for a VM bug in the RH kernel-bigmem-2.4.18-18 > > > By doing a large copy I can trigger this problem in about 30-40 minutes. At > > the end of that time, kswapd will start to get a larger % of CPU and > > the system load will be around 2-3. The system will feel sluggish at an > > interactive shell and it will take several seconds before a command like > > top would start to display. [...] > > Check your /proc/slabinfo, just in case, to rule out a leak. Here is a /proc/slabinfo diff of a good system and a very sluggish one: 1c1 < Mon Nov 25 17:13:04 EST 2002 --- > Mon Nov 25 22:35:58 EST 2002 6c6 < nfs_inode_cache 6 6 640 1 1 1 : 124 62 --- > nfs_inode_cache 1 6 640 1 1 1 : 124 62 8,11c8,11 < ip_fib_hash 224 224 32 2 2 1 : 252 126 < journal_head 3101 36113 48 69 469 1 : 252 126 < revoke_table 250 250 12 1 1 1 : 252 126 < revoke_record 672 672 32 6 6 1 : 252 126 --- > ip_fib_hash 10 224 32 2 2 1 : 252 126 > journal_head 12 154 48 2 2 1 : 252 126 > revoke_table 7 250 12 1 1 1 : 252 126 > revoke_record 0 0 32 0 0 1 : 252 126 14,20c14,20 < tcp_tw_bucket 210 210 128 7 7 1 : 252 126 < tcp_bind_bucket 896 896 32 8 8 1 : 252 126 < tcp_open_request 180 180 128 6 6 1 : 252 126 < inet_peer_cache 0 0 64 0 0 1 : 252 126 < ip_dst_cache 105 105 256 7 7 1 : 252 126 < arp_cache 90 90 128 3 3 1 : 252 126 < blkdev_requests 16548 17430 128 561 581 1 : 252 126 --- > tcp_tw_bucket 0 0 128 0 0 1 : 252 126 > tcp_bind_bucket 28 784 32 7 7 1 : 252 126 > tcp_open_request 0 0 128 0 0 1 : 252 126 > inet_peer_cache 1 58 64 1 1 1 : 252 126 > ip_dst_cache 40 105 256 7 7 1 : 252 126 > arp_cache 4 90 128 3 3 1 : 252 126 > blkdev_requests 16384 16410 128 547 547 1 : 252 126 22c22 < file_lock_cache 328 328 92 8 8 1 : 252 126 --- > file_lock_cache 3 82 92 2 2 1 : 252 126 24,27c24,27 < uid_cache 672 672 32 6 6 1 : 252 126 < skbuff_head_cache 1107 2745 256 77 183 1 : 252 126 < sock 270 270 1280 90 90 1 : 60 30 < sigqueue 870 870 132 30 30 1 : 252 126 --- > uid_cache 9 448 32 4 4 1 : 252 126 > skbuff_head_cache 816 1110 256 74 74 1 : 252 126 > sock 81 129 1280 43 43 1 : 60 30 > sigqueue 29 29 132 1 1 1 : 252 126 29,33c29,33 < cdev_cache 498 2262 64 12 39 1 : 252 126 < bdev_cache 290 290 64 5 5 1 : 252 126 < mnt_cache 232 232 64 4 4 1 : 252 126 < inode_cache 543337 553490 512 79070 79070 1 : 124 62 < dentry_cache 373336 554430 128 18481 18481 1 : 252 126 --- > cdev_cache 16 290 64 5 5 1 : 252 126 > bdev_cache 27 174 64 3 3 1 : 252 126 > mnt_cache 19 174 64 3 3 1 : 252 126 > inode_cache 305071 305081 512 43583 43583 1 : 124 62 > dentry_cache 418 2430 128 81 81 1 : 252 126 35,43c35,43 < filp 930 930 128 31 31 1 : 252 126 < names_cache 48 48 4096 48 48 1 : 60 30 < buffer_head 831810 831810 128 27727 27727 1 : 252 126 < mm_struct 510 510 256 34 34 1 : 252 126 < vm_area_struct 4488 4740 128 158 158 1 : 252 126 < fs_cache 696 696 64 12 12 1 : 252 126 < files_cache 469 469 512 67 67 1 : 124 62 < signal_act 388 418 1408 38 38 4 : 60 30 < pae_pgd 696 696 64 12 12 1 : 252 126 --- > filp 1041 1230 128 41 41 1 : 252 126 > names_cache 7 8 4096 7 8 1 : 60 30 > buffer_head 3431966 3432150 128 114405 114405 1 : 252 126 > mm_struct 198 315 256 21 21 1 : 252 126 > 
vm_area_struct 5905 5970 128 199 199 1 : 252 126 > fs_cache 204 464 64 8 8 1 : 252 126 > files_cache 204 217 512 31 31 1 : 124 62 > signal_act 246 286 1408 26 26 4 : 60 30 > pae_pgd 198 638 64 11 11 1 : 252 126 51c51 < size-16384 16 24 16384 16 24 4 : 0 0 --- > size-16384 20 20 16384 20 20 4 : 0 0 53c53 < size-8192 5 11 8192 5 11 2 : 0 0 --- > size-8192 9 9 8192 9 9 2 : 0 0 55c55 < size-4096 287 407 4096 287 407 1 : 60 30 --- > size-4096 56 56 4096 56 56 1 : 60 30 57c57 < size-2048 426 666 2048 213 333 1 : 60 30 --- > size-2048 281 314 2048 157 157 1 : 60 30 59c59 < size-1024 1024 1272 1024 256 318 1 : 124 62 --- > size-1024 659 712 1024 178 178 1 : 124 62 61c61 < size-512 3398 3584 512 445 448 1 : 124 62 --- > size-512 2782 2856 512 357 357 1 : 124 62 63c63 < size-256 777 1155 256 67 77 1 : 252 126 --- > size-256 101 255 256 17 17 1 : 252 126 65c65 < size-128 4836 19200 128 244 640 1 : 252 126 --- > size-128 2757 3750 128 125 125 1 : 252 126 67c67 < size-64 8958 20550 128 356 685 1 : 252 126 --- > size-64 178 510 128 17 17 1 : 252 126 69c69 < size-32 23262 43674 64 433 753 1 : 252 126 --- > size-32 711 1218 64 21 21 1 : 252 126 > > cat /proc/meminfo > This is not interesting. Get it _after_ the box becomes sluggish. I don't have one of those, but here is a top of a sluggish system: 3:51pm up 43 min, 3 users, load average: 1.69, 1.28, 0.92 109 processes: 108 sleeping, 1 running, 0 zombie, 0 stopped CPU0 states: 0.0% user, 0.3% system, 0.0% nice, 99.2% idle CPU1 states: 0.0% user, 0.0% system, 0.0% nice, 100.0% idle CPU2 states: 0.0% user, 0.0% system, 0.0% nice, 100.0% idle CPU3 states: 0.0% user, 1.4% system, 0.0% nice, 98.0% idle CPU4 states: 0.0% user, 58.2% system, 0.0% nice, 41.2% idle CPU5 states: 0.0% user, 96.4% system, 0.0% nice, 3.0% idle CPU6 states: 0.0% user, 0.5% system, 0.0% nice, 99.0% idle CPU7 states: 0.0% user, 0.3% system, 0.0% nice, 99.2% idle Mem: 16280784K av, 15747124K used, 533660K free, 0K shrd, 20952K buff Swap: 33559768K av, 0K used, 33559768K free 15037240K cached PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND 19 root 25 0 0 0 0 SW 96.7 0.0 1:52 kswapd 1173 root 21 0 10592 10M 424 D 58.2 0.0 3:30 cp 202 root 15 0 0 0 0 DW 1.9 0.0 0:04 kjournald 205 root 15 0 0 0 0 DW 0.9 0.0 0:10 kjournald 21 root 15 0 0 0 0 SW 0.5 0.0 0:01 kupdated 1121 root 16 0 1056 1056 836 R 0.5 0.0 0:09 top 1 root 15 0 476 476 424 S 0.0 0.0 0:04 init 2 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU0 3 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU1 4 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU2 5 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU3 6 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU4 7 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU5 8 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU6 9 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU7 10 root 15 0 0 0 0 SW 0.0 0.0 0:00 keventd > Remember, the 2.4.18 stream in RH does not have its own VM, distinct > from Marcelo+Riel. So, you can come to linux-kernel for advice, > but first, get it all reproduced with Marcelo's tree with > Riel's patches all the same. Yep, I understand that. I just thought this might be of interest however. It's pretty hard to find a place to talk about this problem with someone who might know something! I've got a service request in with RH but no answer yet, but it's only been 1.5 days. While I've been writing this it looks like Andrew Morton and Andrea Arcangeli have given me some great answers and have declared this a "well known problem". Looks like I've got something to try. 
-- Norman Gaywood -- School of Mathematical and Computer Sciences University of New England, Armidale, NSW 2351, Australia norm@turing.une.edu.au http://turing.une.edu.au/~norm Phone: +61 2 6773 2412 Fax: +61 2 6773 3312 ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0? 2002-12-06 1:27 ` Norman Gaywood @ 2002-12-06 12:48 ` Rik van Riel 0 siblings, 0 replies; 49+ messages in thread From: Rik van Riel @ 2002-12-06 12:48 UTC (permalink / raw) To: Norman Gaywood; +Cc: Pete Zaitcev, linux-kernel On Fri, 6 Dec 2002, Norman Gaywood wrote: > On Thu, Dec 05, 2002 at 07:35:49PM -0500, Pete Zaitcev wrote: > > Check your /proc/slabinfo, just in case, to rule out a leak. > > Here is a /proc/slabinfo diff of a good system and a very sluggish one: > > inode_cache 305071 305081 512 43583 43583 1 : 124 62 > > buffer_head 3431966 3432150 128 114405 114405 1 : 252 126 Guess what ? 120 MB in inode cache and 450 MB in buffer heads, or 570 MB of zone_normal eaten with just these two items. Looks like the RH kernel needs Stephen Tweedie's patch to reclaim the buffer heads once IO is done ;) regards, Rik -- A: No. Q: Should I include quotations after my reply? http://www.surriel.com/ http://guru.conectiva.com/ ^ permalink raw reply [flat|nested] 49+ messages in thread
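Rik's figures come straight out of the slabinfo columns quoted earlier (roughly, active objects times object size); a back-of-the-envelope sketch of the arithmetic:

/* From the sluggish system's slabinfo:
 *   buffer_head: 3431966 objects * 128 bytes ~= 440MB
 *   inode_cache:  305071 objects * 512 bytes ~= 156MB
 * which lands in the same ballpark as the ~570MB of zone_normal Rik
 * quotes, against the ~683MB of LowTotal in the meminfo above. */
static unsigned long slab_bytes(unsigned long active_objs,
                                unsigned long obj_size)
{
        return active_objs * obj_size;
}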
Thread overview: 49+ messages
2002-12-06 0:13 Maybe a VM bug in 2.4.18-18 from RH 8.0? Norman Gaywood
2002-12-06 1:00 ` Andrew Morton
2002-12-06 1:17 ` Andrea Arcangeli
2002-12-06 1:34 ` Andrew Morton
2002-12-06 1:44 ` Andrea Arcangeli
2002-12-06 2:15 ` William Lee Irwin III
2002-12-06 2:28 ` Andrea Arcangeli
2002-12-06 2:41 ` William Lee Irwin III
2002-12-06 5:25 ` Andrew Morton
2002-12-06 5:48 ` Andrea Arcangeli
2002-12-06 6:14 ` William Lee Irwin III
2002-12-06 6:55 ` Andrew Morton
2002-12-06 7:14 ` GrandMasterLee
2002-12-06 7:25 ` Andrew Morton
2002-12-06 7:34 ` GrandMasterLee
2002-12-06 7:51 ` Andrew Morton
2002-12-06 11:37 ` Christoph Hellwig
2002-12-06 16:19 ` GrandMasterLee
2002-12-06 14:57 ` Andrea Arcangeli
2002-12-06 15:12 ` William Lee Irwin III
2002-12-06 23:32 ` Andrea Arcangeli
2002-12-06 23:45 ` William Lee Irwin III
2002-12-06 23:57 ` Andrea Arcangeli
2002-12-06 6:00 ` William Lee Irwin III
2002-12-06 22:28 ` Andrea Arcangeli
2002-12-06 23:21 ` William Lee Irwin III
2002-12-06 23:50 ` Andrea Arcangeli
2002-12-07 0:30 ` William Lee Irwin III
2002-12-07 0:01 ` Andrew Morton
2002-12-07 0:21 ` William Lee Irwin III
2002-12-07 0:30 ` Andrew Morton
2002-12-07 2:19 ` Alan Cox
2002-12-07 1:46 ` William Lee Irwin III
2002-12-07 1:56 ` Andrea Arcangeli
2002-12-07 2:31 ` Alan Cox
2002-12-07 2:09 ` William Lee Irwin III
2002-12-07 0:22 ` Andrea Arcangeli
2002-12-07 0:35 ` Andrew Morton
2002-12-07 0:46 ` William Lee Irwin III
2002-12-07 10:55 ` Arjan van de Ven
2002-12-06 10:36 ` Arjan van de Ven
2002-12-06 14:23 ` William Lee Irwin III
2002-12-06 15:12 ` William Lee Irwin III
2002-12-06 22:34 ` Andrea Arcangeli
2002-12-07 18:27 ` Eric W. Biederman
2002-12-06 1:08 ` Andrea Arcangeli
[not found] <mailman.1039133948.27411.linux-kernel2news@redhat.com>
2002-12-06 0:35 ` Pete Zaitcev
2002-12-06 1:27 ` Norman Gaywood
2002-12-06 12:48 ` Rik van Riel