* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-19 23:52 ` Linus Torvalds
@ 2001-11-20 0:18 ` M. Edward (Ed) Borasky
2001-11-20 0:25 ` Ken Brownfield
2001-11-20 3:09 ` Ken Brownfield
2 siblings, 0 replies; 20+ messages in thread
From: M. Edward (Ed) Borasky @ 2001-11-20 0:18 UTC (permalink / raw)
To: linux-kernel
On a related note, the files "/usr/src/linux/Documentation/filesystems/proc.txt"
and "sysctl/vm.txt" refer to some variables I need to be able to set on a
system running 2.4.12. In particular, I need to be able to get to the values
in "/proc/sys/vm/freepages", "/proc/sys/vm/buffermem" and
"/proc/sys/vm/pagecache". However, despite their existence in the documentation
files, these files don't exist on a 2.4.12 system. How can I read and set these
values on a 2.4.12 system?
--
znmeb@aracnet.com (M. Edward Borasky) http://www.aracnet.com/~znmeb
Relax! Run Your Own Brain with Neuro-Semantics!
http://www.meta-trading-coach.com
"Outside of a dog, a book is a man's best friend. Inside a dog, it's
too dark to read." -- Marx
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-19 23:52 ` Linus Torvalds
2001-11-20 0:18 ` M. Edward (Ed) Borasky
@ 2001-11-20 0:25 ` Ken Brownfield
2001-11-20 0:31 ` Linus Torvalds
2001-11-20 3:09 ` Ken Brownfield
2 siblings, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-11-20 0:25 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, Andrea Arcangeli
On Mon, Nov 19, 2001 at 03:52:44PM -0800, Linus Torvalds wrote:
|
| On Mon, 19 Nov 2001, Ken Brownfield wrote:
| >
| > I went straight to the aa patch, and it looks like it either fixes the
| > problem or (because of the side-effects Linus mentioned) otherwise
| > prevents the issue:
|
| So is this pre6aa1, or pre6 + just the watermark patch?
I'm currently using -pre6 with his separately-posted zone-watermark-1
patch. Sorry, I should have been clearer.
| > The machine went into swap immediately when the page cache stopped
| > growing and hovered at 100-400MB. Also, in my experience the page cache
| > will grow until there's only 5ishMB of free RAM, but with the aa patch
| > it looks like it stops at 320MB or maybe 10% of RAM. Was that the aa
| > patch, or part of -pre6?
|
| That was the watermarking. The way Andrea did it, the page cache will
| basically refuse to touch as much of the "normal" page zone, because it
| would prefer to allocate more from highmem..
|
| I think it's excessive to have 320MB free memory, though, that's just
| an insane waste. I suspect that the real number should be somewhere
| between the old behaviour and the new one. You can tweak the behaviour of
| andrea's kernel by changing the "reserved" page numbers, but I'd like to
| hear whether my simpler approach works too..
Yeah, maybe a tiered default would be best, IMHO. 5MB on a 3GB box
does, on the other hand, seem anemic.
| > The Oracle SGA is set to ~522MB, with nothing else running except a
| > couple of sshds, getty, etc. Now that I'm looking, 2.8GB page cache
| > plus 328MB free adds up to about 3.1GB of RAM -- where does the 512MB
| > shared memory segment fit? Is it being swapped out in deference to page
| > cache?
|
| Shared memory actually uses the page cache too, so it will be accounted
| for in the 2.8GB number.
My bad, should have realized.
| Anyway, can you try plain vanilla pre6, with the appended patch? This is
| my suggested simplified version of what Andrea tried to do, and it should
| try to keep only a few extra megs of memory free in the low memory
| regions, not 300+ MB.
|
| (and the profiling would be interesting regardless, but I think Andrea did
| find the real problem, his fix just seems a bit of an overkill ;)
|
| Linus
I'll try this patch ASAP.
Thanks a LOT to all involved,
--
Ken.
brownfld@irridia.com
| diff -u --recursive --new-file pre6/linux/mm/page_alloc.c linux/mm/page_alloc.c
| --- pre6/linux/mm/page_alloc.c Sat Nov 17 19:07:43 2001
| +++ linux/mm/page_alloc.c Mon Nov 19 15:13:36 2001
| @@ -299,29 +299,26 @@
| return page;
| }
|
| -static inline unsigned long zone_free_pages(zone_t * zone, unsigned int order)
| -{
| - long free = zone->free_pages - (1UL << order);
| - return free >= 0 ? free : 0;
| -}
| -
| /*
| * This is the 'heart' of the zoned buddy allocator:
| */
| struct page * __alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist)
| {
| + unsigned long min;
| zone_t **zone, * classzone;
| struct page * page;
| int freed;
|
| zone = zonelist->zones;
| classzone = *zone;
| + min = 1UL << order;
| for (;;) {
| zone_t *z = *(zone++);
| if (!z)
| break;
|
| - if (zone_free_pages(z, order) > z->pages_low) {
| + min += z->pages_low;
| + if (z->free_pages > min) {
| page = rmqueue(z, order);
| if (page)
| return page;
| @@ -334,16 +331,18 @@
| wake_up_interruptible(&kswapd_wait);
|
| zone = zonelist->zones;
| + min = 1UL << order;
| for (;;) {
| - unsigned long min;
| + unsigned long local_min;
| zone_t *z = *(zone++);
| if (!z)
| break;
|
| - min = z->pages_min;
| + local_min = z->pages_min;
| if (!(gfp_mask & __GFP_WAIT))
| - min >>= 2;
| - if (zone_free_pages(z, order) > min) {
| + local_min >>= 2;
| + min += local_min;
| + if (z->free_pages > min) {
| page = rmqueue(z, order);
| if (page)
| return page;
| @@ -376,12 +375,14 @@
| return page;
|
| zone = zonelist->zones;
| + min = 1UL << order;
| for (;;) {
| zone_t *z = *(zone++);
| if (!z)
| break;
|
| - if (zone_free_pages(z, order) > z->pages_min) {
| + min += z->pages_min;
| + if (z->free_pages > min) {
| page = rmqueue(z, order);
| if (page)
| return page;
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-20 0:25 ` Ken Brownfield
@ 2001-11-20 0:31 ` Linus Torvalds
0 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2001-11-20 0:31 UTC (permalink / raw)
To: Ken Brownfield; +Cc: linux-kernel, Andrea Arcangeli
On Mon, 19 Nov 2001, Ken Brownfield wrote:
> |
> | So is this pre6aa1, or pre6 + just the watermark patch?
>
> I'm currently using -pre6 with his separately-posted zone-watermark-1
> patch. Sorry, I should have been clearer.
Good. That removes the other variables from the equation, ie it's not an
effect of some of the other tweaking in the -aa patches.
> Yeah, maybe a tiered default would be best, IMHO. 5MB on a 3GB box
> does, on the other hand, seem anemic.
Yeah, the 5MB _is_ anemic. It comes from the fact that we decide to never
bother having more than zone_balance_max[] pages free, even if we have
tons of memory. And zone_balance_max[] is fairly small, it limits us to
255 free pages per zone (for pages_min - with "pages_low" being twice that).
So you get 3 zones, with 255*2 pages free max each, except the DMA zone
has much less just because it's smaller. Thus 5MB.
There's no real reason for having zone_balance_max[] at all - without it
we'd just always try to keep about 1/128th of memory free, which would be
about 24MB on a 3GB box. Which is probably not a bad idea.
With my "simplified-Andrea" patch, you should see slightly more than 5MB
free, but not a lot more. A HIGHMEM allocation now wants to leave an
"extra" 510 pages in NORMAL, and even more in the DMA zone, so you should
see something like maybe 12-15 MB free instead of 300MB.
(Wild hand-waving number, I'm too lazy to actually do the math, and I
haven't even tested that the simple patch works at all - I think I forgot
to mention that small detail ;)
Linus
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-19 23:52 ` Linus Torvalds
2001-11-20 0:18 ` M. Edward (Ed) Borasky
2001-11-20 0:25 ` Ken Brownfield
@ 2001-11-20 3:09 ` Ken Brownfield
2001-11-20 3:30 ` Linus Torvalds
2001-11-20 3:32 ` Andrea Arcangeli
2 siblings, 2 replies; 20+ messages in thread
From: Ken Brownfield @ 2001-11-20 3:09 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, Andrea Arcangeli
Well, I think you'll be pleased to hear that your untested patch
compiled, booted, _and_ fixed the problem. :)
The minimum free RAM was about 9.8-11MB (matching your guestimate) and
kswapd seemed to behave the same as the watermark patch. The results of
top were basically the same, so I'm omitting it.
However, I do have some profiling numbers, thanks to Marcelo. Attached
are numbers from "readprofile | sort -nr +2 | head -20". I think the
pre4 numbers point to shrink_cache, prune_icache, and statm_pgd_range.
The other two might have significance for wizards, but statistically
don't stand out to me, except maybe statm_pgd_range.
I reset the counters just before starting Oracle and the stress test. I
think a -pre7 with a blessed patch would be good, since my testing was
very narrow.
I'll test new kernels as I hear new info.
Thanks much!
--
Ken.
brownfld@irridia.com
2.4.15-pre4 with your original patch:
(shorter time period since the machine went to hell fast)
(matches vanilla behaviour)
164536 default_idle 3164.1538
101562 shrink_cache 113.8587
3683 prune_icache 13.5404
3034 file_read_actor 12.2339
914 DAC960_BA_InterruptHandler 5.5732
1128 statm_pgd_range 2.9072
40 page_cache_release 0.8333
31 add_page_to_hash_queue 0.5167
89 page_cache_read 0.4363
25 remove_inode_page 0.4167
26 unlock_page 0.3095
509 __make_request 0.3008
66 smp_call_function 0.2946
21 set_bh_page 0.2917
9 __brelse 0.2812
90 try_to_free_buffers 0.2778
13 mark_page_accessed 0.2708
8 __free_pages 0.2500
43 get_hash_table 0.2443
42 activate_page 0.2234
2.4.15-pre6 with watermark patch:
1617446 default_idle 31104.7308
27599 DAC960_BA_InterruptHandler 168.2866
38918 file_read_actor 156.9274
528 page_cache_release 11.0000
554 add_page_to_hash_queue 9.2333
15487 __make_request 9.1531
3453 statm_pgd_range 8.8995
514 remove_inode_page 8.5667
1453 blk_init_free_list 7.2650
377 set_bh_page 5.2361
898 page_cache_read 4.4020
590 add_to_page_cache_unique 4.3382
136 __brelse 4.2500
1120 kmem_cache_alloc 3.8356
628 kunmap_high 3.7381
1189 try_to_free_buffers 3.6698
625 get_hash_table 3.5511
439 lru_cache_add 3.4297
1715 rmqueue 3.0194
105 remove_wait_queue 2.9167
2.4.15-pre6 with Linus patch:
1249875 default_idle 24036.0577
65324 file_read_actor 263.4032
36979 DAC960_BA_InterruptHandler 225.4817
9809 statm_pgd_range 25.2809
1039 page_cache_release 21.6458
994 add_page_to_hash_queue 16.5667
922 remove_inode_page 15.3667
2409 blk_init_free_list 12.0450
20159 __make_request 11.9143
1198 lru_cache_add 9.3594
1628 page_cache_read 7.9804
987 add_to_page_cache_unique 7.2574
2202 try_to_free_buffers 6.7963
1038 get_unused_buffer_head 6.6538
484 unlock_page 5.7619
3182 rmqueue 5.6021
874 kunmap_high 5.2024
164 __brelse 5.1250
900 get_hash_table 5.1136
357 set_bh_page 4.9583
On Mon, Nov 19, 2001 at 03:52:44PM -0800, Linus Torvalds wrote:
|
| On Mon, 19 Nov 2001, Ken Brownfield wrote:
| >
| > I went straight to the aa patch, and it looks like it either fixes the
| > problem or (because of the side-effects Linus mentioned) otherwise
| > prevents the issue:
|
| So is this pre6aa1, or pre6 + just the watermark patch?
|
| > The machine went into swap immediately when the page cache stopped
| > growing and hovered at 100-400MB. Also, in my experience the page cache
| > will grow until there's only 5ishMB of free RAM, but with the aa patch
| > it looks like it stops at 320MB or maybe 10% of RAM. Was that the aa
| > patch, or part of -pre6?
|
| That was the watermarking. The way Andrea did it, the page cache will
| basically refuse to touch as much of the "normal" page zone, because it
| would prefer to allocate more from highmem..
|
| I think it's excessive to have 320MB free memory, though, that's just
| an insane waste. I suspect that the real number should be somewhere
| between the old behaviour and the new one. You can tweak the behaviour of
| andrea's kernel by changing the "reserved" page numbers, but I'd like to
| hear whether my simpler approach works too..
|
| > The Oracle SGA is set to ~522MB, with nothing else running except a
| > couple of sshds, getty, etc. Now that I'm looking, 2.8GB page cache
| > plus 328MB free adds up to about 3.1GB of RAM -- where does the 512MB
| > shared memory segment fit? Is it being swapped out in deference to page
| > cache?
|
| Shared memory actually uses the page cache too, so it will be accounted
| for in the 2.8GB number.
|
| Anyway, can you try plain vanilla pre6, with the appended patch? This is
| my suggested simplified version of what Andrea tried to do, and it should
| try to keep only a few extra megs of memory free in the low memory
| regions, not 300+ MB.
|
| (and the profiling would be interesting regardless, but I think Andrea did
| find the real problem, his fix just seems a bit of an overkill ;)
|
| Linus
| diff -u --recursive --new-file pre6/linux/mm/page_alloc.c linux/mm/page_alloc.c
| --- pre6/linux/mm/page_alloc.c Sat Nov 17 19:07:43 2001
| +++ linux/mm/page_alloc.c Mon Nov 19 15:13:36 2001
| @@ -299,29 +299,26 @@
| return page;
| }
|
| -static inline unsigned long zone_free_pages(zone_t * zone, unsigned int order)
| -{
| - long free = zone->free_pages - (1UL << order);
| - return free >= 0 ? free : 0;
| -}
| -
| /*
| * This is the 'heart' of the zoned buddy allocator:
| */
| struct page * __alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist)
| {
| + unsigned long min;
| zone_t **zone, * classzone;
| struct page * page;
| int freed;
|
| zone = zonelist->zones;
| classzone = *zone;
| + min = 1UL << order;
| for (;;) {
| zone_t *z = *(zone++);
| if (!z)
| break;
|
| - if (zone_free_pages(z, order) > z->pages_low) {
| + min += z->pages_low;
| + if (z->free_pages > min) {
| page = rmqueue(z, order);
| if (page)
| return page;
| @@ -334,16 +331,18 @@
| wake_up_interruptible(&kswapd_wait);
|
| zone = zonelist->zones;
| + min = 1UL << order;
| for (;;) {
| - unsigned long min;
| + unsigned long local_min;
| zone_t *z = *(zone++);
| if (!z)
| break;
|
| - min = z->pages_min;
| + local_min = z->pages_min;
| if (!(gfp_mask & __GFP_WAIT))
| - min >>= 2;
| - if (zone_free_pages(z, order) > min) {
| + local_min >>= 2;
| + min += local_min;
| + if (z->free_pages > min) {
| page = rmqueue(z, order);
| if (page)
| return page;
| @@ -376,12 +375,14 @@
| return page;
|
| zone = zonelist->zones;
| + min = 1UL << order;
| for (;;) {
| zone_t *z = *(zone++);
| if (!z)
| break;
|
| - if (zone_free_pages(z, order) > z->pages_min) {
| + min += z->pages_min;
| + if (z->free_pages > min) {
| page = rmqueue(z, order);
| if (page)
| return page;
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-20 3:09 ` Ken Brownfield
@ 2001-11-20 3:30 ` Linus Torvalds
2001-11-20 3:32 ` Andrea Arcangeli
1 sibling, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2001-11-20 3:30 UTC (permalink / raw)
To: Ken Brownfield; +Cc: linux-kernel, Andrea Arcangeli
On Mon, 19 Nov 2001, Ken Brownfield wrote:
>
> Well, I think you'll be pleased to hear that your untested patch
> compiled, booted, _and_ fixed the problem. :)
Good. The patch itself was fairly simple, and the problem was
straightforward; the real credit for the fix goes to Andrea for thinking
about what was wrong with the old code..
> The minimum free RAM was about 9.8-11MB (matching your guestimate) and
> kswapd seemed to behave the same as the watermark patch. The results of
> top were basically the same, so I'm omitting it.
All right. I think 10MB free for a 3GB machine is good - and we can easily
tweak the zone_balance_max[] numbers if somebody comes to the conclusion
that it's better to have more free. It's about .3% of RAM, so it's small
enough that it's certainly not too much, and yet at the same time it's
probably enough to give reasonable behaviour in a temporary memory crunch.
> However, I do have some profiling numbers, thanks to Marcelo. Attached
> are numbers from "readprofile | sort -nr +2 | head -20". I think the
> pre4 numbers point to shrink_cache, prune_icache, and statm_pgd_range.
> The other two might have significance for wizards, but statistically
> don't stand out to me, except maybe statm_pgd_range.
I'd say that this clearly shows that yes, 2.4.14 did the wrong thing, and
wasted time in shrink_cache() without making any real progress. The two
other profiles look reasonable to me - nothing stands out that shouldn't.
(yeah, we spend _much_ too much time doing VM statistics with "top", and
the only way to get rid of that would be to add a per-vma "rss" field.
Which might not be a bad idea, but it's not a high priority for me).
> I reset the counters just before starting Oracle and the stress test. I
> think a -pre7 with a blessed patch would be good, since my testing was
> very narrow.
Sure, I'll do a pre7. This closes my last behaviour issue with the VM,
although I'm sure we'll end up spending tons of time chasing bugs still
(both VM and not).
Linus
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-20 3:09 ` Ken Brownfield
2001-11-20 3:30 ` Linus Torvalds
@ 2001-11-20 3:32 ` Andrea Arcangeli
2001-11-20 5:54 ` Ken Brownfield
1 sibling, 1 reply; 20+ messages in thread
From: Andrea Arcangeli @ 2001-11-20 3:32 UTC (permalink / raw)
To: Ken Brownfield; +Cc: Linus Torvalds, linux-kernel
On Mon, Nov 19, 2001 at 09:09:41PM -0600, Ken Brownfield wrote:
> Well, I think you'll be pleased to hear that your untested patch
> compiled, booted, _and_ fixed the problem. :)
Can you try to run an updatedb constantly in background?
Andrea
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-20 3:32 ` Andrea Arcangeli
@ 2001-11-20 5:54 ` Ken Brownfield
2001-11-20 6:50 ` Linus Torvalds
0 siblings, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-11-20 5:54 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel
kswapd goes up to 5-10% CPU (vs 3-6) but it finishes without issue or
apparent interactivity problems. I'm keeping it in while( 1 ), but it's
been predictable so far.
3-10 is a lot better than 99, but is kswapd really going to eat that
much CPU in an essentially allocation-less state?
But certainly you found the right thing.
Thx all!
--
Ken.
brownfld@irridia.com
On Tue, Nov 20, 2001 at 04:32:23AM +0100, Andrea Arcangeli wrote:
| On Mon, Nov 19, 2001 at 09:09:41PM -0600, Ken Brownfield wrote:
| > Well, I think you'll be pleased to hear that your untested patch
| > compiled, booted, _and_ fixed the problem. :)
|
| Can you try to run an updatedb constantly in background?
|
| Andrea
| -
| To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
| the body of a message to majordomo@vger.kernel.org
| More majordomo info at http://vger.kernel.org/majordomo-info.html
| Please read the FAQ at http://www.tux.org/lkml/
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-20 5:54 ` Ken Brownfield
@ 2001-11-20 6:50 ` Linus Torvalds
2001-12-01 13:15 ` Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?) Ken Brownfield
0 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2001-11-20 6:50 UTC (permalink / raw)
To: linux-kernel
In article <20011119235422.F10597@asooo.flowerfire.com>,
Ken Brownfield <brownfld@irridia.com> wrote:
>kswapd goes up to 5-10% CPU (vs 3-6) but it finishes without issue or
>apparent interactivity problems. I'm keeping it in while( 1 ), but it's
>been predictable so far.
>
>3-10 is a lot better than 99, but is kswapd really going to eat that
>much CPU in an essentially allocation-less state?
Well, it's obviously not allocation-less: updatedb will really hit on
the dcache and icache (which are both in the NORMAL zone only, which is
why Andrea asked for it), and obviously your Oracle load itself seems to
be happily paging stuff around, which causes a lot of allocations for
page-ins.
It only _looks_ static, because once you find the proper "balance", the
VM numbers themselves shouldn't change under a constant load.
We could make kswapd use less CPU time, of course, simply by making the
actual working processes do more of the work to free memory. The total
work ends up being the same, though, and the advantage of kswapd is that
it tends to make the freeing slightly more asynchronous, which helps
throughput.
The _disadvantage_ of kswapd is that if it goes crazy and uses up all
CPU time, you get bad results ;)
But it doesn't sound crazy in your load. I'd be happier if the VM took
less CPU, of course, but for now we seem to be doing ok.
Linus
* Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)
2001-11-20 6:50 ` Linus Torvalds
@ 2001-12-01 13:15 ` Ken Brownfield
2001-12-08 13:12 ` Ken Brownfield
0 siblings, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-12-01 13:15 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
When updatedb kicked off on my 2.4.16 6-way Xeon 4GB box this morning, I
had an unfortunate flashback:
5:02am up 2 days, 1 min, 59 users, load average: 5.66, 4.86, 3.60
741 processes: 723 sleeping, 4 running, 0 zombie, 14 stopped
CPU states: 0.2% user, 77.3% system, 0.0% nice, 22.3% idle
Mem: 3351664K av, 3346504K used, 5160K free, 0K shrd, 498048K buff
Swap: 1052248K av, 282608K used, 769640K free 2531892K cached
PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
2117 root 15 5 580 580 408 R N 0 99.9 0.0 17:19 updatedb
2635 kb 12 0 1696 1556 1216 R 0 99.9 0.0 4:16 smbd
2672 root 17 10 4212 4212 492 D N 0 94.7 0.1 1:39 rsync
2609 root 2 -20 1284 1284 672 R < 0 81.2 0.0 4:02 top
9 root 9 0 0 0 0 SW 0 80.7 0.0 42:50 kswapd
22879 kb 9 0 11548 6316 1684 S 0 11.8 0.1 7:33 smbd
Under varied load I'm not seeing the kswapd issue, but it looks like
updatedb combined with one or two samba transfers does still reproduce
the problem easily, and adding rsync or NFS transfers to the mix makes
kswapd peg at 99%.
I noticed because I was trying to do kernel patches and compiles using a
partition NFS-mounted from this machine. I guess it sometimes pays to
be up at 5am...
Unfortunately it's difficult for me to reboot this machine to update the
kernel (59 users) but I will try to reproduce the problem on a separate
machine this weekend or early next week. And I don't have profiling on,
so that will have to wait as well. :-(
Andrea, do you have a patch vs. 2.4.16 of your original solution to this
problem that I could test out? I'd rather just change one thing at a
time rather than switching completely to an -aa kernel.
Grrrr!
Thanks much,
--
Ken.
brownfld@irridia.com
On Tue, Nov 20, 2001 at 06:50:50AM +0000, Linus Torvalds wrote:
| In article <20011119235422.F10597@asooo.flowerfire.com>,
| Ken Brownfield <brownfld@irridia.com> wrote:
| >kswapd goes up to 5-10% CPU (vs 3-6) but it finishes without issue or
| >apparent interactivity problems. I'm keeping it in while( 1 ), but it's
| >been predictable so far.
| >
| >3-10 is a lot better than 99, but is kswapd really going to eat that
| >much CPU in an essentially allocation-less state?
|
| Well, it's obviously not allocation-less: updatedb will really hit on
| the dcache and icache (which are both in the NORMAL zone only, which is
| why Andrea asked for it), and obviously your Oracle load itself seems to
| be happily paging stuff around, which causes a lot of allocations for
| page-ins.
|
| It only _looks_ static, because once you find the proper "balance", the
| VM numbers themselves shouldn't change under a constant load.
|
| We could make kswapd use less CPU time, of course, simply by making the
| actual working processes do more of the work to free memory. The total
| work ends up being the same, though, and the advantage of kswapd is that
| it tends to make the freeing slightly more asynchronous, which helps
| throughput.
|
| The _disadvantage_ of kswapd is that if it goes crazy and uses up all
| CPU time, you get bad results ;)
|
| But it doesn't sound crazy in your load. I'd be happier if the VM took
| less CPU, of course, but for now we seem to be doing ok.
|
| Linus
* Re: Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)
2001-12-01 13:15 ` Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?) Ken Brownfield
@ 2001-12-08 13:12 ` Ken Brownfield
2001-12-09 18:51 ` Marcelo Tosatti
0 siblings, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-12-08 13:12 UTC (permalink / raw)
To: linux-kernel
Just a quick followup to this, which is still a near show-stopper issue
for me.
This is easy to reproduce for me if I run updatedb locally, and then run
updatedb on a remote machine that's scanning an NFS-mounted filesystem
from the original local machine. Instant kswapd saturation, especially
on large filesystems.
Doing updatedb on NFS-mounted filesystems also seems to cause kswapd to
peg on the NFS-client side as well.
I recently realized that slocate (at least on RH6.2 w/ 2.4 kernels) does
not seem to properly detect NFS when provided "-f nfs"... Urgh.
Also something I noticed in slab_info (other info below):
inode_cache 369188 1027256 480 59716 128407 1 : 124 62
dentry_cache 256380 705510 128 14946 23517 1 : 252 126
buffer_head 46961 47800 96 1195 1195 1 : 252 126
That seems like a TON of {dentry,inode}_cache on a 1GB (HIMEM) machine.
I'd try 10_vm-19 but it doesn't apply cleanly for me.
Thanks for any input or ports of 10_vm-19 to 2.4.17-pre6. ;)
--
Ken.
brownfld@irridia.com
total: used: free: shared: buffers: cached:
Mem: 1054011392 900526080 153485312 0 67829760 174866432
Swap: 2149548032 581632 2148966400
MemTotal: 1029308 kB
MemFree: 149888 kB
MemShared: 0 kB
Buffers: 66240 kB
Cached: 170376 kB
SwapCached: 392 kB
Active: 202008 kB
Inactive: 40380 kB
HighTotal: 131008 kB
HighFree: 30604 kB
LowTotal: 898300 kB
LowFree: 119284 kB
SwapTotal: 2099168 kB
SwapFree: 2098600 kB
Mem: 1029308K av, 886144K used, 143164K free, 0K shrd, 66240K buff
Swap: 2099168K av, 568K used, 2098600K free 170872K cached
On Sat, Dec 01, 2001 at 07:15:02AM -0600, Ken Brownfield wrote:
| When updatedb kicked off on my 2.4.16 6-way Xeon 4GB box this morning, I
| had an unfortunate flashback:
|
| 5:02am up 2 days, 1 min, 59 users, load average: 5.66, 4.86, 3.60
| 741 processes: 723 sleeping, 4 running, 0 zombie, 14 stopped
| CPU states: 0.2% user, 77.3% system, 0.0% nice, 22.3% idle
| Mem: 3351664K av, 3346504K used, 5160K free, 0K shrd, 498048K buff
| Swap: 1052248K av, 282608K used, 769640K free 2531892K cached
|
| PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
| 2117 root 15 5 580 580 408 R N 0 99.9 0.0 17:19 updatedb
| 2635 kb 12 0 1696 1556 1216 R 0 99.9 0.0 4:16 smbd
| 2672 root 17 10 4212 4212 492 D N 0 94.7 0.1 1:39 rsync
| 2609 root 2 -20 1284 1284 672 R < 0 81.2 0.0 4:02 top
| 9 root 9 0 0 0 0 SW 0 80.7 0.0 42:50 kswapd
| 22879 kb 9 0 11548 6316 1684 S 0 11.8 0.1 7:33 smbd
|
| Under varied load I'm not seeing the kswapd issue, but it looks like
| updatedb combined with one or two samba transfers does still reproduce
| the problem easily, and adding rsync or NFS transfers to the mix makes
| kswapd peg at 99%.
|
| I noticed because I was trying to do kernel patches and compiles using a
| partition NFS-mounted from this machine. I guess it sometimes pays to
| be up at 5am...
|
| Unfortunately it's difficult for me to reboot this machine to update the
| kernel (59 users) but I will try to reproduce the problem on a separate
| machine this weekend or early next week. And I don't have profiling on,
| so that will have to wait as well. :-(
|
| Andrea, do you have a patch vs. 2.4.16 of your original solution to this
| problem that I could test out? I'd rather just change one thing at a
| time rather than switching completely to an -aa kernel.
|
| Grrrr!
|
| Thanks much,
| --
| Ken.
| brownfld@irridia.com
|
|
| On Tue, Nov 20, 2001 at 06:50:50AM +0000, Linus Torvalds wrote:
| | In article <20011119235422.F10597@asooo.flowerfire.com>,
| | Ken Brownfield <brownfld@irridia.com> wrote:
| | >kswapd goes up to 5-10% CPU (vs 3-6) but it finishes without issue or
| | >apparent interactivity problems. I'm keeping it in while( 1 ), but it's
| | >been predictable so far.
| | >
| | >3-10 is a lot better than 99, but is kswapd really going to eat that
| | >much CPU in an essentially allocation-less state?
| |
| | Well, it's obviously not allocation-less: updatedb will really hit on
| | the dcache and icache (which are both in the NORMAL zone only, which is
| | why Andrea asked for it), and obviously your Oracle load itself seems to
| | be happily paging stuff around, which causes a lot of allocations for
| | page-ins.
| |
| | It only _looks_ static, because once you find the proper "balance", the
| | VM numbers themselves shouldn't change under a constant load.
| |
| | We could make kswapd use less CPU time, of course, simply by making the
| | actual working processes do more of the work to free memory. The total
| | work ends up being the same, though, and the advantage of kswapd is that
| | it tends to make the freeing slightly more asynchronous, which helps
| | throughput.
| |
| | The _disadvantage_ of kswapd is that if it goes crazy and uses up all
| | CPU time, you get bad results ;)
| |
| | But it doesn't sound crazy in your load. I'd be happier if the VM took
| | less CPU, of course, but for now we seem to be doing ok.
| |
| | Linus
* Re: Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)
2001-12-08 13:12 ` Ken Brownfield
@ 2001-12-09 18:51 ` Marcelo Tosatti
2001-12-10 6:56 ` Ken Brownfield
0 siblings, 1 reply; 20+ messages in thread
From: Marcelo Tosatti @ 2001-12-09 18:51 UTC (permalink / raw)
To: Ken Brownfield; +Cc: linux-kernel
On Sat, 8 Dec 2001, Ken Brownfield wrote:
> Just a quick followup to this, which is still a near show-stopper issue
> for me.
>
> This is easy to reproduce for me if I run updatedb locally, and then run
> updatedb on a remote machine that's scanning an NFS-mounted filesystem
> from the original local machine. Instant kswapd saturation, especially
> on large filesystems.
>
> Doing updatedb on NFS-mounted filesystems also seems to cause kswapd to
> peg on the NFS-client side as well.
Can you reproduce the problem without the over NFS updatedb?
Thanks
* Re: Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)
2001-12-09 18:51 ` Marcelo Tosatti
@ 2001-12-10 6:56 ` Ken Brownfield
0 siblings, 0 replies; 20+ messages in thread
From: Ken Brownfield @ 2001-12-10 6:56 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: linux-kernel
Yes, any kind of fairly heavy, spread-out I/O combined with updatedb
will do the trick, like samba. NFS isn't required, it just seems to be
a particularly good trigger.
It seems like anything that hits the inode/dentry caches hard, actually,
and doesn't always happen when freepages (or its 2.4.x equivalent) has
been hit. I had a little applet that malloc'ed and memcpy'ed 1GB of RAM
and exited, which doesn't really help like it did before 2.4.15-pre[56].
It also happens for me a lot more with my 4GB machines, though I have
seen it on my 1GB HIGHMEM boxes as well. If the problem is related to
scanning the cache, perhaps more RAM simply makes it worse.
I'm planning on trying Andrew Morton's patches as soon as I'm able.
Thanks,
--
Ken.
brownfld@irridia.com
On Sun, Dec 09, 2001 at 04:51:14PM -0200, Marcelo Tosatti wrote:
|
|
| On Sat, 8 Dec 2001, Ken Brownfield wrote:
|
| > Just a quick followup to this, which is still a near show-stopper issue
| > for me.
| >
| > This is easy to reproduce for me if I run updatedb locally, and then run
| > updatedb on a remote machine that's scanning an NFS-mounted filesystem
| > from the original local machine. Instant kswapd saturation, especially
| > on large filesystems.
| >
| > Doing updatedb on NFS-mounted filesystems also seems to cause kswapd to
| > peg on the NFS-client side as well.
|
| Can you reproduce the problem without the over NFS updatedb?
|
| Thanks