* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
[not found] <200111191801.fAJI1l922388@neosilicon.transmeta.com>
@ 2001-11-19 18:07 ` Linus Torvalds
2001-11-19 18:31 ` Ken Brownfield
2001-11-19 19:44 ` Slo Mo Snail
0 siblings, 2 replies; 20+ messages in thread
From: Linus Torvalds @ 2001-11-19 18:07 UTC (permalink / raw)
To: Sebastian Dröge; +Cc: linux-kernel
On Mon, 19 Nov 2001, Sebastian Dröge wrote:
> Hi,
> I couldn't answer earlier because I had some problems with my ISP
> the heavy swapping problem while burning a cd is solved in pre6aa1
> but if you want i can do some statistics tomorrow
Well, pre6aa1 performs really badly exactly because it by default doesn't
swap enough even on _normal_ loads because Andrea is playing with some
tuning (and see the bad results of that tuning in the VM testing by
rwhron@earthlink.net).
So the pre6aa1 numbers are kind of suspect - lack of swapping may not be
due to fixing the problem, but due to bad tuning.
Does plain pre6 solve it? Plain pre6 has a fix where a locked shared
memory area would previously cause unnecessary swapping, and maybe the CD
burning buffer is using shmlock..
Linus
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-19 19:30 ` [VM] 2.4.14/15-pre4 too "swap-happy"? Ken Brownfield
@ 2001-11-19 18:26 ` Marcelo Tosatti
0 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2001-11-19 18:26 UTC (permalink / raw)
To: Ken Brownfield; +Cc: linux-kernel
On Mon, 19 Nov 2001, Ken Brownfield wrote:
> Actually, I spoke too soon. We developed a quick stress test that
> causes the problem immediately:
>
> 11:18am up 3 days, 1:36, 3 users, load average: 8.72, 7.18, 3.96
> 91 processes: 85 sleeping, 6 running, 0 zombie, 0 stopped
> CPU states: 0.1% user, 93.4% system, 0.0% nice, 6.4% idle
> Mem: 3343688K av, 3340784K used, 2904K free, 0K shrd, 308K buff
> Swap: 1004052K av, 567404K used, 436648K free 2994288K cached
>
> PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
> 12102 oracle 13 0 16320 15M 14868 R 5584 67.2 0.4 18:58 oracle
> 12365 oracle 18 5 39352 38M 37796 R N 30M 66.7 1.1 4:14 oracle
> 12353 oracle 18 5 39956 38M 38408 R N 31M 66.5 1.1 9:14 oracle
> 12191 root 13 0 892 852 672 R 0 66.4 0.0 6:09 top
> 12366 oracle 9 0 892 892 672 S 0 60.0 0.0 3:20 top
> 9 root 9 0 0 0 0 SW 0 49.0 0.0 9:27 kswapd
> 11 root 9 0 0 0 0 SW 0 38.3 0.0 3:58 kupdated
> 105 root 9 0 0 0 0 SW 0 28.8 0.0 4:56 kjournald
> 470 root 9 0 844 828 472 S 0 28.1 0.0 1:46 gamdrvd
> 12351 oracle 13 5 39956 38M 38408 S N 31M 25.6 1.1 3:08 oracle
> 669 oracle 9 0 4780 4780 4384 S 492 24.4 0.1 1:42 oracle
> 1 root 14 0 476 424 408 R 0 21.6 0.0 1:19 init
> 2 root 14 0 0 0 0 RW 0 20.8 0.0 1:29 keventd
> 615 oracle 9 0 8984 8984 8460 S 4380 16.3 0.2 2:41 oracle
> 388 root 9 0 732 728 592 S 0 11.5 0.0 0:17 syslogd
>
> kswapd bounces up and down from 99%.
Ken,
Could you please check _where_ kswapd is spending its time?
(You can use kernel profiling and the "readprofile" tool to report which
kernel functions are consuming the most CPU cycles.)
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-19 18:07 ` [VM] 2.4.14/15-pre4 too "swap-happy"? Linus Torvalds
@ 2001-11-19 18:31 ` Ken Brownfield
2001-11-19 19:23 ` Linus Torvalds
2001-11-19 19:30 ` [VM] 2.4.14/15-pre4 too "swap-happy"? Ken Brownfield
2001-11-19 19:44 ` Slo Mo Snail
1 sibling, 2 replies; 20+ messages in thread
From: Ken Brownfield @ 2001-11-19 18:31 UTC (permalink / raw)
To: linux-kernel
Linus, so far 2.4.15-pre4 with your patch does not reproduce the kswapd
issue with Oracle, but I do need to perform more deterministic tests
before I can fully sign off on that.
BTW, didn't your patch go into -pre5? Or is there an additional mod in
-pre6 that we should try?
--
Ken.
brownfld@irridia.com
On Mon, Nov 19, 2001 at 10:07:58AM -0800, Linus Torvalds wrote:
|
| On Mon, 19 Nov 2001, Sebastian Dröge wrote:
| > Hi,
| > I couldn't answer earlier because I had some problems with my ISP
| > the heavy swapping problem while burning a cd is solved in pre6aa1
| > but if you want i can do some statistics tomorrow
|
| Well, pre6aa1 performs really badly exactly because it by default doesn't
| swap enough even on _normal_ loads because Andrea is playing with some
| tuning (and see the bad results of that tuning in the VM testing by
| rwhron@earthlink.net).
|
| So the pre6aa1 numbers are kind of suspect - lack of swapping may not be
| due to fixing the problem, but due to bad tuning.
|
| Does plain pre6 solve it? Plain pre6 has a fix where a locked shared
| memory area would previously cause unnecessary swapping, and maybe the CD
| burning buffer is using shmlock..
|
| Linus
|
| -
| To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
| the body of a message to majordomo@vger.kernel.org
| More majordomo info at http://vger.kernel.org/majordomo-info.html
| Please read the FAQ at http://www.tux.org/lkml/
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-19 18:31 ` Ken Brownfield
@ 2001-11-19 19:23 ` Linus Torvalds
2001-11-19 23:39 ` Ken Brownfield
2001-11-19 19:30 ` [VM] 2.4.14/15-pre4 too "swap-happy"? Ken Brownfield
1 sibling, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2001-11-19 19:23 UTC (permalink / raw)
To: linux-kernel
In article <20011119123125.B1439@asooo.flowerfire.com>,
Ken Brownfield <brownfld@irridia.com> wrote:
>Linus, so far 2.4.15-pre4 with your patch does not reproduce the kswapd
>issue with Oracle, but I do need to perform more deterministic tests
>before I can fully sign off on that.
>
>BTW, didn't your patch go into -pre5? Or is there an additional mod in
>-pre6 that we should try?
You're right, it's probably in pre5 already..
Anyway, it would be interesting to see if the patch by Andrea (I think
he called it "zone-watermarks") that changes the zone allocators to take
other zones into account makes a difference. See separate thread with
the subject line "15pre6aa1 (fixes google VM problem)".
(I think the patch is overly complex as-is, but I think the _ideas_ in
it are fine).
Linus
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-19 18:31 ` Ken Brownfield
2001-11-19 19:23 ` Linus Torvalds
@ 2001-11-19 19:30 ` Ken Brownfield
2001-11-19 18:26 ` Marcelo Tosatti
1 sibling, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-11-19 19:30 UTC (permalink / raw)
To: Ken Brownfield; +Cc: linux-kernel
Actually, I spoke too soon. We developed a quick stress test that
causes the problem immediately:
 11:18am  up 3 days,  1:36,  3 users,  load average: 8.72, 7.18, 3.96
91 processes: 85 sleeping, 6 running, 0 zombie, 0 stopped
CPU states:  0.1% user, 93.4% system,  0.0% nice,  6.4% idle
Mem:  3343688K av, 3340784K used,    2904K free,       0K shrd,     308K buff
Swap: 1004052K av,  567404K used,  436648K free                 2994288K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
12102 oracle    13   0 16320  15M 14868 R    5584 67.2  0.4  18:58 oracle
12365 oracle    18   5 39352  38M 37796 R N   30M 66.7  1.1   4:14 oracle
12353 oracle    18   5 39956  38M 38408 R N   31M 66.5  1.1   9:14 oracle
12191 root      13   0   892  852   672 R       0 66.4  0.0   6:09 top
12366 oracle     9   0   892  892   672 S       0 60.0  0.0   3:20 top
    9 root       9   0     0    0     0 SW      0 49.0  0.0   9:27 kswapd
   11 root       9   0     0    0     0 SW      0 38.3  0.0   3:58 kupdated
  105 root       9   0     0    0     0 SW      0 28.8  0.0   4:56 kjournald
  470 root       9   0   844  828   472 S       0 28.1  0.0   1:46 gamdrvd
12351 oracle    13   5 39956  38M 38408 S N   31M 25.6  1.1   3:08 oracle
  669 oracle     9   0  4780 4780  4384 S     492 24.4  0.1   1:42 oracle
    1 root      14   0   476  424   408 R       0 21.6  0.0   1:19 init
    2 root      14   0     0    0     0 RW      0 20.8  0.0   1:29 keventd
  615 oracle     9   0  8984 8984  8460 S    4380 16.3  0.2   2:41 oracle
  388 root       9   0   732  728   592 S       0 11.5  0.0   0:17 syslogd

kswapd bounces up and down from 99%.
Keys for me are the full system time, the fact that the %CPUs seem to
add up to more than 6xCPUs (6-way Xeon), and that processes that aren't
really active show up as "active".
ASAP, I'll try -pre6 and then -aa1 to compare behavior.
The Oracle stress query looks like:
select /*+ parallel(mt,5) cache(mt) */ count(*) from mtable_units ;
Thanks much,
--
Ken.
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-19 18:07 ` [VM] 2.4.14/15-pre4 too "swap-happy"? Linus Torvalds
2001-11-19 18:31 ` Ken Brownfield
@ 2001-11-19 19:44 ` Slo Mo Snail
1 sibling, 0 replies; 20+ messages in thread
From: Slo Mo Snail @ 2001-11-19 19:44 UTC (permalink / raw)
To: linux-kernel, Linus Torvalds
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Monday, 19 November 2001 at 19:07, Linus Torvalds wrote:
> On Mon, 19 Nov 2001, Sebastian Dröge wrote:
> > Hi,
> > I couldn't answer earlier because I had some problems with my ISP
> > the heavy swapping problem while burning a cd is solved in pre6aa1
> > but if you want i can do some statistics tomorrow
>
> Well, pre6aa1 performs really badly exactly because it by default doesn't
> swap enough even on _normal_ loads because Andrea is playing with some
> tuning (and see the bad results of that tuning in the VM testing by
> rwhron@earthlink.net).
>
> So the pre6aa1 numbers are kind of suspect - lack of swapping may not be
> due to fixing the problem, but due to bad tuning.
>
> Does plain pre6 solve it? Plain pre6 has a fix where a locked shared
> memory area would previously cause unnecessary swapping, and maybe the CD
> burning buffer is using shmlock..
Hi,
yes, plain pre6 seems to solve it, too. I can't be sure right now because I
have recorded only 3 CDs while running pre6.
pre6 swaps more than aa1, but so far I have had no buffer underruns, and much
of the swap appears in SwapCached.
The interactive performance seems to be much better in pre6 than in aa1, so
I'll stay with pre6 ;)
Bye
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org
iD8DBQE7+WEovIHrJes3kVIRAg+nAJ4issDSimDEal2I08CQHEoXBpGFLQCeNQ1x
AathQZ75U5nhnEZwTkR4WnI=
=lb0O
-----END PGP SIGNATURE-----
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-19 19:23 ` Linus Torvalds
@ 2001-11-19 23:39 ` Ken Brownfield
2001-11-19 23:52 ` Linus Torvalds
0 siblings, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-11-19 23:39 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
I went straight to the aa patch, and it looks like it either fixes the
problem or (because of the side-effects Linus mentioned) otherwise
prevents the issue:
  2:30pm  up 11 min,  4 users,  load average: 2.23, 2.18, 1.17
106 processes: 104 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: 14.7% user, 10.3% system,  0.0% nice, 74.9% idle
Mem:  3342304K av, 3013888K used,  328416K free,       0K shrd,    1224K buff
Swap: 1004052K av,  276824K used,  727228K free                 2862112K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
  722 oracle    12   0 13364  12M 11856 S    9.9M 29.5  0.3   2:24 oracle
  731 oracle    17   0 13488  12M 11980 D     10M 28.7  0.3   2:27 oracle
  728 oracle    12   0 13048  12M 11540 R    9816 20.8  0.3   2:22 oracle
  718 oracle    12   0  154M 153M  152M S    150M 17.9  4.7   2:22 oracle
  725 oracle    14   0 13472  12M 11964 S     10M 17.9  0.3   2:20 oracle
  734 oracle    12   0 13936  13M 12432 S     10M 15.3  0.4   2:27 oracle
    9 root       9   0     0    0     0 SW      0  4.3  0.0   0:27 kswapd

The machine went into swap immediately when the page cache stopped
growing and hovered at 100-400MB. Also, in my experience the page cache
will grow until there's only 5ishMB of free RAM, but with the aa patch
it looks like it stops at 320MB or maybe 10% of RAM. Was that the aa
patch, or part of -pre6?
It would be nice if that number were modifiable via /proc (writable
freepages again? 10% seems a tad high for many boxes) but I think it's
better to have a bit more purely free RAM available than 5MB.
kswapd isn't going nuts, but it seems to still be eating quite a bit of
CPU given plenty of RAM. And it seems to go pretty hard into swap -- I
would imagine that it's disadvantageous to do significant swapping
(based on age only?) in the presence of a massive page cache. I would
imagine the performance hit of a 2GB vs. 3GB page cache would be less
egregious than the time and I/O kswapd is causing without memory
pressure.
The Oracle SGA is set to ~522MB, with nothing else running except a
couple of sshds, getty, etc. Now that I'm looking, 2.8GB page cache
plus 328MB free adds up to about 3.1GB of RAM -- where does the 512MB
shared memory segment fit? Is it being swapped out in deference to page
cache?
Just my USD$0.02. I'll try vanilla -pre6 with profiling soon and post
results. Thanks for the tip Marcelo.
Thanks,
--
Ken.
brownfld@irridia.com
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-19 23:39 ` Ken Brownfield
@ 2001-11-19 23:52 ` Linus Torvalds
2001-11-20 0:18 ` M. Edward (Ed) Borasky
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Linus Torvalds @ 2001-11-19 23:52 UTC (permalink / raw)
To: Ken Brownfield; +Cc: linux-kernel, Andrea Arcangeli
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1887 bytes --]
On Mon, 19 Nov 2001, Ken Brownfield wrote:
>
> I went straight to the aa patch, and it looks like it either fixes the
> problem or (because of the side-effects Linus mentioned) otherwise
> prevents the issue:
So is this pre6aa1, or pre6 + just the watermark patch?
> The machine went into swap immediately when the page cache stopped
> growing and hovered at 100-400MB. Also, in my experience the page cache
> will grow until there's only 5ishMB of free RAM, but with the aa patch
> it looks like it stops at 320MB or maybe 10% of RAM. Was that the aa
> patch, or part of -pre6?
That was the watermarking. The way Andrea did it, the page cache will
basically refuse to touch as much of the "normal" page zone, because it
would prefer to allocate more from highmem..
I think it's excessive to have 320MB free memory, though, that's just
an insane waste. I suspect that the real number should be somewhere
between the old behaviour and the new one. You can tweak the behaviour of
andrea's kernel by changing the "reserved" page numbers, but I'd like to
hear whether my simpler approach works too..
> The Oracle SGA is set to ~522MB, with nothing else running except a
> couple of sshds, getty, etc. Now that I'm looking, 2.8GB page cache
> plus 328MB free adds up to about 3.1GB of RAM -- where does the 512MB
> shared memory segment fit? Is it being swapped out in deference to page
> cache?
Shared memory actually uses the page cache too, so it will be accounted
for in the 2.8GB number.
Anyway, can you try plain vanilla pre6, with the appended patch? This is
my suggested simplified version of what Andrea tried to do, and it should
try to keep only a few extra megs of memory free in the low memory
regions, not 300+ MB.
(and the profiling would be interesting regardless, but I think Andrea did
find the real problem, his fix just seems a bit of an overkill ;)
Linus
[-- Attachment #2: Type: TEXT/PLAIN, Size: 1839 bytes --]
diff -u --recursive --new-file pre6/linux/mm/page_alloc.c linux/mm/page_alloc.c
--- pre6/linux/mm/page_alloc.c	Sat Nov 17 19:07:43 2001
+++ linux/mm/page_alloc.c	Mon Nov 19 15:13:36 2001
@@ -299,29 +299,26 @@
 	return page;
 }
 
-static inline unsigned long zone_free_pages(zone_t * zone, unsigned int order)
-{
-	long free = zone->free_pages - (1UL << order);
-	return free >= 0 ? free : 0;
-}
-
 /*
  * This is the 'heart' of the zoned buddy allocator:
  */
 struct page * __alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist)
 {
+	unsigned long min;
 	zone_t **zone, * classzone;
 	struct page * page;
 	int freed;
 
 	zone = zonelist->zones;
 	classzone = *zone;
+	min = 1UL << order;
 	for (;;) {
 		zone_t *z = *(zone++);
 		if (!z)
 			break;
 
-		if (zone_free_pages(z, order) > z->pages_low) {
+		min += z->pages_low;
+		if (z->free_pages > min) {
 			page = rmqueue(z, order);
 			if (page)
 				return page;
@@ -334,16 +331,18 @@
 		wake_up_interruptible(&kswapd_wait);
 
 	zone = zonelist->zones;
+	min = 1UL << order;
 	for (;;) {
-		unsigned long min;
+		unsigned long local_min;
 		zone_t *z = *(zone++);
 		if (!z)
 			break;
 
-		min = z->pages_min;
+		local_min = z->pages_min;
 		if (!(gfp_mask & __GFP_WAIT))
-			min >>= 2;
-		if (zone_free_pages(z, order) > min) {
+			local_min >>= 2;
+		min += local_min;
+		if (z->free_pages > min) {
 			page = rmqueue(z, order);
 			if (page)
 				return page;
@@ -376,12 +375,14 @@
 			return page;
 
 	zone = zonelist->zones;
+	min = 1UL << order;
 	for (;;) {
 		zone_t *z = *(zone++);
 		if (!z)
 			break;
 
-		if (zone_free_pages(z, order) > z->pages_min) {
+		min += z->pages_min;
+		if (z->free_pages > min) {
 			page = rmqueue(z, order);
 			if (page)
 				return page;
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-19 23:52 ` Linus Torvalds
@ 2001-11-20 0:18 ` M. Edward (Ed) Borasky
2001-11-20 0:25 ` Ken Brownfield
2001-11-20 3:09 ` Ken Brownfield
2 siblings, 0 replies; 20+ messages in thread
From: M. Edward (Ed) Borasky @ 2001-11-20 0:18 UTC (permalink / raw)
To: linux-kernel
On a related note, the files "/usr/src/linux/Documentation/filesystems/proc.txt"
and "sysctl/vm.txt" refer to some variables I need to be able to set on a
system running 2.4.12. In particular, I need to be able to get to the values
in "/proc/sys/vm/freepages", "/proc/sys/vm/buffermem" and
"/proc/sys/vm/pagecache". However, despite their existence in the documentation
files, these files don't exist on a 2.4.12 system. How can I read and set these
values on a 2.4.12 system?
--
znmeb@aracnet.com (M. Edward Borasky) http://www.aracnet.com/~znmeb
Relax! Run Your Own Brain with Neuro-Semantics!
http://www.meta-trading-coach.com
"Outside of a dog, a book is a man's best friend. Inside a dog, it's
too dark to read." -- Marx
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-19 23:52 ` Linus Torvalds
2001-11-20 0:18 ` M. Edward (Ed) Borasky
@ 2001-11-20 0:25 ` Ken Brownfield
2001-11-20 0:31 ` Linus Torvalds
2001-11-20 3:09 ` Ken Brownfield
2 siblings, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-11-20 0:25 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, Andrea Arcangeli
On Mon, Nov 19, 2001 at 03:52:44PM -0800, Linus Torvalds wrote:
|
| On Mon, 19 Nov 2001, Ken Brownfield wrote:
| >
| > I went straight to the aa patch, and it looks like it either fixes the
| > problem or (because of the side-effects Linus mentioned) otherwise
| > prevents the issue:
|
| So is this pre6aa1, or pre6 + just the watermark patch?
I'm currently using -pre6 with his separately-posted zone-watermark-1
patch. Sorry, I should have been clearer.
| > The machine went into swap immediately when the page cache stopped
| > growing and hovered at 100-400MB. Also, in my experience the page cache
| > will grow until there's only 5ishMB of free RAM, but with the aa patch
| > it looks like it stops at 320MB or maybe 10% of RAM. Was that the aa
| > patch, or part of -pre6?
|
| That was the watermarking. The way Andrea did it, the page cache will
| basically refuse to touch as much of the "normal" page zone, because it
| would prefer to allocate more from highmem..
|
| I think it's excessive to have 320MB free memory, though, that's just
| an insane waste. I suspect that the real number should be somewhere
| between the old behaviour and the new one. You can tweak the behaviour of
| andrea's kernel by changing the "reserved" page numbers, but I'd like to
| hear whether my simpler approach works too..
Yeah, maybe a tiered default would be best, IMHO. 5MB on a 3GB box
does, on the other hand, seem anemic.
| > The Oracle SGA is set to ~522MB, with nothing else running except a
| > couple of sshds, getty, etc. Now that I'm looking, 2.8GB page cache
| > plus 328MB free adds up to about 3.1GB of RAM -- where does the 512MB
| > shared memory segment fit? Is it being swapped out in deference to page
| > cache?
|
| Shared memory actually uses the page cache too, so it will be accounted
| for in the 2.8GB number.
My bad, should have realized.
| Anyway, can you try plain vanilla pre6, with the appended patch? This is
| my suggested simplified version of what Andrea tried to do, and it should
| try to keep only a few extra megs of memory free in the low memory
| regions, not 300+ MB.
|
| (and the profiling would be interesting regardless, but I think Andrea did
| find the real problem, his fix just seems a bit of an overkill ;)
|
| Linus
I'll try this patch ASAP.
Thanks a LOT to all involved,
--
Ken.
brownfld@irridia.com
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-20 0:25 ` Ken Brownfield
@ 2001-11-20 0:31 ` Linus Torvalds
0 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2001-11-20 0:31 UTC (permalink / raw)
To: Ken Brownfield; +Cc: linux-kernel, Andrea Arcangeli
On Mon, 19 Nov 2001, Ken Brownfield wrote:
> |
> | So is this pre6aa1, or pre6 + just the watermark patch?
>
> I'm currently using -pre6 with his separately-posted zone-watermark-1
> patch. Sorry, I should have been clearer.
Good. That removes the other variables from the equation, ie it's not an
effect of some of the other tweaking in the -aa patches.
> Yeah, maybe a tiered default would be best, IMHO. 5MB on a 3GB box
> does, on the other hand, seem anemic.
Yeah, the 5MB _is_ anemic. It comes from the fact that we decide to never
bother having more than zone_balance_max[] pages free, even if we have
tons of memory. And zone_balance_max[] is fairly small, it limits us to
255 free pages per zone (for pages_min - with "pages_low" being twice that).
So you get 3 zones, with 255*2 pages free max each, except the DMA zone
has much less just because it's smaller. Thus 5MB.
There's no real reason for having zone_balance_max[] at all - without it
we'd just always try to keep about 1/128th of memory free, which would be
about 24MB on a 3GB box. Which is probably not a bad idea.
With my "simplified-Andrea" patch, you should see slightly more than 5MB
free, but not a lot more. A HIGHMEM allocation now wants to leave an
"extra" 510 pages in NORMAL, and even more in the DMA zone, so you should
see something like maybe 12-15 MB free instead of 300MB.
(Wild hand-waving number, I'm too lazy to actually do the math, and I
haven't even tested that the simple patch works at all - I think I forgot
to mention that small detail ;)
Linus
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-19 23:52 ` Linus Torvalds
2001-11-20 0:18 ` M. Edward (Ed) Borasky
2001-11-20 0:25 ` Ken Brownfield
@ 2001-11-20 3:09 ` Ken Brownfield
2001-11-20 3:30 ` Linus Torvalds
2001-11-20 3:32 ` Andrea Arcangeli
2 siblings, 2 replies; 20+ messages in thread
From: Ken Brownfield @ 2001-11-20 3:09 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, Andrea Arcangeli
Well, I think you'll be pleased to hear that your untested patch
compiled, booted, _and_ fixed the problem. :)
The minimum free RAM was about 9.8-11MB (matching your guesstimate) and
kswapd seemed to behave the same as the watermark patch. The results of
top were basically the same, so I'm omitting it.
However, I do have some profiling numbers, thanks to Marcelo. Attached
are numbers from "readprofile | sort -nr +2 | head -20". I think the
pre4 numbers point to shrink_cache, prune_icache, and statm_pgd_range.
The other two might have significance for wizards, but statistically
don't stand out to me, except maybe statm_pgd_range.
I reset the counters just before starting Oracle and the stress test. I
think a -pre7 with a blessed patch would be good, since my testing was
very narrow.
I'll test new kernels as I hear new info.
Thanks much!
--
Ken.
brownfld@irridia.com
2.4.15-pre4 with your original patch:
(shorter time period since the machine went to hell fast)
(matches vanilla behaviour)
 164536 default_idle                 3164.1538
 101562 shrink_cache                  113.8587
   3683 prune_icache                   13.5404
   3034 file_read_actor                12.2339
    914 DAC960_BA_InterruptHandler      5.5732
   1128 statm_pgd_range                 2.9072
     40 page_cache_release              0.8333
     31 add_page_to_hash_queue          0.5167
     89 page_cache_read                 0.4363
     25 remove_inode_page               0.4167
     26 unlock_page                     0.3095
    509 __make_request                  0.3008
     66 smp_call_function               0.2946
     21 set_bh_page                     0.2917
      9 __brelse                        0.2812
     90 try_to_free_buffers             0.2778
     13 mark_page_accessed              0.2708
      8 __free_pages                    0.2500
     43 get_hash_table                  0.2443
     42 activate_page                   0.2234
2.4.15-pre6 with watermark patch:
1617446 default_idle                31104.7308
  27599 DAC960_BA_InterruptHandler    168.2866
  38918 file_read_actor               156.9274
    528 page_cache_release             11.0000
    554 add_page_to_hash_queue          9.2333
  15487 __make_request                  9.1531
   3453 statm_pgd_range                 8.8995
    514 remove_inode_page               8.5667
   1453 blk_init_free_list              7.2650
    377 set_bh_page                     5.2361
    898 page_cache_read                 4.4020
    590 add_to_page_cache_unique        4.3382
    136 __brelse                        4.2500
   1120 kmem_cache_alloc                3.8356
    628 kunmap_high                     3.7381
   1189 try_to_free_buffers             3.6698
    625 get_hash_table                  3.5511
    439 lru_cache_add                   3.4297
   1715 rmqueue                         3.0194
    105 remove_wait_queue               2.9167
2.4.15-pre6 with Linus patch:
1249875 default_idle                24036.0577
  65324 file_read_actor               263.4032
  36979 DAC960_BA_InterruptHandler    225.4817
   9809 statm_pgd_range                25.2809
   1039 page_cache_release             21.6458
    994 add_page_to_hash_queue         16.5667
    922 remove_inode_page              15.3667
   2409 blk_init_free_list             12.0450
  20159 __make_request                 11.9143
   1198 lru_cache_add                   9.3594
   1628 page_cache_read                 7.9804
    987 add_to_page_cache_unique        7.2574
   2202 try_to_free_buffers             6.7963
   1038 get_unused_buffer_head          6.6538
    484 unlock_page                     5.7619
   3182 rmqueue                         5.6021
    874 kunmap_high                     5.2024
    164 __brelse                        5.1250
    900 get_hash_table                  5.1136
    357 set_bh_page                     4.9583
On Mon, Nov 19, 2001 at 03:52:44PM -0800, Linus Torvalds wrote:
|
| On Mon, 19 Nov 2001, Ken Brownfield wrote:
| >
| > I went straight to the aa patch, and it looks like it either fixes the
| > problem or (because of the side-effects Linus mentioned) otherwise
| > prevents the issue:
|
| So is this pre6aa1, or pre6 + just the watermark patch?
|
| > The machine went into swap immediately when the page cache stopped
| > growing and hovered at 100-400MB. Also, in my experience the page cache
| > will grow until there's only 5ishMB of free RAM, but with the aa patch
| > it looks like it stops at 320MB or maybe 10% of RAM. Was that the aa
| > patch, or part of -pre6?
|
| That was the watermarking. The way Andrea did it, the page cache will
| basically refuse to touch as much of the "normal" page zone, because it
| would prefer to allocate more from highmem..
|
| I think it's excessive to have 320MB free memory, though, that's just
| an insane waste. I suspect that the real number should be somewhere
| between the old behaviour and the new one. You can tweak the behaviour of
| andrea's kernel by changing the "reserved" page numbers, but I'd like to
| hear whether my simpler approach works too..
|
| > The Oracle SGA is set to ~522MB, with nothing else running except a
| > couple of sshds, getty, etc. Now that I'm looking, 2.8GB page cache
| > plus 328MB free adds up to about 3.1GB of RAM -- where does the 512MB
| > shared memory segment fit? Is it being swapped out in deference to page
| > cache?
|
| Shared memory actually uses the page cache too, so it will be accounted
| for in the 2.8GB number.
|
| Anyway, can you try plain vanilla pre6, with the appended patch? This is
| my suggested simplified version of what Andrea tried to do, and it should
| try to keep only a few extra megs of memory free in the low memory
| regions, not 300+ MB.
|
| (and the profiling would be interesting regardless, but I think Andrea did
| find the real problem, his fix just seems a bit of an overkill ;)
|
| Linus
| diff -u --recursive --new-file pre6/linux/mm/page_alloc.c linux/mm/page_alloc.c
| --- pre6/linux/mm/page_alloc.c Sat Nov 17 19:07:43 2001
| +++ linux/mm/page_alloc.c Mon Nov 19 15:13:36 2001
| @@ -299,29 +299,26 @@
| return page;
| }
|
| -static inline unsigned long zone_free_pages(zone_t * zone, unsigned int order)
| -{
| - long free = zone->free_pages - (1UL << order);
| - return free >= 0 ? free : 0;
| -}
| -
| /*
| * This is the 'heart' of the zoned buddy allocator:
| */
| struct page * __alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist)
| {
| + unsigned long min;
| zone_t **zone, * classzone;
| struct page * page;
| int freed;
|
| zone = zonelist->zones;
| classzone = *zone;
| + min = 1UL << order;
| for (;;) {
| zone_t *z = *(zone++);
| if (!z)
| break;
|
| - if (zone_free_pages(z, order) > z->pages_low) {
| + min += z->pages_low;
| + if (z->free_pages > min) {
| page = rmqueue(z, order);
| if (page)
| return page;
| @@ -334,16 +331,18 @@
| wake_up_interruptible(&kswapd_wait);
|
| zone = zonelist->zones;
| + min = 1UL << order;
| for (;;) {
| - unsigned long min;
| + unsigned long local_min;
| zone_t *z = *(zone++);
| if (!z)
| break;
|
| - min = z->pages_min;
| + local_min = z->pages_min;
| if (!(gfp_mask & __GFP_WAIT))
| - min >>= 2;
| - if (zone_free_pages(z, order) > min) {
| + local_min >>= 2;
| + min += local_min;
| + if (z->free_pages > min) {
| page = rmqueue(z, order);
| if (page)
| return page;
| @@ -376,12 +375,14 @@
| return page;
|
| zone = zonelist->zones;
| + min = 1UL << order;
| for (;;) {
| zone_t *z = *(zone++);
| if (!z)
| break;
|
| - if (zone_free_pages(z, order) > z->pages_min) {
| + min += z->pages_min;
| + if (z->free_pages > min) {
| page = rmqueue(z, order);
| if (page)
| return page;
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-20 3:09 ` Ken Brownfield
@ 2001-11-20 3:30 ` Linus Torvalds
2001-11-20 3:32 ` Andrea Arcangeli
1 sibling, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2001-11-20 3:30 UTC (permalink / raw)
To: Ken Brownfield; +Cc: linux-kernel, Andrea Arcangeli
On Mon, 19 Nov 2001, Ken Brownfield wrote:
>
> Well, I think you'll be pleased to hear that your untested patch
> compiled, booted, _and_ fixed the problem. :)
Good. The patch itself was fairly simple and the problem was
straightforward; the real credit for the fix goes to Andrea for thinking
about what was wrong with the old code.
> The minimum free RAM was about 9.8-11MB (matching your guesstimate) and
> kswapd seemed to behave the same as the watermark patch. The results of
> top were basically the same, so I'm omitting it.
All right. I think 10MB free for a 3GB machine is good - and we can easily
tweak the zone_balance_max[] numbers if somebody comes to the conclusion
that it's better to have more free. It's about .3% of RAM, so it's small
enough that it's certainly not too much, and yet at the same time it's
probably enough to give reasonable behaviour in a temporary memory crunch.
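To make the effect of the patch concrete, here is a minimal user-space model of the cumulative check it introduces in the first loop of __alloc_pages(). The struct, the function name pick_zone() and the page counts in the note below are made up for illustration; only the "min += pages_low" accumulation is taken from the diff quoted earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's zone_t: only the fields the check reads. */
struct zone_model {
    unsigned long free_pages;
    unsigned long pages_low;
};

/*
 * Walk a fallback list (preferred zone first) and return the index of the
 * first zone that passes the cumulative watermark check, or -1 if none does.
 * The watermark grows as we fall back, so later (lower) zones must have more
 * free pages before they are used.
 */
int pick_zone(const struct zone_model *zones, size_t nzones, unsigned int order)
{
    unsigned long min;
    size_t i;

    assert(order < 8 * sizeof(unsigned long));
    min = 1UL << order;                 /* pages needed by the request itself */
    for (i = 0; i < nzones; i++) {
        min += zones[i].pages_low;      /* accumulate, as in the patch */
        if (zones[i].free_pages > min)
            return (int)i;
    }
    return -1;
}
```

With a hypothetical fallback list of three zones, each with pages_low = 510 and with 100, 1000 and 2000 free pages respectively, an order-0 request skips the first two zones and lands on the third: the second zone would need more than 1021 free pages rather than 511, because the first zone's watermark has already been added. That accumulated margin is what keeps a few megabytes free in the lower zones.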
> However, I do have some profiling numbers, thanks to Marcelo. Attached
> are numbers from "readprofile | sort -nr +2 | head -20". I think the
> pre4 numbers point to shrink_cache, prune_icache, and statm_pgd_range.
> The other two might have significance for wizards, but statistically
> don't stand out to me, except maybe statm_pgd_range.
I'd say that this clearly shows that yes, 2.4.14 did the wrong thing, and
wasted time in shrink_cache() without making any real progress. The two
other profiles look reasonable to me - nothing stands out that shouldn't.
(yeah, we spend _much_ too much time doing VM statistics with "top", and
the only way to get rid of that would be to add a per-vma "rss" field.
Which might not be a bad idea, but it's not a high priority for me).
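A rough illustration of the per-vma "rss" idea, as a user-space model rather than kernel code: if each mapping keeps its own resident-page counter, updated whenever a page is mapped or unmapped, then reporting a process's RSS is a short walk over its mappings instead of a walk over every page-table entry. All names here (vma_model, vma_page_mapped, mm_total_rss) are hypothetical and do not correspond to anything in the 2.4 tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a mapping that tracks its own resident pages. */
struct vma_model {
    unsigned long start, end;   /* address range covered by the mapping */
    unsigned long rss;          /* resident pages, kept up to date */
    struct vma_model *next;     /* next mapping of the same process */
};

/* Called wherever a page would be mapped into the range. */
void vma_page_mapped(struct vma_model *vma)
{
    vma->rss++;
}

/* Called wherever a page would be unmapped from the range. */
void vma_page_unmapped(struct vma_model *vma)
{
    assert(vma->rss > 0);
    vma->rss--;
}

/* Cost is proportional to the number of mappings, not the number of pages. */
unsigned long mm_total_rss(const struct vma_model *head)
{
    unsigned long total = 0;

    for (; head != NULL; head = head->next)
        total += head->rss;
    return total;
}
```

The trade-off in this model is a small amount of extra bookkeeping on every map and unmap in exchange for cheap statistics when something like "top" asks for them.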
> I reset the counters just before starting Oracle and the stress test. I
> think a -pre7 with a blessed patch would be good, since my testing was
> very narrow.
Sure, I'll do a pre7. This closes my last behaviour issue with the VM,
although I'm sure we'll end up spending tons of time chasing bugs still
(both VM and not).
Linus
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-20 3:09 ` Ken Brownfield
2001-11-20 3:30 ` Linus Torvalds
@ 2001-11-20 3:32 ` Andrea Arcangeli
2001-11-20 5:54 ` Ken Brownfield
1 sibling, 1 reply; 20+ messages in thread
From: Andrea Arcangeli @ 2001-11-20 3:32 UTC (permalink / raw)
To: Ken Brownfield; +Cc: Linus Torvalds, linux-kernel
On Mon, Nov 19, 2001 at 09:09:41PM -0600, Ken Brownfield wrote:
> Well, I think you'll be pleased to hear that your untested patch
> compiled, booted, _and_ fixed the problem. :)
Can you try to run an updatedb constantly in background?
Andrea
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-20 3:32 ` Andrea Arcangeli
@ 2001-11-20 5:54 ` Ken Brownfield
2001-11-20 6:50 ` Linus Torvalds
0 siblings, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-11-20 5:54 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel
kswapd goes up to 5-10% CPU (vs 3-6) but it finishes without issue or
apparent interactivity problems. I'm keeping it in while( 1 ), but it's
been predictable so far.
3-10 is a lot better than 99, but is kswapd really going to eat that
much CPU in an essentially allocation-less state?
But certainly you found the right thing.
Thx all!
--
Ken.
brownfld@irridia.com
On Tue, Nov 20, 2001 at 04:32:23AM +0100, Andrea Arcangeli wrote:
| On Mon, Nov 19, 2001 at 09:09:41PM -0600, Ken Brownfield wrote:
| > Well, I think you'll be pleased to hear that your untested patch
| > compiled, booted, _and_ fixed the problem. :)
|
| Can you try to run an updatedb constantly in background?
|
| Andrea
| -
| To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
| the body of a message to majordomo@vger.kernel.org
| More majordomo info at http://vger.kernel.org/majordomo-info.html
| Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
2001-11-20 5:54 ` Ken Brownfield
@ 2001-11-20 6:50 ` Linus Torvalds
2001-12-01 13:15 ` Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?) Ken Brownfield
0 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2001-11-20 6:50 UTC (permalink / raw)
To: linux-kernel
In article <20011119235422.F10597@asooo.flowerfire.com>,
Ken Brownfield <brownfld@irridia.com> wrote:
>kswapd goes up to 5-10% CPU (vs 3-6) but it finishes without issue or
>apparent interactivity problems. I'm keeping it in while( 1 ), but it's
>been predictable so far.
>
>3-10 is a lot better than 99, but is kswapd really going to eat that
>much CPU in an essentially allocation-less state?
Well, it's obviously not allocation-less: updatedb will really hit on
the dcache and icache (which are both in the NORMAL zone only, which is
why Andrea asked for it), and obviously your Oracle load itself seems to
be happily paging stuff around, which causes a lot of allocations for
page-ins.
It only _looks_ static, because once you find the proper "balance", the
VM numbers themselves shouldn't change under a constant load.
We could make kswapd use less CPU time, of course, simply by making the
actual working processes do more of the work to free memory. The total
work ends up being the same, though, and the advantage of kswapd is that
it tends to make the freeing slightly more asynchronous, which helps
throughput.
The _disadvantage_ of kswapd is that if it goes crazy and uses up all
CPU time, you get bad results ;)
But it doesn't sound crazy in your load. I'd be happier if the VM took
less CPU, of course, but for now we seem to be doing ok.
Linus
^ permalink raw reply [flat|nested] 20+ messages in thread
* Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)
2001-11-20 6:50 ` Linus Torvalds
@ 2001-12-01 13:15 ` Ken Brownfield
2001-12-08 13:12 ` Ken Brownfield
0 siblings, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-12-01 13:15 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
When updatedb kicked off on my 2.4.16 6-way Xeon 4GB box this morning, I
had an unfortunate flashback:
5:02am up 2 days, 1 min, 59 users, load average: 5.66, 4.86, 3.60
741 processes: 723 sleeping, 4 running, 0 zombie, 14 stopped
CPU states: 0.2% user, 77.3% system, 0.0% nice, 22.3% idle
Mem: 3351664K av, 3346504K used, 5160K free, 0K shrd, 498048K buff
Swap: 1052248K av, 282608K used, 769640K free 2531892K cached
PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
2117 root 15 5 580 580 408 R N 0 99.9 0.0 17:19 updatedb
2635 kb 12 0 1696 1556 1216 R 0 99.9 0.0 4:16 smbd
2672 root 17 10 4212 4212 492 D N 0 94.7 0.1 1:39 rsync
2609 root 2 -20 1284 1284 672 R < 0 81.2 0.0 4:02 top
9 root 9 0 0 0 0 SW 0 80.7 0.0 42:50 kswapd
22879 kb 9 0 11548 6316 1684 S 0 11.8 0.1 7:33 smbd
Under varied load I'm not seeing the kswapd issue, but it looks like
updatedb combined with one or two samba transfers does still reproduce
the problem easily, and adding rsync or NFS transfers to the mix makes
kswapd peg at 99%.
I noticed because I was trying to do kernel patches and compiles using a
partition NFS-mounted from this machine. I guess it sometimes pays to
be up at 5am...
Unfortunately it's difficult for me to reboot this machine to update the
kernel (59 users) but I will try to reproduce the problem on a separate
machine this weekend or early next week. And I don't have profiling on,
so that will have to wait as well. :-(
Andrea, do you have a patch vs. 2.4.16 of your original solution to this
problem that I could test out? I'd rather just change one thing at a
time rather than switching completely to an -aa kernel.
Grrrr!
Thanks much,
--
Ken.
brownfld@irridia.com
On Tue, Nov 20, 2001 at 06:50:50AM +0000, Linus Torvalds wrote:
| In article <20011119235422.F10597@asooo.flowerfire.com>,
| Ken Brownfield <brownfld@irridia.com> wrote:
| >kswapd goes up to 5-10% CPU (vs 3-6) but it finishes without issue or
| >apparent interactivity problems. I'm keeping it in while( 1 ), but it's
| >been predictable so far.
| >
| >3-10 is a lot better than 99, but is kswapd really going to eat that
| >much CPU in an essentially allocation-less state?
|
| Well, it's obviously not allocation-less: updatedb will really hit on
| the dcache and icache (which are both in the NORMAL zone only, which is
| why Andrea asked for it), and obviously your Oracle load itself seems to
| be happily paging stuff around, which causes a lot of allocations for
| page-ins.
|
| It only _looks_ static, because once you find the proper "balance", the
| VM numbers themselves shouldn't change under a constant load.
|
| We could make kswapd use less CPU time, of course, simply by making the
| actual working processes do more of the work to free memory. The total
| work ends up being the same, though, and the advantage of kswapd is that
| it tends to make the freeing slightly more asynchronous, which helps
| throughput.
|
| The _disadvantage_ of kswapd is that if it goes crazy and uses up all
| CPU time, you get bad results ;)
|
| But it doesn't sound crazy in your load. I'd be happier if the VM took
| less CPU, of course, but for now we seem to be doing ok.
|
| Linus
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)
2001-12-01 13:15 ` Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?) Ken Brownfield
@ 2001-12-08 13:12 ` Ken Brownfield
2001-12-09 18:51 ` Marcelo Tosatti
0 siblings, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-12-08 13:12 UTC (permalink / raw)
To: linux-kernel
Just a quick followup to this, which is still a near show-stopper issue
for me.
This is easy to reproduce for me if I run updatedb locally, and then run
updatedb on a remote machine that's scanning an NFS-mounted filesystem
from the original local machine. Instant kswapd saturation, especially
on large filesystems.
Doing updatedb on NFS-mounted filesystems also seems to cause kswapd to
peg on the NFS-client side as well.
I recently realized that slocate (at least on RH6.2 w/ 2.4 kernels) does
not seem to properly detect NFS when provided "-f nfs"... Urgh.
Also something I noticed in slabinfo (other info below):
inode_cache 369188 1027256 480 59716 128407 1 : 124 62
dentry_cache 256380 705510 128 14946 23517 1 : 252 126
buffer_head 46961 47800 96 1195 1195 1 : 252 126
That seems like a TON of {dentry,inode}_cache on a 1GB (HIGHMEM) machine.
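To put a number on "a TON", the slab lines above can be turned into a memory footprint. This is a sketch under two assumptions that the thread does not state: that the columns follow the 2.4 slabinfo layout (name, active objects, total objects, object size, active slabs, total slabs, pages per slab) and that pages are 4 KiB. The struct and function names are invented for the example.

```c
#include <assert.h>
#include <stdio.h>

#define MODEL_PAGE_SIZE 4096ULL   /* assumption: 4 KiB pages, as on i386 */

/* One parsed slabinfo line (assumed 2.4 column order). */
struct slab_line {
    char name[32];
    unsigned long active_objs, total_objs, obj_size;
    unsigned long active_slabs, total_slabs, pages_per_slab;
};

/* Parse the first seven fields of a line; returns 1 on success, 0 otherwise. */
int parse_slab_line(const char *line, struct slab_line *out)
{
    int n;

    assert(line != NULL && out != NULL);
    n = sscanf(line, "%31s %lu %lu %lu %lu %lu %lu",
               out->name, &out->active_objs, &out->total_objs,
               &out->obj_size, &out->active_slabs, &out->total_slabs,
               &out->pages_per_slab);
    return n == 7;
}

/* Memory held by the cache: every allocated slab counts, full or not. */
unsigned long long slab_footprint_bytes(const struct slab_line *s)
{
    return (unsigned long long)s->total_slabs * s->pages_per_slab
           * MODEL_PAGE_SIZE;
}
```

Under those assumptions the inode_cache line works out to 128407 pages, or roughly 502 MB, and the dentry_cache line to 23517 pages, or roughly 92 MB. Together that is about two thirds of the 898300 kB LowTotal reported below, which would fit the suspicion that these caches are what kswapd is struggling with.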
I'd try 10_vm-19 but it doesn't apply cleanly for me.
Thanks for any input or ports of 10_vm-19 to 2.4.17-pre6. ;)
--
Ken.
brownfld@irridia.com
total: used: free: shared: buffers: cached:
Mem: 1054011392 900526080 153485312 0 67829760 174866432
Swap: 2149548032 581632 2148966400
MemTotal: 1029308 kB
MemFree: 149888 kB
MemShared: 0 kB
Buffers: 66240 kB
Cached: 170376 kB
SwapCached: 392 kB
Active: 202008 kB
Inactive: 40380 kB
HighTotal: 131008 kB
HighFree: 30604 kB
LowTotal: 898300 kB
LowFree: 119284 kB
SwapTotal: 2099168 kB
SwapFree: 2098600 kB
Mem: 1029308K av, 886144K used, 143164K free, 0K shrd, 66240K buff
Swap: 2099168K av, 568K used, 2098600K free 170872K cached
On Sat, Dec 01, 2001 at 07:15:02AM -0600, Ken Brownfield wrote:
| When updatedb kicked off on my 2.4.16 6-way Xeon 4GB box this morning, I
| had an unfortunate flashback:
|
| 5:02am up 2 days, 1 min, 59 users, load average: 5.66, 4.86, 3.60
| 741 processes: 723 sleeping, 4 running, 0 zombie, 14 stopped
| CPU states: 0.2% user, 77.3% system, 0.0% nice, 22.3% idle
| Mem: 3351664K av, 3346504K used, 5160K free, 0K shrd, 498048K buff
| Swap: 1052248K av, 282608K used, 769640K free 2531892K cached
|
| PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
| 2117 root 15 5 580 580 408 R N 0 99.9 0.0 17:19 updatedb
| 2635 kb 12 0 1696 1556 1216 R 0 99.9 0.0 4:16 smbd
| 2672 root 17 10 4212 4212 492 D N 0 94.7 0.1 1:39 rsync
| 2609 root 2 -20 1284 1284 672 R < 0 81.2 0.0 4:02 top
| 9 root 9 0 0 0 0 SW 0 80.7 0.0 42:50 kswapd
| 22879 kb 9 0 11548 6316 1684 S 0 11.8 0.1 7:33 smbd
|
| Under varied load I'm not seeing the kswapd issue, but it looks like
| updatedb combined with one or two samba transfers does still reproduce
| the problem easily, and adding rsync or NFS transfers to the mix makes
| kswapd peg at 99%.
|
| I noticed because I was trying to do kernel patches and compiles using a
| partition NFS-mounted from this machine. I guess it sometimes pays to
| be up at 5am...
|
| Unfortunately it's difficult for me to reboot this machine to update the
| kernel (59 users) but I will try to reproduce the problem on a separate
| machine this weekend or early next week. And I don't have profiling on,
| so that will have to wait as well. :-(
|
| Andrea, do you have a patch vs. 2.4.16 of your original solution to this
| problem that I could test out? I'd rather just change one thing at a
| time rather than switching completely to an -aa kernel.
|
| Grrrr!
|
| Thanks much,
| --
| Ken.
| brownfld@irridia.com
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)
2001-12-08 13:12 ` Ken Brownfield
@ 2001-12-09 18:51 ` Marcelo Tosatti
2001-12-10 6:56 ` Ken Brownfield
0 siblings, 1 reply; 20+ messages in thread
From: Marcelo Tosatti @ 2001-12-09 18:51 UTC (permalink / raw)
To: Ken Brownfield; +Cc: linux-kernel
On Sat, 8 Dec 2001, Ken Brownfield wrote:
> Just a quick followup to this, which is still a near show-stopper issue
> for me.
>
> This is easy to reproduce for me if I run updatedb locally, and then run
> updatedb on a remote machine that's scanning an NFS-mounted filesystem
> from the original local machine. Instant kswapd saturation, especially
> on large filesystems.
>
> Doing updatedb on NFS-mounted filesystems also seems to cause kswapd to
> peg on the NFS-client side as well.
Can you reproduce the problem without the over NFS updatedb?
Thanks
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)
2001-12-09 18:51 ` Marcelo Tosatti
@ 2001-12-10 6:56 ` Ken Brownfield
0 siblings, 0 replies; 20+ messages in thread
From: Ken Brownfield @ 2001-12-10 6:56 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: linux-kernel
Yes, any kind of fairly heavy, spread-out I/O combined with updatedb
will do the trick, like samba. NFS isn't required, it just seems to be
a particularly good trigger.
Actually, it seems that anything that hits the inode/dentry caches hard
will trigger it, and it doesn't always happen when freepages (or its
2.4.x equivalent) has been hit. I had a little applet that malloc'ed and
memcpy'ed 1GB of RAM and exited; that used to help, but it doesn't really
help now like it did before 2.4.15-pre[56].
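The applet itself is not shown anywhere in the thread. As a guess at its general shape, the following sketch allocates a buffer, writes to every page of it with memcpy so the kernel has to back it with real memory, and then frees it. The function name churn_memory() and the 4 KiB chunk size are assumptions, not details from the original program.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate `bytes` of memory, copy a fill pattern over all of it so that
 * every page is actually touched, then free it.
 * Returns the number of bytes written, or 0 if the allocation failed.
 */
size_t churn_memory(size_t bytes)
{
    enum { CHUNK = 4096 };          /* assumption: one page per copy */
    char pattern[CHUNK];
    char *buf;
    size_t off, n;

    buf = malloc(bytes);
    if (buf == NULL)
        return 0;

    memset(pattern, 0xA5, sizeof pattern);
    for (off = 0; off < bytes; off += n) {
        n = bytes - off < CHUNK ? bytes - off : CHUNK;
        memcpy(buf + off, pattern, n);
    }

    /* Read one byte back so the writes above are observably used. */
    assert(bytes == 0 || (unsigned char)buf[bytes - 1] == 0xA5);
    free(buf);
    return off;
}
```

Calling churn_memory() with 1GB from a small main() and exiting would approximate what is described above: the allocation pushes cached data out, and the memory is handed back when the process exits.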
It also happens for me a lot more with my 4GB machines, though I have
seen it on my 1GB HIGHMEM boxes as well. If the problem is related to
scanning the cache, perhaps more RAM simply makes it worse.
I'm planning on trying Andrew Morton's patches as soon as I'm able.
Thanks,
--
Ken.
brownfld@irridia.com
On Sun, Dec 09, 2001 at 04:51:14PM -0200, Marcelo Tosatti wrote:
|
|
| On Sat, 8 Dec 2001, Ken Brownfield wrote:
|
| > Just a quick followup to this, which is still a near show-stopper issue
| > for me.
| >
| > This is easy to reproduce for me if I run updatedb locally, and then run
| > updatedb on a remote machine that's scanning an NFS-mounted filesystem
| > from the original local machine. Instant kswapd saturation, especially
| > on large filesystems.
| >
| > Doing updatedb on NFS-mounted filesystems also seems to cause kswapd to
| > peg on the NFS-client side as well.
|
| Can you reproduce the problem without the over NFS updatedb?
|
| Thanks
|
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2001-12-10 6:57 UTC | newest]
Thread overview: 20+ messages
[not found] <200111191801.fAJI1l922388@neosilicon.transmeta.com>
2001-11-19 18:07 ` [VM] 2.4.14/15-pre4 too "swap-happy"? Linus Torvalds
2001-11-19 18:31 ` Ken Brownfield
2001-11-19 19:23 ` Linus Torvalds
2001-11-19 23:39 ` Ken Brownfield
2001-11-19 23:52 ` Linus Torvalds
2001-11-20 0:18 ` M. Edward (Ed) Borasky
2001-11-20 0:25 ` Ken Brownfield
2001-11-20 0:31 ` Linus Torvalds
2001-11-20 3:09 ` Ken Brownfield
2001-11-20 3:30 ` Linus Torvalds
2001-11-20 3:32 ` Andrea Arcangeli
2001-11-20 5:54 ` Ken Brownfield
2001-11-20 6:50 ` Linus Torvalds
2001-12-01 13:15 ` Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?) Ken Brownfield
2001-12-08 13:12 ` Ken Brownfield
2001-12-09 18:51 ` Marcelo Tosatti
2001-12-10 6:56 ` Ken Brownfield
2001-11-19 19:30 ` [VM] 2.4.14/15-pre4 too "swap-happy"? Ken Brownfield
2001-11-19 18:26 ` Marcelo Tosatti
2001-11-19 19:44 ` Slo Mo Snail