Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

public inbox for linux-kernel@vger.kernel.org

* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
       [not found] <200111191801.fAJI1l922388@neosilicon.transmeta.com>
@ 2001-11-19 18:07 ` Linus Torvalds
  2001-11-19 18:31   ` Ken Brownfield
  2001-11-19 19:44   ` Slo Mo Snail
  0 siblings, 2 replies; 20+ messages in thread
From: Linus Torvalds @ 2001-11-19 18:07 UTC (permalink / raw)
  To: Sebastian Dröge; +Cc: linux-kernel

On Mon, 19 Nov 2001, Sebastian Dröge wrote:
> Hi,
> I couldn't answer earlier because I had some problems with my ISP.
> The heavy swapping problem while burning a CD is solved in pre6aa1,
> but if you want I can do some statistics tomorrow.

Well, pre6aa1 performs really badly exactly because it by default doesn't
swap enough even on _normal_ loads because Andrea is playing with some
tuning (and see the bad results of that tuning in the VM testing by
rwhron@earthlink.net).

So the pre6aa1 numbers are kind of suspect - lack of swapping may not be
due to fixing the problem, but due to bad tuning.

Does plain pre6 solve it? Plain pre6 has a fix where a locked shared
memory area would previously cause unnecessary swapping, and maybe the CD
burning buffer is using shmlock..

		Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
  2001-11-19 19:30     ` [VM] 2.4.14/15-pre4 too "swap-happy"? Ken Brownfield
@ 2001-11-19 18:26       ` Marcelo Tosatti
  0 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2001-11-19 18:26 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: linux-kernel



On Mon, 19 Nov 2001, Ken Brownfield wrote:

> Actually, I spoke too soon.  We developed a quick stress test that
> causes the problem immediately:
> 
>  11:18am  up 3 days,  1:36,  3 users,  load average: 8.72, 7.18, 3.96
> 91 processes: 85 sleeping, 6 running, 0 zombie, 0 stopped
> CPU states:  0.1% user, 93.4% system,  0.0% nice,  6.4% idle
> Mem:  3343688K av, 3340784K used,    2904K free,       0K shrd,     308K buff
> Swap: 1004052K av,  567404K used,  436648K free                 2994288K cached
> 
>   PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
> 12102 oracle    13   0 16320  15M 14868 R    5584 67.2  0.4  18:58 oracle
> 12365 oracle    18   5 39352  38M 37796 R N   30M 66.7  1.1   4:14 oracle
> 12353 oracle    18   5 39956  38M 38408 R N   31M 66.5  1.1   9:14 oracle
> 12191 root      13   0   892  852   672 R       0 66.4  0.0   6:09 top
> 12366 oracle     9   0   892  892   672 S       0 60.0  0.0   3:20 top
>     9 root       9   0     0    0     0 SW      0 49.0  0.0   9:27 kswapd
>    11 root       9   0     0    0     0 SW      0 38.3  0.0   3:58 kupdated
>   105 root       9   0     0    0     0 SW      0 28.8  0.0   4:56 kjournald
>   470 root       9   0   844  828   472 S       0 28.1  0.0   1:46 gamdrvd
> 12351 oracle    13   5 39956  38M 38408 S N   31M 25.6  1.1   3:08 oracle
>   669 oracle     9   0  4780 4780  4384 S     492 24.4  0.1   1:42 oracle
>     1 root      14   0   476  424   408 R       0 21.6  0.0   1:19 init
>     2 root      14   0     0    0     0 RW      0 20.8  0.0   1:29 keventd
>   615 oracle     9   0  8984 8984  8460 S    4380 16.3  0.2   2:41 oracle
>   388 root       9   0   732  728   592 S       0 11.5  0.0   0:17 syslogd
> 
> kswapd bounces up and down from 99%.

Ken,

Could you please check _where_ kswapd is spending its time?

(You can use kernel profiling and the "readprofile" tool to show us which
functions are consuming the most CPU cycles in the kernel.)



* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
  2001-11-19 18:07 ` [VM] 2.4.14/15-pre4 too "swap-happy"? Linus Torvalds
@ 2001-11-19 18:31   ` Ken Brownfield
  2001-11-19 19:23     ` Linus Torvalds
  2001-11-19 19:30     ` [VM] 2.4.14/15-pre4 too "swap-happy"? Ken Brownfield
  2001-11-19 19:44   ` Slo Mo Snail
  1 sibling, 2 replies; 20+ messages in thread
From: Ken Brownfield @ 2001-11-19 18:31 UTC (permalink / raw)
  To: linux-kernel

Linus, so far 2.4.15-pre4 with your patch does not reproduce the kswapd
issue with Oracle, but I do need to perform more deterministic tests
before I can fully sign off on that.

BTW, didn't your patch go into -pre5?  Or is there an additional mod in
-pre6 that we should try?
-- 
Ken.
brownfld@irridia.com

* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
  2001-11-19 18:31   ` Ken Brownfield
@ 2001-11-19 19:23     ` Linus Torvalds
  2001-11-19 23:39       ` Ken Brownfield
  2001-11-19 19:30     ` [VM] 2.4.14/15-pre4 too "swap-happy"? Ken Brownfield
  1 sibling, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2001-11-19 19:23 UTC (permalink / raw)
  To: linux-kernel

In article <20011119123125.B1439@asooo.flowerfire.com>,
Ken Brownfield  <brownfld@irridia.com> wrote:
>Linus, so far 2.4.15-pre4 with your patch does not reproduce the kswapd
>issue with Oracle, but I do need to perform more deterministic tests
>before I can fully sign off on that.
>
>BTW, didn't your patch go into -pre5?  Or is there an additional mod in
>-pre6 that we should try?

You're right, it's probably in pre5 already..

Anyway, it would be interesting to see if the patch by Andrea (I think
he called it "zone-watermarks") that changes the zone allocators to take
other zones into account makes a difference. See separate thread with
the subject line "15pre6aa1 (fixes google VM problem)". 

(I think the patch is overly complex as-is, but I think the _ideas_ in
it are fine).

			Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
  2001-11-19 18:31   ` Ken Brownfield
  2001-11-19 19:23     ` Linus Torvalds
@ 2001-11-19 19:30     ` Ken Brownfield
  2001-11-19 18:26       ` Marcelo Tosatti
  1 sibling, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-11-19 19:30 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: linux-kernel

Actually, I spoke too soon.  We developed a quick stress test that
causes the problem immediately:

 11:18am  up 3 days,  1:36,  3 users,  load average: 8.72, 7.18, 3.96
91 processes: 85 sleeping, 6 running, 0 zombie, 0 stopped
CPU states:  0.1% user, 93.4% system,  0.0% nice,  6.4% idle
Mem:  3343688K av, 3340784K used,    2904K free,       0K shrd,     308K buff
Swap: 1004052K av,  567404K used,  436648K free                 2994288K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
12102 oracle    13   0 16320  15M 14868 R    5584 67.2  0.4  18:58 oracle
12365 oracle    18   5 39352  38M 37796 R N   30M 66.7  1.1   4:14 oracle
12353 oracle    18   5 39956  38M 38408 R N   31M 66.5  1.1   9:14 oracle
12191 root      13   0   892  852   672 R       0 66.4  0.0   6:09 top
12366 oracle     9   0   892  892   672 S       0 60.0  0.0   3:20 top
    9 root       9   0     0    0     0 SW      0 49.0  0.0   9:27 kswapd
   11 root       9   0     0    0     0 SW      0 38.3  0.0   3:58 kupdated
  105 root       9   0     0    0     0 SW      0 28.8  0.0   4:56 kjournald
  470 root       9   0   844  828   472 S       0 28.1  0.0   1:46 gamdrvd
12351 oracle    13   5 39956  38M 38408 S N   31M 25.6  1.1   3:08 oracle
  669 oracle     9   0  4780 4780  4384 S     492 24.4  0.1   1:42 oracle
    1 root      14   0   476  424   408 R       0 21.6  0.0   1:19 init
    2 root      14   0     0    0     0 RW      0 20.8  0.0   1:29 keventd
  615 oracle     9   0  8984 8984  8460 S    4380 16.3  0.2   2:41 oracle
  388 root       9   0   732  728   592 S       0 11.5  0.0   0:17 syslogd

kswapd bounces up and down from 99%.

Keys for me are the full system time, the fact that the %CPUs seem to
add up to more than 6xCPUs (6-way Xeon), and that processes that aren't
really active show up as "active".

ASAP, I'll try -pre6 and then -aa1 to compare behavior.

The Oracle stress query looks like:

select /*+ parallel(mt,5) cache(mt) */ count(*) from mtable_units ;

Thanks much,
-- 
Ken.

* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
  2001-11-19 18:07 ` [VM] 2.4.14/15-pre4 too "swap-happy"? Linus Torvalds
  2001-11-19 18:31   ` Ken Brownfield
@ 2001-11-19 19:44   ` Slo Mo Snail
  1 sibling, 0 replies; 20+ messages in thread
From: Slo Mo Snail @ 2001-11-19 19:44 UTC (permalink / raw)
  To: linux-kernel, Linus Torvalds

On Monday, 19 November 2001 19:07, Linus Torvalds wrote:
> On Mon, 19 Nov 2001, Sebastian Dröge wrote:
> > Hi,
> > I couldn't answer earlier because I had some problems with my ISP.
> > The heavy swapping problem while burning a CD is solved in pre6aa1,
> > but if you want I can do some statistics tomorrow.
>
> Well, pre6aa1 performs really badly exactly because it by default doesn't
> swap enough even on _normal_ loads because Andrea is playing with some
> tuning (and see the bad results of that tuning in the VM testing by
> rwhron@earthlink.net).
>
> So the pre6aa1 numbers are kind of suspect - lack of swapping may not be
> due to fixing the problem, but due to bad tuning.
>
> Does plain pre6 solve it? Plain pre6 has a fix where a locked shared
> memory area would previously cause unnecessary swapping, and maybe the CD
> burning buffer is using shmlock..

Hi,
Yes, plain pre6 seems to solve it, too. I can't be sure right now because I
have recorded only 3 CDs while running pre6.
pre6 swaps more than aa1, but so far I have had no buffer underruns, and much
of the swap appears in SwapCached.
The interactive performance seems to be much better in pre6 than in aa1, so
I'll stay with pre6 ;)
Bye

* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
  2001-11-19 19:23     ` Linus Torvalds
@ 2001-11-19 23:39       ` Ken Brownfield
  2001-11-19 23:52         ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-11-19 23:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

I went straight to the aa patch, and it looks like it either fixes the
problem or (because of the side-effects Linus mentioned) otherwise
prevents the issue:

  2:30pm  up 11 min,  4 users,  load average: 2.23, 2.18, 1.17
106 processes: 104 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: 14.7% user, 10.3% system,  0.0% nice, 74.9% idle
Mem:  3342304K av, 3013888K used,  328416K free,       0K shrd,    1224K buff
Swap: 1004052K av,  276824K used,  727228K free                 2862112K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
  722 oracle    12   0 13364  12M 11856 S    9.9M 29.5  0.3   2:24 oracle
  731 oracle    17   0 13488  12M 11980 D     10M 28.7  0.3   2:27 oracle
  728 oracle    12   0 13048  12M 11540 R    9816 20.8  0.3   2:22 oracle
  718 oracle    12   0  154M 153M  152M S    150M 17.9  4.7   2:22 oracle
  725 oracle    14   0 13472  12M 11964 S     10M 17.9  0.3   2:20 oracle
  734 oracle    12   0 13936  13M 12432 S     10M 15.3  0.4   2:27 oracle
    9 root       9   0     0    0     0 SW      0  4.3  0.0   0:27 kswapd

The machine went into swap immediately when the page cache stopped
growing and hovered at 100-400MB.  Also, in my experience the page cache
will grow until there's only 5ishMB of free RAM, but with the aa patch
it looks like it stops at 320MB or maybe 10% of RAM.  Was that the aa
patch, or part of -pre6?

It would be nice if that number were modifiable via /proc (writable
freepages again? 10% seems a tad high for many boxes) but I think it's
better to have a bit more purely free RAM available than 5MB.

kswapd isn't going nuts, but it seems to still be eating quite a bit of
CPU given plenty of RAM.  And it seems to go pretty hard into swap -- I
would imagine that it's disadvantageous to do significant swapping
(based on age only?) in the presence of a massive page cache.  I would
imagine the performance hit of a 2GB vs. 3GB page cache would be less
egregious than the time and I/O kswapd is causing without memory
pressure.

The Oracle SGA is set to ~522MB, with nothing else running except a
couple of sshds, getty, etc.  Now that I'm looking, 2.8GB page cache
plus 328MB free adds up to about 3.1GB of RAM -- where does the 512MB
shared memory segment fit?  Is it being swapped out in deference to page
cache?

Just my USD$0.02.  I'll try vanilla -pre6 with profiling soon and post
results.  Thanks for the tip Marcelo.

Thanks,
-- 
Ken.
brownfld@irridia.com

* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
  2001-11-19 23:39       ` Ken Brownfield
@ 2001-11-19 23:52         ` Linus Torvalds
  2001-11-20  0:18           ` M. Edward (Ed) Borasky
                             ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Linus Torvalds @ 2001-11-19 23:52 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: linux-kernel, Andrea Arcangeli

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1887 bytes --]


On Mon, 19 Nov 2001, Ken Brownfield wrote:
>
> I went straight to the aa patch, and it looks like it either fixes the
> problem or (because of the side-effects Linus mentioned) otherwise
> prevents the issue:

So is this pre6aa1, or pre6 + just the watermark patch?

> The machine went into swap immediately when the page cache stopped
> growing and hovered at 100-400MB.  Also, in my experience the page cache
> will grow until there's only 5ishMB of free RAM, but with the aa patch
> it looks like it stops at 320MB or maybe 10% of RAM.  Was that the aa
> patch, or part of -pre6?

That was the watermarking. The way Andrea did it, the page cache will
basically refuse to touch as much of the "normal" page zone, because it
would prefer to allocate more from highmem..

I think it's excessive to have 320MB free memory, though, that's just
an insane waste. I suspect that the real number should be somewhere
between the old behaviour and the new one. You can tweak the behaviour of
andrea's kernel by changing the "reserved" page numbers, but I'd like to
hear whether my simpler approach works too..

> The Oracle SGA is set to ~522MB, with nothing else running except a
> couple of sshds, getty, etc.  Now that I'm looking, 2.8GB page cache
> plus 328MB free adds up to about 3.1GB of RAM -- where does the 512MB
> shared memory segment fit?  Is it being swapped out in deference to page
> cache?

Shared memory actually uses the page cache too, so it will be accounted
for in the 2.8GB number.

Anyway, can you try plain vanilla pre6, with the appended patch? This is
my suggested simplified version of what Andrea tried to do, and it should
try to keep only a few extra megs of memory free in the low memory
regions, not 300+ MB.

(and the profiling would be interesting regardless, but I think Andrea did
find the real problem, his fix just seems a bit of an overkill ;)

		Linus

[-- Attachment #2: Type: TEXT/PLAIN, Size: 1839 bytes --]

diff -u --recursive --new-file pre6/linux/mm/page_alloc.c linux/mm/page_alloc.c
--- pre6/linux/mm/page_alloc.c	Sat Nov 17 19:07:43 2001
+++ linux/mm/page_alloc.c	Mon Nov 19 15:13:36 2001
@@ -299,29 +299,26 @@
 	return page;
 }
 
-static inline unsigned long zone_free_pages(zone_t * zone, unsigned int order)
-{
-	long free = zone->free_pages - (1UL << order);
-	return free >= 0 ? free : 0;
-}
-
 /*
  * This is the 'heart' of the zoned buddy allocator:
  */
 struct page * __alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist)
 {
+	unsigned long min;
 	zone_t **zone, * classzone;
 	struct page * page;
 	int freed;
 
 	zone = zonelist->zones;
 	classzone = *zone;
+	min = 1UL << order;
 	for (;;) {
 		zone_t *z = *(zone++);
 		if (!z)
 			break;
 
-		if (zone_free_pages(z, order) > z->pages_low) {
+		min += z->pages_low;
+		if (z->free_pages > min) {
 			page = rmqueue(z, order);
 			if (page)
 				return page;
@@ -334,16 +331,18 @@
 		wake_up_interruptible(&kswapd_wait);
 
 	zone = zonelist->zones;
+	min = 1UL << order;
 	for (;;) {
-		unsigned long min;
+		unsigned long local_min;
 		zone_t *z = *(zone++);
 		if (!z)
 			break;
 
-		min = z->pages_min;
+		local_min = z->pages_min;
 		if (!(gfp_mask & __GFP_WAIT))
-			min >>= 2;
-		if (zone_free_pages(z, order) > min) {
+			local_min >>= 2;
+		min += local_min;
+		if (z->free_pages > min) {
 			page = rmqueue(z, order);
 			if (page)
 				return page;
@@ -376,12 +375,14 @@
 		return page;
 
 	zone = zonelist->zones;
+	min = 1UL << order;
 	for (;;) {
 		zone_t *z = *(zone++);
 		if (!z)
 			break;
 
-		if (zone_free_pages(z, order) > z->pages_min) {
+		min += z->pages_min;
+		if (z->free_pages > min) {
 			page = rmqueue(z, order);
 			if (page)
 				return page;
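
A minimal sketch of the first loop as changed by the patch above, using a
hypothetical simplified `struct zone_sketch` in place of the kernel's zone_t
and a hypothetical helper name `pick_zone`. It only illustrates how the
threshold accumulates across the fallback list; the real code calls rmqueue()
and goes on to wake kswapd, both of which are left out here.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's zone_t. */
struct zone_sketch {
    unsigned long free_pages;
    unsigned long pages_low;
};

/*
 * Walk a NULL-terminated fallback list (for example HIGHMEM, NORMAL, DMA)
 * and return the first zone whose free pages exceed the running threshold,
 * or NULL if no zone qualifies.
 */
struct zone_sketch *pick_zone(struct zone_sketch **zonelist, unsigned int order)
{
    unsigned long min = 1UL << order;   /* pages being requested */

    for (struct zone_sketch **zone = zonelist; *zone; zone++) {
        struct zone_sketch *z = *zone;

        min += z->pages_low;            /* accumulates across zones */
        if (z->free_pages > min)
            return z;
    }
    return NULL;
}
```

With an assumed fallback list of HIGHMEM, NORMAL, DMA in which the first two
zones each have pages_low = 510, an order-0 HIGHMEM request falls back to
NORMAL only when NORMAL holds more than 1 + 510 + 510 = 1021 free pages,
whereas the old per-zone check only asked for more than 511.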

* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
  2001-11-19 23:52         ` Linus Torvalds
@ 2001-11-20  0:18           ` M. Edward (Ed) Borasky
  2001-11-20  0:25           ` Ken Brownfield
  2001-11-20  3:09           ` Ken Brownfield
  2 siblings, 0 replies; 20+ messages in thread
From: M. Edward (Ed) Borasky @ 2001-11-20  0:18 UTC (permalink / raw)
  To: linux-kernel

On a related note, the files "/usr/src/linux/Documentation/filesystems/proc.txt"
and "sysctl/vm.txt" refer to some variables I need to be able to set on a
system running 2.4.12. In particular, I need to be able to get to the values
in "/proc/sys/vm/freepages", "/proc/sys/vm/buffermem" and
"/proc/sys/vm/pagecache". However, despite their existence in the documentation
files, these files don't exist on a 2.4.12 system. How can I read and set these
values on a 2.4.12 system?
--
znmeb@aracnet.com (M. Edward Borasky) http://www.aracnet.com/~znmeb
Relax! Run Your Own Brain with Neuro-Semantics!
http://www.meta-trading-coach.com

"Outside of a dog, a book is a man's best friend.  Inside a dog, it's
too dark to read." -- Marx


* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
  2001-11-19 23:52         ` Linus Torvalds
  2001-11-20  0:18           ` M. Edward (Ed) Borasky
@ 2001-11-20  0:25           ` Ken Brownfield
  2001-11-20  0:31             ` Linus Torvalds
  2001-11-20  3:09           ` Ken Brownfield
  2 siblings, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-11-20  0:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Andrea Arcangeli

On Mon, Nov 19, 2001 at 03:52:44PM -0800, Linus Torvalds wrote:
| 
| On Mon, 19 Nov 2001, Ken Brownfield wrote:
| >
| > I went straight to the aa patch, and it looks like it either fixes the
| > problem or (because of the side-effects Linus mentioned) otherwise
| > prevents the issue:
| 
| So is this pre6aa1, or pre6 + just the watermark patch?

I'm currently using -pre6 with his separately-posted zone-watermark-1
patch.  Sorry, I should have been clearer.

| > The machine went into swap immediately when the page cache stopped
| > growing and hovered at 100-400MB.  Also, in my experience the page cache
| > will grow until there's only 5ishMB of free RAM, but with the aa patch
| > it looks like it stops at 320MB or maybe 10% of RAM.  Was that the aa
| > patch, or part of -pre6?
| 
| That was the watermarking. The way Andrea did it, the page cache will
| basically refuse to touch as much of the "normal" page zone, because it
| would prefer to allocate more from highmem..
| 
| I think it's excessive to have 320MB free memory, though, that's just
| an insane waste. I suspect that the real number should be somewhere
| between the old behaviour and the new one. You can tweak the behaviour of
| andrea's kernel by changing the "reserved" page numbers, but I'd like to
| hear whether my simpler approach works too..

Yeah, maybe a tiered default would be best, IMHO.  5MB on a 3GB box
does, on the other hand, seem anemic.

| > The Oracle SGA is set to ~522MB, with nothing else running except a
| > couple of sshds, getty, etc.  Now that I'm looking, 2.8GB page cache
| > plus 328MB free adds up to about 3.1GB of RAM -- where does the 512MB
| > shared memory segment fit?  Is it being swapped out in deference to page
| > cache?
| 
| Shared memory actually uses the page cache too, so it will be accounted
| for in the 2.8GB number.

My bad, should have realized.

| Anyway, can you try plain vanilla pre6, with the appended patch? This is
| my suggested simplified version of what Andrea tried to do, and it should
| try to keep only a few extra megs of memory free in the low memory
| regions, not 300+ MB.
| 
| (and the profiling would be interesting regardless, but I think Andrea did
| find the real problem, his fix just seems a bit of an overkill ;)
| 
| 		Linus

I'll try this patch ASAP.

Thanks a LOT to all involved,
-- 
Ken.
brownfld@irridia.com

* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
  2001-11-20  0:25           ` Ken Brownfield
@ 2001-11-20  0:31             ` Linus Torvalds
  0 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2001-11-20  0:31 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: linux-kernel, Andrea Arcangeli

On Mon, 19 Nov 2001, Ken Brownfield wrote:
> |
> | So is this pre6aa1, or pre6 + just the watermark patch?
>
> I'm currently using -pre6 with his separately-posted zone-watermark-1
> patch.  Sorry, I should have been clearer.

Good. That removes the other variables from the equation, ie it's not an
effect of some of the other tweaking in the -aa patches.

> Yeah, maybe a tiered default would be best, IMHO.  5MB on a 3GB box
> does, on the other hand, seem anemic.

Yeah, the 5MB _is_ anemic. It comes from the fact that we decide to never
bother having more than zone_balance_max[] pages free, even if we have
tons of memory. And zone_balance_max[] is fairly small, it limits us to
255 free pages per zone (for pages_min - with "pages_low" being twice that).
So you get 3 zones, with 255*2 pages free max each, except the DMA zone
has much less just because it's smaller. Thus 5MB.

There's no real reason for having zone_balance_max[] at all - without it
we'd just always try to keep about 1/128th of memory free, which would be
about 24MB on a 3GB box. Which is probably not a bad idea.

With my "simplified-Andrea" patch, you should see slightly more than 5MB
free, but not a lot more. A HIGHMEM allocation now wants to leave an
"extra" 510 pages in NORMAL, and even more in the DMA zone, so you should
see something like maybe 12-15 MB free instead of 300MB.

(Wild hand-waving number, I'm too lazy to actually do the math, and I
haven't even tested that the simple patch works at all - I think I forgot
to mention that small detail ;)
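
A rough worked version of the arithmetic above, as a sketch only. The 1/128
ratio and the 255-page cap come from the message; the 4 KB page size and the
helper names are assumptions for illustration, and any lower bound the kernel
may apply to pages_min is ignored.

```c
/* Values taken from the message above. */
#define BALANCE_RATIO 128UL   /* keep about 1/128th of a zone free */
#define BALANCE_MAX   255UL   /* cap on pages_min per zone */

/* Assumption: 4 KB pages, as on i386. */
#define PAGE_KB 4UL

/* pages_min for one zone, with the cap applied. */
unsigned long capped_pages_min(unsigned long zone_pages)
{
    unsigned long min = zone_pages / BALANCE_RATIO;

    return min > BALANCE_MAX ? BALANCE_MAX : min;
}

/* pages_low is described as twice pages_min. */
unsigned long capped_pages_low(unsigned long zone_pages)
{
    return 2 * capped_pages_min(zone_pages);
}

/* Sum of pages_low over all zones, in KB. */
unsigned long total_low_kb(const unsigned long *zone_pages, int nzones)
{
    unsigned long total = 0;

    for (int i = 0; i < nzones; i++)
        total += capped_pages_low(zone_pages[i]);
    return total * PAGE_KB;
}

/* What 1/128th of all memory would be without the cap, in KB. */
unsigned long uncapped_min_kb(const unsigned long *zone_pages, int nzones)
{
    unsigned long pages = 0;

    for (int i = 0; i < nzones; i++)
        pages += zone_pages[i];
    return pages / BALANCE_RATIO * PAGE_KB;
}
```

Assuming zones of 4096 pages (16 MB DMA), 225280 pages (880 MB NORMAL) and
557056 pages (the rest of 3 GB as HIGHMEM), total_low_kb() gives 4336 KB, a
little over 4 MB and in the neighbourhood of the 5MB figure, while
uncapped_min_kb() gives 24576 KB, i.e. the 24MB mentioned above.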

		Linus

* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
  2001-11-19 23:52         ` Linus Torvalds
  2001-11-20  0:18           ` M. Edward (Ed) Borasky
  2001-11-20  0:25           ` Ken Brownfield
@ 2001-11-20  3:09           ` Ken Brownfield
  2001-11-20  3:30             ` Linus Torvalds
  2001-11-20  3:32             ` Andrea Arcangeli
  2 siblings, 2 replies; 20+ messages in thread
From: Ken Brownfield @ 2001-11-20  3:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Andrea Arcangeli

Well, I think you'll be pleased to hear that your untested patch
compiled, booted, _and_ fixed the problem. :)

The minimum free RAM was about 9.8-11MB (matching your guesstimate) and
kswapd seemed to behave the same as the watermark patch.  The results of
top were basically the same, so I'm omitting it.

However, I do have some profiling numbers, thanks to Marcelo.  Attached
are numbers from "readprofile | sort -nr +2 | head -20".  I think the
pre4 numbers point to shrink_cache, prune_icache, and statm_pgd_range.
The other two might have significance for wizards, but statistically
don't stand out to me, except maybe statm_pgd_range.
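
As background for reading the listings below: readprofile prints the tick
count, the function name, and a normalized load, which its man page describes
as the ratio of ticks to the length of the function. A small sketch (the
helper name `implied_length` is hypothetical) that recovers the function
length implied by a row:

```c
/*
 * Hypothetical helper: given a readprofile row's tick count and its
 * normalized load (ticks / function length), recover the function
 * length implied by that row, rounded to the nearest integer.
 */
unsigned long implied_length(unsigned long ticks, double load)
{
    return (unsigned long)((double)ticks / load + 0.5);
}
```

For the pre4 row "101562 shrink_cache 113.8587" this gives 892, and for
"164536 default_idle 3164.1538" it gives 52, which is why the tiny idle loop
tops any listing sorted on the third column.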

I reset the counters just before starting Oracle and the stress test.  I
think a -pre7 with a blessed patch would be good, since my testing was
very narrow.

I'll test new kernels as I hear new info.

Thanks much!
-- 
Ken.
brownfld@irridia.com

2.4.15-pre4 with your original patch:
(shorter time period since the machine went to hell fast)
(matches vanilla behaviour)

164536 default_idle                             3164.1538
101562 shrink_cache                             113.8587
  3683 prune_icache                              13.5404
  3034 file_read_actor                           12.2339
   914 DAC960_BA_InterruptHandler                 5.5732
  1128 statm_pgd_range                            2.9072
    40 page_cache_release                         0.8333
    31 add_page_to_hash_queue                     0.5167
    89 page_cache_read                            0.4363
    25 remove_inode_page                          0.4167
    26 unlock_page                                0.3095
   509 __make_request                             0.3008
    66 smp_call_function                          0.2946
    21 set_bh_page                                0.2917
     9 __brelse                                   0.2812
    90 try_to_free_buffers                        0.2778
    13 mark_page_accessed                         0.2708
     8 __free_pages                               0.2500
    43 get_hash_table                             0.2443
    42 activate_page                              0.2234

2.4.15-pre6 with watermark patch:

1617446 default_idle                             31104.7308
 27599 DAC960_BA_InterruptHandler               168.2866
 38918 file_read_actor                          156.9274
   528 page_cache_release                        11.0000
   554 add_page_to_hash_queue                     9.2333
 15487 __make_request                             9.1531
  3453 statm_pgd_range                            8.8995
   514 remove_inode_page                          8.5667
  1453 blk_init_free_list                         7.2650
   377 set_bh_page                                5.2361
   898 page_cache_read                            4.4020
   590 add_to_page_cache_unique                   4.3382
   136 __brelse                                   4.2500
  1120 kmem_cache_alloc                           3.8356
   628 kunmap_high                                3.7381
  1189 try_to_free_buffers                        3.6698
   625 get_hash_table                             3.5511
   439 lru_cache_add                              3.4297
  1715 rmqueue                                    3.0194
   105 remove_wait_queue                          2.9167

2.4.15-pre6 with Linus patch:

1249875 default_idle                             24036.0577
 65324 file_read_actor                          263.4032
 36979 DAC960_BA_InterruptHandler               225.4817
  9809 statm_pgd_range                           25.2809
  1039 page_cache_release                        21.6458
   994 add_page_to_hash_queue                    16.5667
   922 remove_inode_page                         15.3667
  2409 blk_init_free_list                        12.0450
 20159 __make_request                            11.9143
  1198 lru_cache_add                              9.3594
  1628 page_cache_read                            7.9804
   987 add_to_page_cache_unique                   7.2574
  2202 try_to_free_buffers                        6.7963
  1038 get_unused_buffer_head                     6.6538
   484 unlock_page                                5.7619
  3182 rmqueue                                    5.6021
   874 kunmap_high                                5.2024
   164 __brelse                                   5.1250
   900 get_hash_table                             5.1136
   357 set_bh_page                                4.9583

On Mon, Nov 19, 2001 at 03:52:44PM -0800, Linus Torvalds wrote:
| 
| On Mon, 19 Nov 2001, Ken Brownfield wrote:
| >
| > I went straight to the aa patch, and it looks like it either fixes the
| > problem or (because of the side-effects Linus mentioned) otherwise
| > prevents the issue:
| 
| So is this pre6aa1, or pre6 + just the watermark patch?
| 
| > The machine went into swap immediately when the page cache stopped
| > growing and hovered at 100-400MB.  Also, in my experience the page cache
| > will grow until there's only 5ishMB of free RAM, but with the aa patch
| > it looks like it stops at 320MB or maybe 10% of RAM.  Was that the aa
| > patch, or part of -pre6?
| 
| That was the watermarking. The way Andrea did it, the page cache will
| basically refuse to touch as much of the "normal" page zone, because it
| would prefer to allocate more from highmem..
| 
| I think it's excessive to have 320MB free memory, though, that's just
| an insane waste. I suspect that the real number should be somewhere
| between the old behaviour and the new one. You can tweak the behaviour of
| andrea's kernel by changing the "reserved" page numbers, but I'd like to
| hear whether my simpler approach works too..
| 
| > The Oracle SGA is set to ~522MB, with nothing else running except a
| > couple of sshds, getty, etc.  Now that I'm looking, 2.8GB page cache
| > plus 328MB free adds up to about 3.1GB of RAM -- where does the 512MB
| > shared memory segment fit?  Is it being swapped out in deference to page
| > cache?
| 
| Shared memory actually uses the page cache too, so it will be accounted
| for in the 2.8GB number.
| 
| Anyway, can you try plain vanilla pre6, with the appended patch? This is
| my suggested simplified version of what Andrea tried to do, and it should
| try to keep only a few extra megs of memory free in the low memory
| regions, not 300+ MB.
| 
| (and the profiling would be interesting regardless, but I think Andrea did
| find the real problem, his fix just seems a bit of an overkill ;)
| 
| 		Linus

| diff -u --recursive --new-file pre6/linux/mm/page_alloc.c linux/mm/page_alloc.c
| --- pre6/linux/mm/page_alloc.c	Sat Nov 17 19:07:43 2001
| +++ linux/mm/page_alloc.c	Mon Nov 19 15:13:36 2001
| @@ -299,29 +299,26 @@
|  	return page;
|  }
|  
| -static inline unsigned long zone_free_pages(zone_t * zone, unsigned int order)
| -{
| -	long free = zone->free_pages - (1UL << order);
| -	return free >= 0 ? free : 0;
| -}
| -
|  /*
|   * This is the 'heart' of the zoned buddy allocator:
|   */
|  struct page * __alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist)
|  {
| +	unsigned long min;
|  	zone_t **zone, * classzone;
|  	struct page * page;
|  	int freed;
|  
|  	zone = zonelist->zones;
|  	classzone = *zone;
| +	min = 1UL << order;
|  	for (;;) {
|  		zone_t *z = *(zone++);
|  		if (!z)
|  			break;
|  
| -		if (zone_free_pages(z, order) > z->pages_low) {
| +		min += z->pages_low;
| +		if (z->free_pages > min) {
|  			page = rmqueue(z, order);
|  			if (page)
|  				return page;
| @@ -334,16 +331,18 @@
|  		wake_up_interruptible(&kswapd_wait);
|  
|  	zone = zonelist->zones;
| +	min = 1UL << order;
|  	for (;;) {
| -		unsigned long min;
| +		unsigned long local_min;
|  		zone_t *z = *(zone++);
|  		if (!z)
|  			break;
|  
| -		min = z->pages_min;
| +		local_min = z->pages_min;
|  		if (!(gfp_mask & __GFP_WAIT))
| -			min >>= 2;
| -		if (zone_free_pages(z, order) > min) {
| +			local_min >>= 2;
| +		min += local_min;
| +		if (z->free_pages > min) {
|  			page = rmqueue(z, order);
|  			if (page)
|  				return page;
| @@ -376,12 +375,14 @@
|  		return page;
|  
|  	zone = zonelist->zones;
| +	min = 1UL << order;
|  	for (;;) {
|  		zone_t *z = *(zone++);
|  		if (!z)
|  			break;
|  
| -		if (zone_free_pages(z, order) > z->pages_min) {
| +		min += z->pages_min;
| +		if (z->free_pages > min) {
|  			page = rmqueue(z, order);
|  			if (page)
|  				return page;
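
The quoted patch replaces each per-zone watermark test with a cumulative one: `min` starts at `1UL << order` and grows by every zone's watermark as the allocator walks the zonelist. Below is a minimal user-space sketch of the first (`pages_low`) loop only, before and after the patch. `struct zone_sim`, `pick_zone()`, `pick_zone_old()`, and all numbers are hypothetical stand-ins for illustration, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for zone_t: only the fields the check reads. */
struct zone_sim {
    unsigned long free_pages;
    unsigned long pages_low;
};

/*
 * Post-patch check: walk a NULL-terminated zonelist and return the index
 * of the first zone whose free pages exceed the accumulated minimum,
 * or -1 if no zone qualifies.
 */
int pick_zone(struct zone_sim **zonelist, unsigned int order)
{
    unsigned long min = 1UL << order;
    int i;

    assert(zonelist != NULL);
    for (i = 0; zonelist[i] != NULL; i++) {
        struct zone_sim *z = zonelist[i];

        /* Each zone adds its own watermark on top of the earlier ones. */
        min += z->pages_low;
        if (z->free_pages > min)
            return i;
    }
    return -1;
}

/* Pre-patch check: every zone is judged against its own watermark only. */
int pick_zone_old(struct zone_sim **zonelist, unsigned int order)
{
    int i;

    assert(zonelist != NULL);
    for (i = 0; zonelist[i] != NULL; i++) {
        struct zone_sim *z = zonelist[i];
        long free = (long)z->free_pages - (long)(1UL << order);
        unsigned long avail = free >= 0 ? (unsigned long)free : 0;

        if (avail > z->pages_low)
            return i;
    }
    return -1;
}
```

With a first zone at 100 free pages and a second at 400, both with `pages_low` of 255, the old check falls through to the second zone, while the cumulative check rejects it until it holds more than 511 free pages. That appears to be how the patch keeps a few extra megabytes free in the lower zones.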

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
  2001-11-20  3:09           ` Ken Brownfield
@ 2001-11-20  3:30             ` Linus Torvalds
  2001-11-20  3:32             ` Andrea Arcangeli
  1 sibling, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2001-11-20  3:30 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: linux-kernel, Andrea Arcangeli

On Mon, 19 Nov 2001, Ken Brownfield wrote:
>
> Well, I think you'll be pleased to hear that your untested patch
> compiled, booted, _and_ fixed the problem. :)

Good. The patch itself was fairly simple, and the problem was
straightforward; the real credit for the fix goes to Andrea for thinking
about what was wrong with the old code..

> The minimum free RAM was about 9.8-11MB (matching your guestimate) and
> kswapd seemed to behave the same as the watermark patch.  The results of
> top were basically the same, so I'm omitting it.

All right. I think 10MB free for a 3GB machine is good - and we can easily
tweak the zone_balance_max[] numbers if somebody comes to the conclusion
that it's better to have more free. It's about .3% of RAM, so it's small
enough that it's certainly not too much, and yet at the same time it's
probably enough to give reasonable behaviour in a temporary memory crunch.

> However, I do have some profiling numbers, thanks to Marcelo.  Attached
> are numbers from "readprofile | sort -nr +2 | head -20".  I think the
> pre4 numbers point to shrink_cache, prune_icache, and statm_pgd_range.
> The other two might have significance for wizards, but statistically
> don't stand out to me, except maybe statm_pgd_range.

I'd say that this clearly shows that yes, 2.4.14 did the wrong thing, and
wasted time in shrink_cache() without making any real progress. The two
other profiles look reasonable to me - nothing stands out that shouldn't.

(yeah, we spend _much_ too much time doing VM statistics with "top", and
the only way to get rid of that would be to add a per-vma "rss" field.
Which might not be a bad idea, but it's not a high priority for me).
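
To illustrate the trade-off behind a per-vma "rss" field: the statm_pgd_range entries in the profiles come from recounting resident pages by walking page tables each time the statistics are read, whereas a counter kept up to date at fault and eviction time would make the read constant-time. The sketch below is a user-space toy under that assumption; `struct vma_sim` and its helpers are hypothetical names and do not reflect any actual kernel structure.

```c
#include <assert.h>
#include <stddef.h>

#define VMA_SIM_PAGES 1024

/* Hypothetical toy mapping: one presence flag per page plus a counter. */
struct vma_sim {
    unsigned char present[VMA_SIM_PAGES];
    unsigned long rss;  /* maintained incrementally */
};

/* Mark a page resident; the counter changes only on a real transition. */
void vma_sim_fault_in(struct vma_sim *vma, size_t idx)
{
    assert(idx < VMA_SIM_PAGES);
    if (!vma->present[idx]) {
        vma->present[idx] = 1;
        vma->rss++;
    }
}

/* Mark a page non-resident, again adjusting the counter on transition. */
void vma_sim_evict(struct vma_sim *vma, size_t idx)
{
    assert(idx < VMA_SIM_PAGES);
    if (vma->present[idx]) {
        vma->present[idx] = 0;
        vma->rss--;
    }
}

/* The expensive alternative: recount by scanning every page slot. */
unsigned long vma_sim_rss_by_walk(const struct vma_sim *vma)
{
    unsigned long count = 0;
    size_t i;

    for (i = 0; i < VMA_SIM_PAGES; i++)
        count += vma->present[i];
    return count;
}
```

Reading `vma->rss` costs the same regardless of mapping size, while `vma_sim_rss_by_walk()` grows with it; the price is a little bookkeeping on every fault and eviction.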

> I reset the counters just before starting Oracle and the stress test.  I
> think a -pre7 with a blessed patch would be good, since my testing was
> very narrow.

Sure, I'll do a pre7. This closes my last behaviour issue with the VM,
although I'm sure we'll end up spending tons of time chasing bugs still
(both VM and not).

		Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
  2001-11-20  3:09           ` Ken Brownfield
  2001-11-20  3:30             ` Linus Torvalds
@ 2001-11-20  3:32             ` Andrea Arcangeli
  2001-11-20  5:54               ` Ken Brownfield
  1 sibling, 1 reply; 20+ messages in thread
From: Andrea Arcangeli @ 2001-11-20  3:32 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: Linus Torvalds, linux-kernel

On Mon, Nov 19, 2001 at 09:09:41PM -0600, Ken Brownfield wrote:
> Well, I think you'll be pleased to hear that your untested patch
> compiled, booted, _and_ fixed the problem. :)

Can you try to run an updatedb constantly in background?

Andrea

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
  2001-11-20  3:32             ` Andrea Arcangeli
@ 2001-11-20  5:54               ` Ken Brownfield
  2001-11-20  6:50                 ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-11-20  5:54 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel

kswapd goes up to 5-10% CPU (vs 3-6) but it finishes without issue or
apparent interactivity problems.  I'm keeping it in while( 1 ), but it's
been predictable so far.

3-10 is a lot better than 99, but is kswapd really going to eat that
much CPU in an essentially allocation-less state?

But certainly you found the right thing.

Thx all!
-- 
Ken.
brownfld@irridia.com

On Tue, Nov 20, 2001 at 04:32:23AM +0100, Andrea Arcangeli wrote:
| On Mon, Nov 19, 2001 at 09:09:41PM -0600, Ken Brownfield wrote:
| > Well, I think you'll be pleased to hear that your untested patch
| > compiled, booted, _and_ fixed the problem. :)
| 
| Can you try to run an updatedb constantly in background?
| 
| Andrea
| -
| To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
| the body of a message to majordomo@vger.kernel.org
| More majordomo info at  http://vger.kernel.org/majordomo-info.html
| Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [VM] 2.4.14/15-pre4 too "swap-happy"?
  2001-11-20  5:54               ` Ken Brownfield
@ 2001-11-20  6:50                 ` Linus Torvalds
  2001-12-01 13:15                   ` Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?) Ken Brownfield
  0 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2001-11-20  6:50 UTC (permalink / raw)
  To: linux-kernel

In article <20011119235422.F10597@asooo.flowerfire.com>,
Ken Brownfield  <brownfld@irridia.com> wrote:
>kswapd goes up to 5-10% CPU (vs 3-6) but it finishes without issue or
>apparent interactivity problems.  I'm keeping it in while( 1 ), but it's
>been predictable so far.
>
>3-10 is a lot better than 99, but is kswapd really going to eat that
>much CPU in an essentially allocation-less state?

Well, it's obviously not allocation-less: updatedb will really hit on
the dcache and icache (which are both in the NORMAL zone only, which is
why Andrea asked for it), and obviously your Oracle load itself seems to
be happily paging stuff around, which causes a lot of allocations for
page-ins. 

It only _looks_ static, because once you find the proper "balance", the
VM numbers themselves shouldn't change under a constant load.

We could make kswapd use less CPU time, of course, simply by making the
actual working processes do more of the work to free memory.  The total
work ends up being the same, though, and the advantage of kswapd is that
it tends to make the freeing slightly more asynchronous, which helps
throughput. 

The _disadvantage_ of kswapd is that if it goes crazy and uses up all
CPU time, you get bad results ;)

But it doesn't sound crazy in your load.  I'd be happier if the VM took
less CPU, of course, but for now we seem to be doing ok.

		Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)
  2001-11-20  6:50                 ` Linus Torvalds
@ 2001-12-01 13:15                   ` Ken Brownfield
  2001-12-08 13:12                     ` Ken Brownfield
  0 siblings, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-12-01 13:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

When updatedb kicked off on my 2.4.16 6-way Xeon 4GB box this morning, I
had an unfortunate flashback:

  5:02am  up 2 days, 1 min, 59 users,  load average: 5.66, 4.86, 3.60
741 processes: 723 sleeping, 4 running, 0 zombie, 14 stopped
CPU states:  0.2% user, 77.3% system,  0.0% nice, 22.3% idle
Mem:  3351664K av, 3346504K used,    5160K free,       0K shrd,  498048K buff
Swap: 1052248K av,  282608K used,  769640K free                 2531892K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
 2117 root      15   5   580  580   408 R N     0 99.9  0.0  17:19 updatedb
 2635 kb        12   0  1696 1556  1216 R       0 99.9  0.0   4:16 smbd
 2672 root      17  10  4212 4212   492 D N     0 94.7  0.1   1:39 rsync
 2609 root       2 -20  1284 1284   672 R <     0 81.2  0.0   4:02 top
    9 root       9   0     0    0     0 SW      0 80.7  0.0  42:50 kswapd
22879 kb         9   0 11548 6316  1684 S       0 11.8  0.1   7:33 smbd

Under varied load I'm not seeing the kswapd issue, but it looks like
updatedb combined with one or two samba transfers does still reproduce
the problem easily, and adding rsync or NFS transfers to the mix makes
kswapd peg at 99%.

I noticed because I was trying to do kernel patches and compiles using a
partition NFS-mounted from this machine.  I guess it sometimes pays to
be up at 5am...

Unfortunately it's difficult for me to reboot this machine to update the
kernel (59 users) but I will try to reproduce the problem on a separate
machine this weekend or early next week.  And I don't have profiling on,
so that will have to wait as well. :-(

Andrea, do you have a patch vs. 2.4.16 of your original solution to this
problem that I could test out?  I'd rather just change one thing at a
time rather than switching completely to an -aa kernel.

Grrrr!

Thanks much,
-- 
Ken.
brownfld@irridia.com

On Tue, Nov 20, 2001 at 06:50:50AM +0000, Linus Torvalds wrote:
| In article <20011119235422.F10597@asooo.flowerfire.com>,
| Ken Brownfield  <brownfld@irridia.com> wrote:
| >kswapd goes up to 5-10% CPU (vs 3-6) but it finishes without issue or
| >apparent interactivity problems.  I'm keeping it in while( 1 ), but it's
| >been predictable so far.
| >
| >3-10 is a lot better than 99, but is kswapd really going to eat that
| >much CPU in an essentially allocation-less state?
| 
| Well, it's obviously not allocation-less: updatedb will really hit on
| the dcache and icache (which are both in the NORMAL zone only, which is
| why Andrea asked for it), and obviously your Oracle load itself seems to
| be happily paging stuff around, which causes a lot of allocations for
| page-ins. 
| 
| It only _looks_ static, because once you find the proper "balance", the
| VM numbers themselves shouldn't change under a constant load.
| 
| We could make kswapd use less CPU time, of course, simply by making the
| actual working processes do more of the work to free memory.  The total
| work ends up being the same, though, and the advantage of kswapd is that
| it tends to make the freeing slightly more asynchronous, which helps
| throughput. 
| 
| The _disadvantage_ of kswapd is that if it goes crazy and uses up all
| CPU time, you get bad results ;)
| 
| But it doesn't sound crazy in your load.  I'd be happier if the VM took
| less CPU, of course, but for now we seem to be doing ok.
| 
| 		Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)
  2001-12-01 13:15                   ` Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?) Ken Brownfield
@ 2001-12-08 13:12                     ` Ken Brownfield
  2001-12-09 18:51                       ` Marcelo Tosatti
  0 siblings, 1 reply; 20+ messages in thread
From: Ken Brownfield @ 2001-12-08 13:12 UTC (permalink / raw)
  To: linux-kernel

Just a quick followup to this, which is still a near show-stopper issue
for me.

This is easy to reproduce for me if I run updatedb locally, and then run
updatedb on a remote machine that's scanning an NFS-mounted filesystem
from the original local machine.  Instant kswapd saturation, especially
on large filesystems.

Doing updatedb on NFS-mounted filesystems also seems to cause kswapd to
peg on the NFS-client side as well.

I recently realized that slocate (at least on RH6.2 w/ 2.4 kernels) does
not seem to properly detect NFS when provided "-f nfs"...  Urgh.

Also something I noticed in /proc/slabinfo (other info below):

inode_cache       369188 1027256    480 59716 128407    1 :  124   62
dentry_cache      256380 705510    128 14946 23517    1 :  252  126
buffer_head        46961  47800     96 1195 1195    1 :  252  126

That seems like a TON of {dentry,inode}_cache on a 1GB (HIMEM) machine.
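
The slab lines above can be turned into rough memory figures. The sketch below assumes the 2.4 slabinfo column order (active objects, total objects, object size, active slabs, total slabs, pages per slab) and a 4 KiB page size, as on i386; the helper names are hypothetical.

```c
#include <assert.h>

#define SIM_PAGE_KIB 4UL  /* assumed page size in KiB (i386) */

/* Memory held by one slab cache, in KiB: slabs * pages per slab * page size. */
unsigned long slab_cache_kib(unsigned long num_slabs,
                             unsigned long pages_per_slab)
{
    return num_slabs * pages_per_slab * SIM_PAGE_KIB;
}

/* Objects packed into each slab, useful as a sanity check on the columns. */
unsigned long slab_objs_per_slab(unsigned long num_objs,
                                 unsigned long num_slabs)
{
    assert(num_slabs > 0);
    return num_objs / num_slabs;
}
```

Under those assumptions, inode_cache (128407 slabs) comes to 513628 KiB and dentry_cache (23517 slabs) to 94068 KiB, about 593 MiB together, or roughly two thirds of the 898300 kB LowTotal reported below.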

I'd try 10_vm-19 but it doesn't apply cleanly for me.

Thanks for any input or ports of 10_vm-19 to 2.4.17-pre6. ;)
-- 
Ken.
brownfld@irridia.com

        total:    used:    free:  shared: buffers:  cached:
Mem:  1054011392 900526080 153485312        0 67829760 174866432
Swap: 2149548032   581632 2148966400
MemTotal:      1029308 kB
MemFree:        149888 kB
MemShared:           0 kB
Buffers:         66240 kB
Cached:         170376 kB
SwapCached:        392 kB
Active:         202008 kB
Inactive:        40380 kB
HighTotal:      131008 kB
HighFree:        30604 kB
LowTotal:       898300 kB
LowFree:        119284 kB
SwapTotal:     2099168 kB
SwapFree:      2098600 kB

Mem:  1029308K av,  886144K used,  143164K free,       0K shrd,   66240K buff
Swap: 2099168K av,     568K used, 2098600K free                  170872K cached

On Sat, Dec 01, 2001 at 07:15:02AM -0600, Ken Brownfield wrote:
| When updatedb kicked off on my 2.4.16 6-way Xeon 4GB box this morning, I
| had an unfortunate flashback:
| 
|   5:02am  up 2 days, 1 min, 59 users,  load average: 5.66, 4.86, 3.60
| 741 processes: 723 sleeping, 4 running, 0 zombie, 14 stopped
| CPU states:  0.2% user, 77.3% system,  0.0% nice, 22.3% idle
| Mem:  3351664K av, 3346504K used,    5160K free,       0K shrd,  498048K buff
| Swap: 1052248K av,  282608K used,  769640K free                 2531892K cached
| 
|   PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
|  2117 root      15   5   580  580   408 R N     0 99.9  0.0  17:19 updatedb
|  2635 kb        12   0  1696 1556  1216 R       0 99.9  0.0   4:16 smbd
|  2672 root      17  10  4212 4212   492 D N     0 94.7  0.1   1:39 rsync
|  2609 root       2 -20  1284 1284   672 R <     0 81.2  0.0   4:02 top
|     9 root       9   0     0    0     0 SW      0 80.7  0.0  42:50 kswapd
| 22879 kb         9   0 11548 6316  1684 S       0 11.8  0.1   7:33 smbd
| 
| Under varied load I'm not seeing the kswapd issue, but it looks like
| updatedb combined with one or two samba transfers does still reproduce
| the problem easily, and adding rsync or NFS transfers to the mix makes
| kswapd peg at 99%.
| 
| I noticed because I was trying to do kernel patches and compiles using a
| partition NFS-mounted from this machine.  I guess it sometimes pays to
| be up at 5am...
| 
| Unfortunately it's difficult for me to reboot this machine to update the
| kernel (59 users) but I will try to reproduce the problem on a separate
| machine this weekend or early next week.  And I don't have profiling on,
| so that will have to wait as well. :-(
| 
| Andrea, do you have a patch vs. 2.4.16 of your original solution to this
| problem that I could test out?  I'd rather just change one thing at a
| time rather than switching completely to an -aa kernel.
| 
| Grrrr!
| 
| Thanks much,
| -- 
| Ken.
| brownfld@irridia.com
| 
| 
| On Tue, Nov 20, 2001 at 06:50:50AM +0000, Linus Torvalds wrote:
| | In article <20011119235422.F10597@asooo.flowerfire.com>,
| | Ken Brownfield  <brownfld@irridia.com> wrote:
| | >kswapd goes up to 5-10% CPU (vs 3-6) but it finishes without issue or
| | >apparent interactivity problems.  I'm keeping it in while( 1 ), but it's
| | >been predictable so far.
| | >
| | >3-10 is a lot better than 99, but is kswapd really going to eat that
| | >much CPU in an essentially allocation-less state?
| | 
| | Well, it's obviously not allocation-less: updatedb will really hit on
| | the dcache and icache (which are both in the NORMAL zone only, which is
| | why Andrea asked for it), and obviously your Oracle load itself seems to
| | be happily paging stuff around, which causes a lot of allocations for
| | page-ins. 
| | 
| | It only _looks_ static, because once you find the proper "balance", the
| | VM numbers themselves shouldn't change under a constant load.
| | 
| | We could make kswapd use less CPU time, of course, simply by making the
| | actual working processes do more of the work to free memory.  The total
| | work ends up being the same, though, and the advantage of kswapd is that
| | it tends to make the freeing slightly more asynchronous, which helps
| | throughput. 
| | 
| | The _disadvantage_ of kswapd is that if it goes crazy and uses up all
| | CPU time, you get bad results ;)
| | 
| | But it doesn't sound crazy in your load.  I'd be happier if the VM took
| | less CPU, of course, but for now we seem to be doing ok.
| | 
| | 		Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)
  2001-12-08 13:12                     ` Ken Brownfield
@ 2001-12-09 18:51                       ` Marcelo Tosatti
  2001-12-10  6:56                         ` Ken Brownfield
  0 siblings, 1 reply; 20+ messages in thread
From: Marcelo Tosatti @ 2001-12-09 18:51 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: linux-kernel



On Sat, 8 Dec 2001, Ken Brownfield wrote:

> Just a quick followup to this, which is still a near show-stopper issue
> for me.
> 
> This is easy to reproduce for me if I run updatedb locally, and then run
> updatedb on a remote machine that's scanning an NFS-mounted filesystem
> from the original local machine.  Instant kswapd saturation, especially
> on large filesystems.
> 
> Doing updatedb on NFS-mounted filesystems also seems to cause kswapd to
> peg on the NFS-client side as well.

Can you reproduce the problem without running updatedb over NFS?

Thanks 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)
  2001-12-09 18:51                       ` Marcelo Tosatti
@ 2001-12-10  6:56                         ` Ken Brownfield
  0 siblings, 0 replies; 20+ messages in thread
From: Ken Brownfield @ 2001-12-10  6:56 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

Yes, any kind of fairly heavy, spread-out I/O combined with updatedb
will do the trick, like samba.  NFS isn't required, it just seems to be
a particularly good trigger.

Actually, it seems that anything that hits the inode/dentry caches hard
will trigger it, and it doesn't always happen when freepages (or its
2.4.x equivalent) has been hit.  I had a little applet that malloc'ed
and memcpy'ed 1GB of RAM and exited, but that doesn't help the way it
did before 2.4.15-pre[56].

It also happens for me a lot more with my 4GB machines, though I have
seen it on my 1GB HIGHMEM boxes as well.  If the problem is related to
scanning the cache, perhaps more RAM simply makes it worse.

I'm planning on trying Andrew Morton's patches as soon as I'm able.

Thanks,
-- 
Ken.
brownfld@irridia.com

On Sun, Dec 09, 2001 at 04:51:14PM -0200, Marcelo Tosatti wrote:
| 
| 
| On Sat, 8 Dec 2001, Ken Brownfield wrote:
| 
| > Just a quick followup to this, which is still a near show-stopper issue
| > for me.
| > 
| > This is easy to reproduce for me if I run updatedb locally, and then run
| > updatedb on a remote machine that's scanning an NFS-mounted filesystem
| > from the original local machine.  Instant kswapd saturation, especially
| > on large filesystems.
| > 
| > Doing updatedb on NFS-mounted filesystems also seems to cause kswapd to
| > peg on the NFS-client side as well.
| 
| Can you reproduce the problem without the over NFS updatedb? 
| 
| Thanks 
| 

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2001-12-10  6:57 UTC | newest]

Thread overview: 20+ messages
     [not found] <200111191801.fAJI1l922388@neosilicon.transmeta.com>
2001-11-19 18:07 ` [VM] 2.4.14/15-pre4 too "swap-happy"? Linus Torvalds
2001-11-19 18:31   ` Ken Brownfield
2001-11-19 19:23     ` Linus Torvalds
2001-11-19 23:39       ` Ken Brownfield
2001-11-19 23:52         ` Linus Torvalds
2001-11-20  0:18           ` M. Edward (Ed) Borasky
2001-11-20  0:25           ` Ken Brownfield
2001-11-20  0:31             ` Linus Torvalds
2001-11-20  3:09           ` Ken Brownfield
2001-11-20  3:30             ` Linus Torvalds
2001-11-20  3:32             ` Andrea Arcangeli
2001-11-20  5:54               ` Ken Brownfield
2001-11-20  6:50                 ` Linus Torvalds
2001-12-01 13:15                   ` Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?) Ken Brownfield
2001-12-08 13:12                     ` Ken Brownfield
2001-12-09 18:51                       ` Marcelo Tosatti
2001-12-10  6:56                         ` Ken Brownfield
2001-11-19 19:30     ` [VM] 2.4.14/15-pre4 too "swap-happy"? Ken Brownfield
2001-11-19 18:26       ` Marcelo Tosatti
2001-11-19 19:44   ` Slo Mo Snail
