* 2.4.8preX VM problems
@ 2001-08-01  3:05 Andrew Tridgell
  2001-08-01  2:26 ` Marcelo Tosatti
  0 siblings, 1 reply; 20+ messages in thread

From: Andrew Tridgell @ 2001-08-01 3:05 UTC (permalink / raw)
To: linux-kernel

I've been testing the 2.4.8preX kernels on machines with fairly large
amounts of memory (greater than 1G) and have found them to have
disastrously bad performance through the buffer cache. If the machine
has 900M or less then it performs well, but above that the performance
drops through the floor (by about a factor of 600).

To see the effect use this:

  ftp://ftp.samba.org/pub/unpacked/junkcode/readfiles.c

and this:

  ftp://ftp.samba.org/pub/unpacked/junkcode/trd/

then do this:

  insmod dummy_disk.o dummy_size=80000000
  mknod /dev/ddisk b 241 0
  readfiles /dev/ddisk

"dummy_disk" is a dummy disk device (in this case it's 80G). All IOs
to the device succeed, but don't actually do anything. This makes it
easy to test very large disks on a small machine, and also eliminates
interactions with particular block devices. It also allows you to
unload the disk, which means you can easily start again with a clear
buffer cache. You can see exactly the same effect with a real device
if you would prefer not to load the dummy disk driver.

You will see that the speed is good for the first 800M then drops off
dramatically after that. Meanwhile, kswapd and kreclaimd go mad
chewing lots of cpu.

If you boot the machine with "mem=900M" then the problem goes away,
with the performance staying high. If you boot with 950M or above
then the throughput plummets once you have read more than 800M.
Here is a sample run with 2.4.8pre3:

[root@fraud trd]# ~/readfiles /dev/ddisk
211 MB 211.754 MB/sec
404 MB 192.866 MB/sec
579 MB 175.188 MB/sec
742 MB 163.017 MB/sec
794 MB 49.5844 MB/sec
795 MB 0.971527 MB/sec
796 MB 0.94948 MB/sec
797 MB 1.35205 MB/sec
799 MB 1.30931 MB/sec
800 MB 1.16104 MB/sec
801 MB 1.30607 MB/sec
803 MB 1.67914 MB/sec
804 MB 1.1175 MB/sec
805 MB 0.645805 MB/sec
806 MB 0.749738 MB/sec
806 MB 0.555384 MB/sec
807 MB 0.330456 MB/sec
807 MB 0.320096 MB/sec
807 MB 0.320502 MB/sec
808 MB 0.33026 MB/sec

and on a real disk:

[root@fraud trd]# ~/readfiles /dev/rd/c0d1p2
37 MB 37.5002 MB/sec
76 MB 38.8103 MB/sec
115 MB 38.8753 MB/sec
153 MB 37.6465 MB/sec
191 MB 38.223 MB/sec
229 MB 38.276 MB/sec
267 MB 38.3151 MB/sec
305 MB 37.3374 MB/sec
343 MB 37.6915 MB/sec
380 MB 37.7198 MB/sec
418 MB 37.5222 MB/sec
455 MB 37.1729 MB/sec
492 MB 37.2008 MB/sec
529 MB 36.2474 MB/sec
565 MB 36.7173 MB/sec
602 MB 36.6197 MB/sec
639 MB 36.5568 MB/sec
675 MB 36.4935 MB/sec
711 MB 36.1575 MB/sec
747 MB 36.0858 MB/sec
784 MB 36.1972 MB/sec
799 MB 15.1778 MB/sec
803 MB 4.11846 MB/sec
804 MB 1.33881 MB/sec
805 MB 0.927079 MB/sec
806 MB 0.790508 MB/sec
807 MB 0.679455 MB/sec
807 MB 0.316194 MB/sec
808 MB 0.305104 MB/sec
808 MB 0.317431 MB/sec

Interestingly, the 800M barrier is the same no matter how much memory
is in the machine (i.e. it's the same barrier for a machine with 2G as
for one with 1G).

So, anyone have any ideas?

I was prompted to do these tests when I saw kswapd and kreclaimd going
mad in large SPECsfs runs on a machine with 2G of memory. I suspect
that what is happening is that the meta-data throughput plummets
during the runs when the buffer cache reaches 800M in size. SPECsfs is
very meta-data intensive. Typical runs will create millions of files.

Cheers, Tridge

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  3:05 2.4.8preX VM problems Andrew Tridgell
@ 2001-08-01  2:26 ` Marcelo Tosatti
  2001-08-01  4:37   ` Andrew Tridgell
  2001-08-01  6:09   ` Andrew Tridgell
  0 siblings, 2 replies; 20+ messages in thread

From: Marcelo Tosatti @ 2001-08-01 2:26 UTC (permalink / raw)
To: Andrew Tridgell; +Cc: lkml

Andrew,

Can you reproduce the problem with 2.4.7 ?

On Tue, 31 Jul 2001, Andrew Tridgell wrote:

> I've been testing the 2.4.8preX kernels on machines with fairly large
> amounts of memory (greater than 1G) and have found them to have
> disasterously bad performance through the buffer cache. If the machine
> has 900M or less then it performs well, but above that the performance
> drops through the floor (by about a factor of 600).
>
> To see the effect use this:
>
> ftp://ftp.samba.org/pub/unpacked/junkcode/readfiles.c
>
> and this:
>
> ftp://ftp.samba.org/pub/unpacked/junkcode/trd/
>
> then do this:
>
> insmod dummy_disk.o dummy_size=80000000
> mknod /dev/ddisk b 241 0
> readfile /dev/ddisk
>
> "dummy_disk" is a dummy disk device (in this case iits 80G). All IOs
> to the device succeed, but don't actually do anything. This makes it
> easy to test very large disks on a small machine, and also eliminates
> interactions with particular block devices. It also allows you to
> unload the disk, which means you can easily start again with a clear
> buffer cache. You can see exactly the same effect with a real device
> if you would prefer not to load the dummy disk driver.
>
> You will see that the speed is good for the first 800M then drops off
> dramatically after that. Meanwhile, kswapd and kreclaimd go mad
> chewing lots of cpu.
>
> If you boot the machine with "mem=900M" then the problem goes away,
> with the performance staying high. If you boot with 950M or above
> then the throughput plummets once you have read more than 800M.
>
> Here is a sample run with 2.4.8pre3:
>
> [root@fraud trd]# ~/readfiles /dev/ddisk
> 211 MB 211.754 MB/sec
> 404 MB 192.866 MB/sec
> 579 MB 175.188 MB/sec
> 742 MB 163.017 MB/sec
> 794 MB 49.5844 MB/sec
> 795 MB 0.971527 MB/sec
> 796 MB 0.94948 MB/sec
> 797 MB 1.35205 MB/sec
> 799 MB 1.30931 MB/sec
> 800 MB 1.16104 MB/sec
> 801 MB 1.30607 MB/sec
> 803 MB 1.67914 MB/sec
> 804 MB 1.1175 MB/sec
> 805 MB 0.645805 MB/sec
> 806 MB 0.749738 MB/sec
> 806 MB 0.555384 MB/sec
> 807 MB 0.330456 MB/sec
> 807 MB 0.320096 MB/sec
> 807 MB 0.320502 MB/sec
> 808 MB 0.33026 MB/sec
>
> and on a real disk:
>
> [root@fraud trd]# ~/readfiles /dev/rd/c0d1p2
> 37 MB 37.5002 MB/sec
> 76 MB 38.8103 MB/sec
> 115 MB 38.8753 MB/sec
> 153 MB 37.6465 MB/sec
> 191 MB 38.223 MB/sec
> 229 MB 38.276 MB/sec
> 267 MB 38.3151 MB/sec
> 305 MB 37.3374 MB/sec
> 343 MB 37.6915 MB/sec
> 380 MB 37.7198 MB/sec
> 418 MB 37.5222 MB/sec
> 455 MB 37.1729 MB/sec
> 492 MB 37.2008 MB/sec
> 529 MB 36.2474 MB/sec
> 565 MB 36.7173 MB/sec
> 602 MB 36.6197 MB/sec
> 639 MB 36.5568 MB/sec
> 675 MB 36.4935 MB/sec
> 711 MB 36.1575 MB/sec
> 747 MB 36.0858 MB/sec
> 784 MB 36.1972 MB/sec
> 799 MB 15.1778 MB/sec
> 803 MB 4.11846 MB/sec
> 804 MB 1.33881 MB/sec
> 805 MB 0.927079 MB/sec
> 806 MB 0.790508 MB/sec
> 807 MB 0.679455 MB/sec
> 807 MB 0.316194 MB/sec
> 808 MB 0.305104 MB/sec
> 808 MB 0.317431 MB/sec
>
> Interestingly, the 800M barrier is the same no matter how much memory
> is in the machine (ie. its the same barrier for a machine with 2G as
> 1G).
>
> So, anyone have any ideas?
>
> I was prompted to do these tests when I saw kswapd and kreclaimd going
> mad in large SPECsfs runs on a machine with 2G of memory. I suspect
> that what is happening is that the meta data throughput plummets
> during the runs when the buffer cache reaches 800M in size. SPECsfs is
> very meta-data intensive. Typical runs will create millions of files.
>
> Cheers, Tridge
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  2:26 ` Marcelo Tosatti
@ 2001-08-01  4:37   ` Andrew Tridgell
  2001-08-01  3:32     ` Marcelo Tosatti
  2001-08-01  6:09   ` Andrew Tridgell
  1 sibling, 1 reply; 20+ messages in thread

From: Andrew Tridgell @ 2001-08-01 4:37 UTC (permalink / raw)
To: marcelo; +Cc: linux-kernel

Marcelo wrote:
> Can you reproduce the problem with 2.4.7 ?

no, it started with 2.4.8pre1. I am currently narrowing it down by
reverting pieces of that patch and I have successfully narrowed it
down to the changes in mm/vmscan.c. I have a 2.4.8pre3 kernel with a
hacked version of the 2.4.7 vmscan.c that doesn't show the problem.

I'll try to narrow it down a bit more this afternoon.

Cheers, Tridge

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  4:37 ` Andrew Tridgell
@ 2001-08-01  3:32   ` Marcelo Tosatti
  2001-08-01  5:43     ` Andrew Tridgell
  0 siblings, 1 reply; 20+ messages in thread

From: Marcelo Tosatti @ 2001-08-01 3:32 UTC (permalink / raw)
To: Andrew Tridgell; +Cc: lkml

On Tue, 31 Jul 2001, Andrew Tridgell wrote:

> Marcelo wrote:
> > Can you reproduce the problem with 2.4.7 ?
>
> no, it started with 2.4.8pre1. I am currently narrowing it down by
> reverting pieces of that patch and I have successfully narrowed it
> down to the changes in mm/vmscan.c. I have a 2.4.8pre3 kernel with a
> hacked version of the 2.4.7 vmscan.c that doesn't show the
> problem.

Could you please apply

  http://bazar.conectiva.com.br/~marcelo/patches/v2.4/2.4.7pre9/zoned.patch

on top of 2.4.7 and try to reproduce the problem ?

> I'll try to narrow it down a bit more this afternoon.

There are two possibilities:

1) The zoned approach code introduced in 2.4.8pre1 (that's why I asked
   you to apply the patch alone on top of 2.4.7).

2) The used-once code, also introduced in 2.4.8pre1.

It seems the problem only happens when the highmem zone is active,
right ?

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  3:32 ` Marcelo Tosatti
@ 2001-08-01  5:43   ` Andrew Tridgell
  0 siblings, 0 replies; 20+ messages in thread

From: Andrew Tridgell @ 2001-08-01 5:43 UTC (permalink / raw)
To: marcelo; +Cc: linux-kernel

Marcelo wrote:
> Could you please apply
> http://bazar.conectiva.com.br/~marcelo/patches/v2.4/2.4.7pre9/zoned.patch
> on top of 2.4.7 and try to reproduce the problem ?

yep, that's the culprit. Running an original 2.4.7 with the zoned
patch applied showed the same slowdowns as 2.4.8preX.

Looks like the zoned patch has a problem when the buffer cache grows
beyond 800M.

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  2:26 ` Marcelo Tosatti
  2001-08-01  4:37   ` Andrew Tridgell
@ 2001-08-01  6:09   ` Andrew Tridgell
  2001-08-01  6:10     ` Marcelo Tosatti
  1 sibling, 1 reply; 20+ messages in thread

From: Andrew Tridgell @ 2001-08-01 6:09 UTC (permalink / raw)
To: marcelo; +Cc: linux-kernel

Marcelo,

I've narrowed it down some more. If I apply the whole zone patch
except for this bit:

+		/*
+		 * If we are doing zone-specific laundering,
+		 * avoid touching pages from zones which do
+		 * not have a free shortage.
+		 */
+		if (zone && !zone_free_shortage(page->zone)) {
+			list_del(page_lru);
+			list_add(page_lru, &inactive_dirty_list);
+			continue;
+		}
+

then the behaviour is much better:

[root@fraud trd]# ~/readfiles /dev/ddisk
202 MB 202.125 MB/sec
394 MB 192.525 MB/sec
580 MB 185.487 MB/sec
755 MB 175.319 MB/sec
804 MB 41.3387 MB/sec
986 MB 182.5 MB/sec
1115 MB 114.862 MB/sec
1297 MB 182.276 MB/sec
1426 MB 128.983 MB/sec
1603 MB 164.939 MB/sec
1686 MB 82.9556 MB/sec
1866 MB 179.861 MB/sec
1930 MB 63.959 MB/sec

Even given that, the performance isn't exactly stunning. The
"dummy_disk" driver doesn't even do a memset or memcpy, so it should
really run at the full memory bandwidth of the machine. We are only
getting a fraction of that (it is a dual PIII/800 server). If I get
time I'll try some profiling.

I also notice that the system peaks at a maximum of just under 750M in
the buffer cache. The system has 1.2G of completely unused memory,
which I really expected to be consumed by something that is just
reading from a never-ending block device.
For example:

CPU0 states:  0.0% user, 67.1% system,  0.0% nice, 32.3% idle
CPU1 states:  0.0% user, 65.3% system,  0.0% nice, 34.1% idle
Mem:  2059660K av,  842712K used, 1216948K free, 0K shrd, 740816K buff
Swap: 1052216K av,       0K used, 1052216K free            9496K cached

  PID USER  PRI  NI  SIZE  RSS SHARE LC STAT %CPU %MEM  TIME COMMAND
  615 root   14   0   452  452   328  1 R    99.9  0.0  3:52 readfiles
    5 root    9   0     0    0     0  1 SW   31.3  0.0  1:03 kswapd
    6 root    9   0     0    0     0  0 SW    0.5  0.0  0:04 kreclaimd

I know this is a *long* way from a real-world benchmark, but I think
it is perhaps indicative of our buffer cache system getting a bit too
complex again :)

Cheers, Tridge

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  6:09 ` Andrew Tridgell
@ 2001-08-01  6:10   ` Marcelo Tosatti
  2001-08-01  8:13     ` Andrew Tridgell
  0 siblings, 1 reply; 20+ messages in thread

From: Marcelo Tosatti @ 2001-08-01 6:10 UTC (permalink / raw)
To: Andrew Tridgell; +Cc: linux-kernel

On Tue, 31 Jul 2001, Andrew Tridgell wrote:

> Marcelo,
>
> I've narrowed it down some more. If I apply the whole zone patch
> except for this bit:
>
> +		/*
> +		 * If we are doing zone-specific laundering,
> +		 * avoid touching pages from zones which do
> +		 * not have a free shortage.
> +		 */
> +		if (zone && !zone_free_shortage(page->zone)) {
> +			list_del(page_lru);
> +			list_add(page_lru, &inactive_dirty_list);
> +			continue;
> +		}
> +
>
> then the behaviour is much better:
>
> [root@fraud trd]# ~/readfiles /dev/ddisk
> 202 MB 202.125 MB/sec
> 394 MB 192.525 MB/sec
> 580 MB 185.487 MB/sec
> 755 MB 175.319 MB/sec
> 804 MB 41.3387 MB/sec
> 986 MB 182.5 MB/sec
> 1115 MB 114.862 MB/sec
> 1297 MB 182.276 MB/sec
> 1426 MB 128.983 MB/sec
> 1603 MB 164.939 MB/sec
> 1686 MB 82.9556 MB/sec
> 1866 MB 179.861 MB/sec
> 1930 MB 63.959 MB/sec
>
> Even given that, the performance isn't exactly stunning. The
> "dummy_disk" driver doesn't even do a memset or memcpy so it should
> really run at the full memory bandwidth of the machine. We are only
> getting a fraction of that (it is a dual PIII/800 server). If I get
> time I'll try some profiling.
>
> I also notice that the system peaks at a maximum of just under 750M in
> the buffer cache. The system has 1.2G of completely unused memory
> which I really expected to be consumed by something that is just
> reading from a never-ending block device.

That's expected: we cannot allocate buffercache pages on highmem.
> For example:
>
> CPU0 states:  0.0% user, 67.1% system,  0.0% nice, 32.3% idle
> CPU1 states:  0.0% user, 65.3% system,  0.0% nice, 34.1% idle
> Mem:  2059660K av,  842712K used, 1216948K free, 0K shrd, 740816K buff
> Swap: 1052216K av,       0K used, 1052216K free           9496K cached
>
>   PID USER  PRI  NI  SIZE  RSS SHARE LC STAT %CPU %MEM  TIME COMMAND
>   615 root   14   0   452  452   328  1 R    99.9  0.0  3:52 readfiles
>     5 root    9   0     0    0     0  1 SW   31.3  0.0  1:03 kswapd
>     6 root    9   0     0    0     0  0 SW    0.5  0.0  0:04 kreclaimd
>
> I know this is a *long* way from a real world benchmark, but I think
> it is perhaps indicative of our buffer cache system getting a bit too
> complex again :)

do_page_launder() stops the laundering loop (which frees pages), in
case it freed a buffercache page, as soon as there is no more global
free shortage (in the global scan case), or as soon as there is no
more free shortage for the specific zone we're scanning.

That's wrong: we should keep laundering pages if there is _any_ zone
under shortage.

Could you please try the patch below ? (against 2.4.8pre3)

--- linux.orig/mm/vmscan.c	Wed Aug  1 04:26:36 2001
+++ linux/mm/vmscan.c	Wed Aug  1 04:33:22 2001
@@ -593,13 +593,9 @@
 			 * If we're freeing buffer cache pages, stop when
 			 * we've got enough free memory.
 			 */
-			if (freed_page) {
-				if (zone) {
-					if (!zone_free_shortage(zone))
-						break;
-				} else if (!free_shortage())
-					break;
-			}
+			if (freed_page && !total_free_shortage())
+				break;
+
 			continue;
 		} else if (page->mapping && !PageDirty(page)) {
 			/*

^ permalink raw reply [flat|nested] 20+ messages in thread
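The total_free_shortage() helper the patch relies on is not shown in the
hunk. Its intended semantics — report a shortage while _any_ zone is still
short of free pages — can be modeled in miniature as follows. This is a
hypothetical toy model, not the kernel's actual implementation; the
zone_shortage array and NR_ZONES constant are invented stand-ins for the
real per-zone accounting:

```c
#include <assert.h>

#define NR_ZONES 3

/* Toy model of per-zone free shortages; in the kernel these would be
 * derived from zone->free_pages against the zone watermarks. */
static int zone_shortage[NR_ZONES];

static int zone_free_shortage(int zone)
{
	return zone_shortage[zone];
}

/* Guessed shape of total_free_shortage(): the sum over all zones, so it
 * stays non-zero while any single zone is still short. This is what lets
 * the laundering loop keep going for a starved lowmem zone even when the
 * global numbers look healthy. */
static int total_free_shortage(void)
{
	int sum = 0, i;

	for (i = 0; i < NR_ZONES; i++)
		sum += zone_free_shortage(i);
	return sum;
}
```

The contrast with the replaced code is the point: checking only the global
free_shortage() (or only the one zone being scanned) lets the loop stop
while another zone is still starved.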
* Re: 2.4.8preX VM problems
  2001-08-01  6:10 ` Marcelo Tosatti
@ 2001-08-01  8:13   ` Andrew Tridgell
  2001-08-01  8:13     ` Marcelo Tosatti
  0 siblings, 1 reply; 20+ messages in thread

From: Andrew Tridgell @ 2001-08-01 8:13 UTC (permalink / raw)
To: marcelo; +Cc: linux-kernel

Marcelo,

I'm afraid that didn't help. I get:

[root@skurk /root]# ./readfiles /dev/ddisk
362 MB 181.145 MB/sec
695 MB 166.455 MB/sec
811 MB 57.6077 MB/sec
812 MB 0.439532 MB/sec
813 MB 0.463901 MB/sec
814 MB 0.416093 MB/sec
815 MB 0.409958 MB/sec
816 MB 0.410413 MB/sec

> Could you please try the patch below ? (against 2.4.8pre3)
>
> --- linux.orig/mm/vmscan.c	Wed Aug  1 04:26:36 2001
> +++ linux/mm/vmscan.c	Wed Aug  1 04:33:22 2001
> @@ -593,13 +593,9 @@
>  			 * If we're freeing buffer cache pages, stop when
>  			 * we've got enough free memory.
>  			 */
> -			if (freed_page) {
> -				if (zone) {
> -					if (!zone_free_shortage(zone))
> -						break;
> -				} else if (!free_shortage())
> -					break;
> -			}
> +			if (freed_page && !total_free_shortage())
> +				break;
> +
>  			continue;
>  		} else if (page->mapping && !PageDirty(page)) {
>  			/*

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  8:13 ` Andrew Tridgell
@ 2001-08-01  8:13   ` Marcelo Tosatti
  2001-08-01 10:54     ` Andrew Tridgell
  2001-08-04  6:50     ` Anton Blanchard
  0 siblings, 2 replies; 20+ messages in thread

From: Marcelo Tosatti @ 2001-08-01 8:13 UTC (permalink / raw)
To: Andrew Tridgell; +Cc: lkml, Rik van Riel

Andrew,

The problem is pretty nasty: if there is no global shortage and only a
given zone with shortage, we set the zone free target to freepages.min
(basically no tasks can make progress with that amount of free
memory).

The following patch sets the zone free target to freepages.high. Can
you test it ? (I tried here and got the expected results)

Maybe pages_high is _too_ high to set the free target. We may want to
use pages_low for page freeing and pages_high for page writeout, or
something like that. (this way we keep the necessary amount of pages
to reach pages_high being written out)

I'll keep looking into this tomorrow. Going home now.

--- linux.orig/mm/page_alloc.c	Mon Jul 30 17:06:49 2001
+++ linux/mm/page_alloc.c	Wed Aug  1 06:21:35 2001
@@ -630,8 +630,8 @@
 		goto ret;
 
 	if (zone->inactive_clean_pages + zone->free_pages
-			< zone->pages_min) {
-		sum += zone->pages_min;
+			< zone->pages_high) {
+		sum += zone->pages_high;
 		sum -= zone->free_pages;
 		sum -= zone->inactive_clean_pages;
 	}

On Wed, 1 Aug 2001, Andrew Tridgell wrote:

> Marcelo,
>
> I'm afraid that didn't help. I get:
>
> [root@skurk /root]# ./readfiles /dev/ddisk
> 362 MB 181.145 MB/sec
> 695 MB 166.455 MB/sec
> 811 MB 57.6077 MB/sec
> 812 MB 0.439532 MB/sec
> 813 MB 0.463901 MB/sec
> 814 MB 0.416093 MB/sec
> 815 MB 0.409958 MB/sec
> 816 MB 0.410413 MB/sec
>
> > Could you please try the patch below ? (against 2.4.8pre3)
> >
> > --- linux.orig/mm/vmscan.c	Wed Aug  1 04:26:36 2001
> > +++ linux/mm/vmscan.c	Wed Aug  1 04:33:22 2001
> > @@ -593,13 +593,9 @@
> >  			 * If we're freeing buffer cache pages, stop when
> >  			 * we've got enough free memory.
> >  			 */
> > -			if (freed_page) {
> > -				if (zone) {
> > -					if (!zone_free_shortage(zone))
> > -						break;
> > -				} else if (!free_shortage())
> > -					break;
> > -			}
> > +			if (freed_page && !total_free_shortage())
> > +				break;
> > +
> >  			continue;
> >  		} else if (page->mapping && !PageDirty(page)) {
> >  			/*

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  8:13 ` Marcelo Tosatti
@ 2001-08-01 10:54   ` Andrew Tridgell
  2001-08-01 11:51     ` Mike Black
  1 sibling, 1 reply; 20+ messages in thread

From: Andrew Tridgell @ 2001-08-01 10:54 UTC (permalink / raw)
To: marcelo; +Cc: linux-kernel, riel

Marcelo,

> The following patch sets the zone free target to freepages.high. Can you
> test it ? (I tried here and got the expected results)

Running just that patch against 2.4.8pre3 gives:

[root@fraud /root]# ~/readfiles /dev/ddisk
198 MB 198.084 MB/sec
386 MB 188.634 MB/sec
570 MB 183.827 MB/sec
743 MB 172.5 MB/sec
810 MB 67.0501 MB/sec
862 MB 52.1381 MB/sec
901 MB 37.9501 MB/sec
957 MB 55.8253 MB/sec
998 MB 41.1541 MB/sec
1046 MB 48.1661 MB/sec
1088 MB 40.3898 MB/sec
1140 MB 50.8782 MB/sec
1183 MB 42.5749 MB/sec
1229 MB 46.1378 MB/sec
1275 MB 44.8515 MB/sec
1319 MB 43.5389 MB/sec
1368 MB 47.5747 MB/sec
1411 MB 42.8134 MB/sec

which is much better, but is pretty poor performance for a null
device. Running with that latest patch plus the patch you sent
previously gives roughly the same result.

Also, kswapd chews lots of cpu during these runs:

CPU0 states:  0.0% user, 79.0% system,  0.0% nice, 20.4% idle
CPU1 states:  0.2% user, 77.1% system,  0.0% nice, 22.1% idle
Mem:  2059088K av,  892256K used, 1166832K free, 0K shrd, 784972K buff
Swap: 1052216K av,       0K used, 1052216K free          10072K cached

  PID USER  PRI  NI  SIZE  RSS SHARE LC STAT %CPU %MEM  TIME COMMAND
  608 root   19   0   452  452   328  1 R    95.2  0.0  1:23 readfiles
    5 root   14   0     0    0     0  1 SW   58.3  0.0  0:52 kswapd
    6 root    9   0     0    0     0  1 RW    2.1  0.0  0:01 kreclaimd

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01 10:54 ` Andrew Tridgell
@ 2001-08-01 11:51   ` Mike Black
  2001-08-01 18:39     ` Daniel Phillips
  0 siblings, 1 reply; 20+ messages in thread

From: Mike Black @ 2001-08-01 11:51 UTC (permalink / raw)
To: tridge, marcelo; +Cc: linux-kernel, riel, Andrew Morton

This sounds a lot like the problem I've been having with ext3 and
raid. A one-thread tiobench performs just great. A two-thread
tiobench starts having lots of kswapd action when free memory gets
down to ~5Meg. ext3 exacerbates the problem. kswapd kicks up its
heels and starts grinding away (and NEVER swaps anything out). I've
been working this with Andrew Morton (the ext3 guy).

I have come to the opinion that kswapd needs to be a little smarter --
if it doesn't find anything to swap shouldn't it go to sleep a little
longer before trying again? That way it could gracefully degrade
itself when it's not making any progress.

In my testing (on a dual 1Ghz/2G machine) the machine "locks up" for
long periods of time while kswapd runs around trying to do its thing.
If I could disable kswapd I would, just to test this. I tried to
figure out how to lengthen the sleep time of kswapd but didn't have
time to chase it down (it wasn't intuitively obvious :-)

________________________________________
Michael D. Black  Principal Engineer
mblack@csihq.com  321-676-2923,x203
http://www.csihq.com  Computer Science Innovations
http://www.csihq.com/~mike  My home page
FAX 321-676-2355

----- Original Message -----
From: "Andrew Tridgell" <tridge@valinux.com>
To: <marcelo@conectiva.com.br>
Cc: <linux-kernel@vger.kernel.org>; <riel@conectiva.com.br>
Sent: Wednesday, August 01, 2001 6:54 AM
Subject: Re: 2.4.8preX VM problems

Marcelo,

> The following patch sets the zone free target to freepages.high. Can you
> test it ? (I tried here and got the expected results)

Running just that patch against 2.4.8pre3 gives:

[root@fraud /root]# ~/readfiles /dev/ddisk
198 MB 198.084 MB/sec
386 MB 188.634 MB/sec
570 MB 183.827 MB/sec
743 MB 172.5 MB/sec
810 MB 67.0501 MB/sec
862 MB 52.1381 MB/sec
901 MB 37.9501 MB/sec
957 MB 55.8253 MB/sec
998 MB 41.1541 MB/sec
1046 MB 48.1661 MB/sec
1088 MB 40.3898 MB/sec
1140 MB 50.8782 MB/sec
1183 MB 42.5749 MB/sec
1229 MB 46.1378 MB/sec
1275 MB 44.8515 MB/sec
1319 MB 43.5389 MB/sec
1368 MB 47.5747 MB/sec
1411 MB 42.8134 MB/sec

which is much better, but is pretty poor performance for a null
device. Running with that latest patch plus the patch you sent
previously gives roughly the same result.

Also, kswapd chews lots of cpu during these runs:

CPU0 states:  0.0% user, 79.0% system,  0.0% nice, 20.4% idle
CPU1 states:  0.2% user, 77.1% system,  0.0% nice, 22.1% idle
Mem:  2059088K av,  892256K used, 1166832K free, 0K shrd, 784972K buff
Swap: 1052216K av,       0K used, 1052216K free          10072K cached

  PID USER  PRI  NI  SIZE  RSS SHARE LC STAT %CPU %MEM  TIME COMMAND
  608 root   19   0   452  452   328  1 R    95.2  0.0  1:23 readfiles
    5 root   14   0     0    0     0  1 SW   58.3  0.0  0:52 kswapd
    6 root    9   0     0    0     0  1 RW    2.1  0.0  0:01 kreclaimd

^ permalink raw reply [flat|nested] 20+ messages in thread
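A back-off along the lines Mike suggests — sleep longer each time a scan
makes no progress, and reset on success — could be sketched as below. This
is a hypothetical illustration of the idea, not actual kernel code; the
next_sleep() helper and the tick values are invented:

```c
#include <assert.h>

#define MIN_SLEEP_TICKS 10	/* roughly HZ/10 on a 100 Hz kernel */
#define MAX_SLEEP_TICKS 100	/* cap the back-off at about a second */

/* Compute kswapd's next sleep interval: reset to the minimum whenever
 * the last scan freed pages, otherwise double the interval up to the
 * cap, so an unproductive kswapd backs off instead of spinning. */
static int next_sleep(int current_ticks, int pages_freed)
{
	if (pages_freed > 0)
		return MIN_SLEEP_TICKS;
	current_ticks *= 2;
	return current_ticks > MAX_SLEEP_TICKS ? MAX_SLEEP_TICKS
					       : current_ticks;
}
```

With this, a kswapd that repeatedly finds nothing reclaimable settles at
the one-second cap rather than burning cpu, which is the graceful
degradation Mike is asking for.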
* Re: 2.4.8preX VM problems
  2001-08-01 11:51 ` Mike Black
@ 2001-08-01 18:39   ` Daniel Phillips
  2001-08-11 12:06     ` Pavel Machek
  0 siblings, 1 reply; 20+ messages in thread

From: Daniel Phillips @ 2001-08-01 18:39 UTC (permalink / raw)
To: Mike Black, tridge, marcelo; +Cc: linux-kernel, riel, Andrew Morton

On Wednesday 01 August 2001 13:51, Mike Black wrote:
> I have come to the opinion that kswapd needs to be a little smarter
> -- if it doesn't find anything to swap shouldn't it go to sleep a
> little longer before trying again? That way it could gracefully
> degrade itself when it's not making any progress.
>
> In my testing (on a dual 1Ghz/2G machine) the machine "locks up" for
> long periods of time while kswapd runs around trying to do it's
> thing. If I could disable kswapd I would just to test this.

Your wish is my command. This patch provides a crude-but-effective
method of disabling kswapd, using:

  echo 1 >/proc/sys/kernel/disable_kswapd

I tested this with dbench and found it runs about half as fast, but
runs. This is reassuring because kswapd is supposed to be doing
something useful.
To apply:

  cd /usr/src/your.2.4.7.tree
  patch -p0 <this.patch

--- ../2.4.7.clean/include/linux/swap.h	Fri Jul 20 21:52:18 2001
+++ ./include/linux/swap.h	Wed Aug  1 19:35:27 2001
@@ -78,6 +78,7 @@
 	int next;			/* next entry on swap list */
 };
 
+extern int disable_kswapd;
 extern int nr_swap_pages;
 extern unsigned int nr_free_pages(void);
 extern unsigned int nr_inactive_clean_pages(void);
--- ../2.4.7.clean/include/linux/sysctl.h	Fri Jul 20 21:52:18 2001
+++ ./include/linux/sysctl.h	Wed Aug  1 19:35:28 2001
@@ -118,7 +118,8 @@
 	KERN_SHMPATH=48,	/* string: path to shm fs */
 	KERN_HOTPLUG=49,	/* string: path to hotplug policy agent */
 	KERN_IEEE_EMULATION_WARNINGS=50, /* int: unimplemented ieee instructions */
-	KERN_S390_USER_DEBUG_LOGGING=51  /* int: dumps of user faults */
+	KERN_S390_USER_DEBUG_LOGGING=51, /* int: dumps of user faults */
+	KERN_DISABLE_KSWAPD=52,	/* int: disable kswapd for testing */
 };
--- ../2.4.7.clean/kernel/sysctl.c	Thu Apr 12 21:20:31 2001
+++ ./kernel/sysctl.c	Wed Aug  1 19:35:28 2001
@@ -249,6 +249,8 @@
 	{KERN_S390_USER_DEBUG_LOGGING,"userprocess_debug",
 	 &sysctl_userprocess_debug,sizeof(int),0644,NULL,&proc_dointvec},
 #endif
+	{KERN_DISABLE_KSWAPD, "disable_kswapd", &disable_kswapd, sizeof (int),
+	 0644, NULL, &proc_dointvec},
 	{0}
 };
--- ../2.4.7.clean/mm/vmscan.c	Mon Jul  9 19:18:50 2001
+++ ./mm/vmscan.c	Wed Aug  1 19:35:28 2001
@@ -875,6 +875,8 @@
 DECLARE_WAIT_QUEUE_HEAD(kswapd_wait);
 DECLARE_WAIT_QUEUE_HEAD(kswapd_done);
 
+int disable_kswapd /* = 0 */;
+
 /*
  * The background pageout daemon, started as a kernel thread
  * from the init process.
@@ -915,6 +917,9 @@
 	 */
 	for (;;) {
 		static long recalc = 0;
+
+		while (disable_kswapd)
+			interruptible_sleep_on_timeout(&kswapd_wait, HZ/10);
 
 		/* If needed, try to free some memory. */
 		if (inactive_shortage() || free_shortage())

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01 18:39 ` Daniel Phillips
@ 2001-08-11 12:06   ` Pavel Machek
  2001-08-16 21:57     ` Daniel Phillips
  0 siblings, 1 reply; 20+ messages in thread

From: Pavel Machek @ 2001-08-11 12:06 UTC (permalink / raw)
To: Daniel Phillips
Cc: Mike Black, tridge, marcelo, linux-kernel, riel, Andrew Morton

Hi!

> > I have come to the opinion that kswapd needs to be a little smarter
> > -- if it doesn't find anything to swap shouldn't it go to sleep a
> > little longer before trying again? That way it could gracefully
> > degrade itself when it's not making any progress.
> >
> > In my testing (on a dual 1Ghz/2G machine) the machine "locks up" for
> > long periods of time while kswapd runs around trying to do it's
> > thing. If I could disable kswapd I would just to test this.
>
> Your wish is my command. This patch provides a crude-but-effective
> method of disabling kswapd, using:
>
>   echo 1 >/proc/sys/kernel/disable_kswapd
>
> I tested this with dbench and found it runs about half as fast, but
> runs. This is reassuring because kswapd is supposed to be doing
> something useful.

Why not just killall -STOP kswapd?

What is the expected state of the system without kswapd, BTW? Without
kflushd, I give up guaranteed time to get data safely to disk [and
it's useful for spindown]. What happens without kswapd?

								Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-11 12:06 ` Pavel Machek
@ 2001-08-16 21:57   ` Daniel Phillips
  0 siblings, 0 replies; 20+ messages in thread

From: Daniel Phillips @ 2001-08-16 21:57 UTC (permalink / raw)
To: Pavel Machek
Cc: Mike Black, tridge, marcelo, linux-kernel, riel, Andrew Morton

On August 11, 2001 02:06 pm, Pavel Machek wrote:
> > > I have come to the opinion that kswapd needs to be a little smarter
> > > -- if it doesn't find anything to swap shouldn't it go to sleep a
> > > little longer before trying again? That way it could gracefully
> > > degrade itself when it's not making any progress.
> > >
> > > In my testing (on a dual 1Ghz/2G machine) the machine "locks up" for
> > > long periods of time while kswapd runs around trying to do it's
> > > thing. If I could disable kswapd I would just to test this.
> >
> > Your wish is my command. This patch provides a crude-but-effective
> > method of disabling kswapd, using:
> >
> >   echo 1 >/proc/sys/kernel/disable_kswapd
> >
> > I tested this with dbench and found it runs about half as fast, but
> > runs. This is reassuring because kswapd is supposed to be doing
> > something useful.
>
> Why not just killall -STOP kswapd?

Because I didn't think of it, and I wanted some code for myself to do
real-time experimental tuning of the VM behaviour.

> What is expected state of system without kswapd, BTW? Without kflushd,
> I give up guaranteed time to get data safely to disk [and its usefull
> for spindown]. What happens without kswapd?

Without kswapd you lose much of the system's 'clean-ahead' performance
and it ends up reacting to try_to_free_pages calls initiated through
__alloc_pages when processes run out of free pages. This means lots
more synchronous waiting on page_launder and friends, but the system
still runs. It's a nice way to check how well the system's attempt to
anticipate demand is really working.

--
Daniel

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  8:13 ` Marcelo Tosatti
  2001-08-01 10:54   ` Andrew Tridgell
@ 2001-08-04  6:50   ` Anton Blanchard
  2001-08-04  5:55     ` Marcelo Tosatti
  1 sibling, 1 reply; 20+ messages in thread

From: Anton Blanchard @ 2001-08-04 6:50 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Andrew Tridgell, lkml, Rik van Riel

Hi Marcelo,

> The problem is pretty nasty: if there is no global shortage and only a
> given zone with shortage, we set the zone free target to freepages.min
> (basically no tasks can make progress with that amount of free memory).

Paulus and I were seeing the same problem on a ppc with 2.4.8-pre3. We
were doing cat > /dev/null of about 5G of data; when we had close to
3G of page cache, kswapd chewed up all the cpu. Our guess was that
there was a shortage of lowmem pages (everything above 512M is highmem
on the ppc32 kernel, so there isn't much lowmem).

The patch below allowed us to get close to 4G of page cache before
things slowed down again and kswapd took over.

Anton

> --- linux.orig/mm/page_alloc.c	Mon Jul 30 17:06:49 2001
> +++ linux/mm/page_alloc.c	Wed Aug  1 06:21:35 2001
> @@ -630,8 +630,8 @@
>  		goto ret;
>
>  	if (zone->inactive_clean_pages + zone->free_pages
> -			< zone->pages_min) {
> -		sum += zone->pages_min;
> +			< zone->pages_high) {
> +		sum += zone->pages_high;
>  		sum -= zone->free_pages;
>  		sum -= zone->inactive_clean_pages;
>  	}

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems 2001-08-04 6:50 ` Anton Blanchard @ 2001-08-04 5:55 ` Marcelo Tosatti 2001-08-04 17:17 ` Anton Blanchard 0 siblings, 1 reply; 20+ messages in thread From: Marcelo Tosatti @ 2001-08-04 5:55 UTC (permalink / raw) To: Anton Blanchard; +Cc: Andrew Tridgell, lkml, Rik van Riel On Sat, 4 Aug 2001, Anton Blanchard wrote: > > Hi Marcelo, > > > The problem is pretty nasty: if there is no global shortage and only a > > given zone with shortage, we set the zone free target to freepages.min > > (basically no tasks can make progress with that amount of free memory). > > Paulus and I were seeing the same problem on a ppc with 2.4.8-pre3. We > were doing cat > /dev/null of about 5G of data, when we had close to 3G of > page cache kswapd chewed up all the cpu. Our guess was that there was a > shortage of lowmem pages (everything above 512M is highmem on the ppc32 > kernel so there isnt much lowmem). > > The patch below allowed us to get close to 4G of page cache before > things slowed down again and kswapd took over. How much memory do you have on the box ? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems 2001-08-04 5:55 ` Marcelo Tosatti @ 2001-08-04 17:17 ` Anton Blanchard 2001-08-06 22:58 ` Marcelo Tosatti 0 siblings, 1 reply; 20+ messages in thread From: Anton Blanchard @ 2001-08-04 17:17 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Andrew Tridgell, lkml, Rik van Riel > > The patch below allowed us to get close to 4G of page cache before > > things slowed down again and kswapd took over. > > How much memory do you have on the box ? It has 15G, so 512M of lowmem and 14.5G of highmem. Anton ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems 2001-08-04 17:17 ` Anton Blanchard @ 2001-08-06 22:58 ` Marcelo Tosatti 2001-08-07 17:18 ` Anton Blanchard 0 siblings, 1 reply; 20+ messages in thread From: Marcelo Tosatti @ 2001-08-06 22:58 UTC (permalink / raw) To: Anton Blanchard; +Cc: Andrew Tridgell, lkml, Rik van Riel On Sat, 4 Aug 2001, Anton Blanchard wrote: > > > > The patch below allowed us to get close to 4G of page cache before > > > things slowed down again and kswapd took over. > > > > How much memory do you have on the box ? > > It has 15G, so 512M of lowmem and 14.5G of highmem. Can you please use readprofile to find out where kswapd is spending its time when you reach 4G of pagecache ? I've never seen kswapd burn CPU time except cases where a lot of memory is anonymous and there is a need for lots of swap space allocations. (scan_swap_map() is where kswapd spends "all" of its time in such workloads) ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems 2001-08-06 22:58 ` Marcelo Tosatti @ 2001-08-07 17:18 ` Anton Blanchard 2001-08-07 21:02 ` Kernel 2.4.6 & 2.4.7 networking performance: seeing serious delays in TCP layer depending upon packet length Ron Flory 0 siblings, 1 reply; 20+ messages in thread From: Anton Blanchard @ 2001-08-07 17:18 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Andrew Tridgell, lkml, Rik van Riel Hi Marcelo, > Can you please use readprofile to find out where kswapd is spending its > time when you reach 4G of pagecache ? > > I've never seen kswapd burn CPU time except cases where a lot of memory is > anonymous and there is a need for lots of swap space allocations. > (scan_swap_map() is where kswapd spends "all" of its time in such > workloads) I was doing a run with 512M lowmem and 2.5G highmem and found this:

__alloc_pages: 1-order allocation failed.
__alloc_pages: 1-order allocation failed.
__alloc_pages: 1-order allocation failed.

# cat /proc/meminfo
        total:      used:     free:  shared: buffers:    cached:
Mem:  3077513216 3067564032  9949184       0 13807616 2311172096
Swap: 2098176000          0 2098176000
MemTotal:      3005384 kB
MemFree:          9716 kB
MemShared:           0 kB
Buffers:         13484 kB
Cached:        2257004 kB
SwapCached:          0 kB
Active:         888916 kB
Inact_dirty:   1335552 kB
Inact_clean:     46020 kB
Inact_target:      316 kB
HighTotal:     2621440 kB
HighFree:         7528 kB
LowTotal:       383944 kB
LowFree:          2188 kB
SwapTotal:     2049000 kB
SwapFree:      2049000 kB

# readprofile | sort -nr | less
11967239 total                        7.3285
 7417874 idled                    45230.9390
  363813 do_page_launder            119.3612
  236764 ppc_irq_dispatch_handler   332.5337

I can split out the do_page_launder usage if you want. I had a quick look at the raw profile information and it appears that we are just looping a lot. Paulus and I moved the ppc32 kernel to load at 2G so we have 1.75G of lowmem. This has stopped the kswapd problem, but I thought the above information might be useful to you anyway. Anton ^ permalink raw reply [flat|nested] 20+ messages in thread
* Kernel 2.4.6 & 2.4.7 networking performance: seeing serious delays in TCP layer depending upon packet length 2001-08-07 17:18 ` Anton Blanchard @ 2001-08-07 21:02 ` Ron Flory 0 siblings, 0 replies; 20+ messages in thread From: Ron Flory @ 2001-08-07 21:02 UTC (permalink / raw) Cc: lkml, ron.flory This email contains 99 columns of information:

[1.] One line summary of the problem: Depending upon the length of TCP socket write requests, I'm seeing extremely long TCP delays, limiting throughput to only 25 datagrams per second. Data blocks above a 'magic' length have a much higher throughput than shorter blocks, which is counter-intuitive.

[2.] Full description of the problem/report: Compared to a normal 35 microseconds, I am seeing TCP stack delays on the order of 35 millisec when socket write lengths are BELOW a certain value. Depending upon whether I use an actual Ethernet or loopback device, this 'slow' packet length threshold is at 1447 bytes (Ethernet) or 16383 bytes (loopback). For some reason, the sending stack is delaying and/or waiting for an ACK between a 'short' write and the 'longer' TCP datagram that follows it. For all packets >= these magic lengths (1448/16384 respectively), the TCP stack does not wait for this ACK, and immediately sends the next datagram. For packets differing by only one byte in length, I see the packet rate jump from only 25 per second to several thousand per second. In the case of the loopback device, simply changing the packet length from 16383 to 16384 bytes results in a max transfer rate of 25 vs. 4250 packets per second. I think Alan will be interested in looking into this...

--------------- more (too much) background ----------------

To test various IP/ATM/FrameRelay network systems and interfaces, I wrote a small client/server app to pass as much network traffic as possible across a simulated customer network in our lab. The sequence is relatively simple:
- create server.
- create client (client contacts server, transfers data).
*Server:
    wait for client to request our max packet length (8 bytes)
    respond with our max packet length (8 bytes).
    while (1) {
      1. wait for client request packet payload (8 bytes).
      2. respond with ACK message indicating length (8 bytes).
      3. send all requested data to client (1..128 Kbytes).
    }

*Client:
    contact server, request max packet length (8 bytes)
    read response from server (8 bytes).
    while (1) {
      1. request payload packet from server (8 bytes).
      2. accept ACK packet (with length) from server (8 bytes).
      3. read all requested data from server (1..128 Kbytes).
    }

The test application can be provided if you don't have alternative (independent) test tools available.

[3.] Keywords (i.e., modules, networking, kernel): networking, sockets, performance, benchmark, kernel, TCP, IP

[4.] Kernel version (from /proc/version): Both Linux x86 2.4.6 and 2.4.7, non-patched sources downloaded from kernel.org, compiled locally.

[5.] Output of Oops.. message with symbolic information resolved (see Kernel Mailing List FAQ, Section 1.5): No oops, just delay...

[6.] A small shell script or example program which triggers the problem (if possible): application tar-archive available upon request. Not too small.

[7.] Environment: Local network, on a network switch, both Linux machines running RedHat 7.1 (x86). One machine is SCSI-based, the other is all-IDE. Disk subsystem not relevant.

[7.1.] Software (add the output of the ver_linux script here) [script is hosed]:
gcc 2.96 ('updated' versions from RH).
binutils 2.10.91.0.2
Linux C Lib 2.2.so ('updated' versions from RH).
libc.so.6 => /lib/i686/libc.so.6 (0x40021000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

[7.2.] Processor information (from /proc/cpuinfo): problem seen on several SMP/uniprocessor systems.

[7.3.] Module information (from /proc/modules): One machine uses Tulip network modules, the other uses 8139too. This problem is present in the loopback device as well, so physical drivers are probably not an issue.
SMC-Ultra is eth1 (not used in this case).

Module                  Size  Used by
smc-ultra               4992   1 (autoclean)
8390                    6928   0 (autoclean) [smc-ultra]
tulip                  39520   1 (autoclean)
st                     26640   0 (unused)

[7.4.] SCSI information (from /proc/scsi/scsi): N/A

[7.5.] Other information that might be relevant to the problem (please look in /proc and include all information that you think to be relevant):

[X.] Other notes, patches, fixes, workarounds: Protocol is TCP (SOCK_STREAM). NOTE: machine 172.22.112.101 (shown here as .101) is the server, .100 is the client.

Ethereal log of short/slow/bad packets (1447 bytes): (time is inter-packet)

 # Time      Src   Dst   Port Info
 1 0.000000  .100  .101  41359 > 4180 [SYN] Seq=4186558034 Ack=0 Win=5840 Len=0
 2 0.000083  .101  .100  4180 > 41359 [SYN, ACK] Seq=124141118 Ack=4186558035 Win=5792 Len=0
 3 0.000117  .100  .101  41359 > 4180 [ACK] Seq=4186558035 Ack=124141119 Win=5840 Len=0
 4 0.000061  .100  .101  41359 > 4180 [PSH, ACK] Seq=4186558035 Ack=124141119 Win=5840 Len=8
 5 0.000061  .101  .100  4180 > 41359 [ACK] Seq=124141119 Ack=4186558043 Win=5792 Len=0
 6 0.002295  .101  .100  4180 > 41359 [PSH, ACK] Seq=124141119 Ack=4186558043 Win=5792 Len=8
 7 0.000098  .100  .101  41359 > 4180 [ACK] Seq=4186558043 Ack=124141127 Win=5840 Len=0
 8 0.000111  .100  .101  41359 > 4180 [PSH, ACK] Seq=4186558043 Ack=124141127 Win=5840 Len=8
 9 0.000062  .101  .100  4180 > 41359 [PSH, ACK] Seq=124141127 Ack=4186558051 Win=5792 Len=8
10 0.036399  .100  .101  41359 > 4180 [ACK] Seq=4186558051 Ack=124141135 Win=5840 Len=0  <- DELAY
11 0.000074  .101  .100  4180 > 41359 [PSH, ACK] Seq=124141135 Ack=4186558051 Win=5792 Len=1447
12 0.000362  .100  .101  41359 > 4180 [ACK] Seq=4186558051 Ack=124142582 Win=8682 Len=0
13 0.000279  .100  .101  41359 > 4180 [FIN, ACK] Seq=4186558051 Ack=124142582 Win=8682 Len=0
14 0.000073  .101  .100  4180 > 41359 [FIN, ACK] Seq=124142582 Ack=4186558052 Win=5792 Len=0
15 0.000099  .100  .101  41359 > 4180 [ACK] Seq=4186558052 Ack=124142583 Win=8682 Len=0

Ethereal log of long/fast/good packets (1448 bytes):
(time is inter-packet)

 # Time      Src   Dst   Port Info
 1 0.000000  .100  .101  41363 > 4180 [SYN] Seq=4268751129 Ack=0 Win=5840 Len=0
 2 0.000077  .101  .100  4180 > 41363 [SYN, ACK] Seq=189815001 Ack=4268751130 Win=5792 Len=0
 3 0.000116  .100  .101  41363 > 4180 [ACK] Seq=4268751130 Ack=189815002 Win=5840 Len=0
 4 0.000060  .100  .101  41363 > 4180 [PSH, ACK] Seq=4268751130 Ack=189815002 Win=5840 Len=8
 5 0.000038  .101  .100  4180 > 41363 [ACK] Seq=189815002 Ack=4268751138 Win=5792 Len=0
 6 0.002467  .101  .100  4180 > 41363 [PSH, ACK] Seq=189815002 Ack=4268751138 Win=5792 Len=8
 7 0.000109  .100  .101  41363 > 4180 [ACK] Seq=4268751138 Ack=189815010 Win=5840 Len=0
 8 0.000122  .100  .101  41363 > 4180 [PSH, ACK] Seq=4268751138 Ack=189815010 Win=5840 Len=8
 9 0.000061  .101  .100  4180 > 41363 [PSH, ACK] Seq=189815010 Ack=4268751146 Win=5792 Len=8
10 0.000034  .101  .100  4180 > 41363 [ACK] Seq=189815018 Ack=4268751146 Win=5792 Len=1448
11 0.000351  .100  .101  41363 > 4180 [ACK] Seq=4268751146 Ack=189816466 Win=8688 Len=0
12 0.000520  .100  .101  41363 > 4180 [FIN, ACK] Seq=4268751146 Ack=189816466 Win=8688 Len=0
13 0.000065  .101  .100  4180 > 41363 [FIN, ACK] Seq=189816466 Ack=4268751147 Win=5792 Len=0
14 0.000095  .100  .101  41363 > 4180 [ACK] Seq=4268751147 Ack=189816467 Win=8688 Len=0

Line #9 in both logs above represents the server (.101) sending an 8-byte descriptor block back to the client (.100), informing it how much data will follow in the next 'write' (payload block). The payload 'write' immediately follows (no other delays or processing present). Of particular interest is line #10 of each log. For some reason, in the case of the 'short' packet, the server machine waits for a TCP ACK from the client machine (delaying 36 millisec), which does not happen if the packet is just 1 byte longer, in which case the server immediately follows the 8-byte block with the 1448-byte payload.
I realize there is probably some relationship with the 1500-byte MTU of Ethernet, but I do not see the relation with the 16k slow/fast boundary seen with the loopback device, or when the server/client are on the same machine. In this case, all packets < 16384 bytes wait approx 36 millisec for the client to ACK before sending the subsequent payload data block. Thanks - I hope this isn't a well-known issue that 'just has to be done this way'... ron ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2001-08-16 21:51 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-08-01  3:05 2.4.8preX VM problems Andrew Tridgell
2001-08-01  2:26 ` Marcelo Tosatti
2001-08-01  4:37 ` Andrew Tridgell
2001-08-01  3:32 ` Marcelo Tosatti
2001-08-01  5:43 ` Andrew Tridgell
2001-08-01  6:09 ` Andrew Tridgell
2001-08-01  6:10 ` Marcelo Tosatti
2001-08-01  8:13 ` Andrew Tridgell
2001-08-01  8:13 ` Marcelo Tosatti
2001-08-01 10:54 ` Andrew Tridgell
2001-08-01 11:51 ` Mike Black
2001-08-01 18:39 ` Daniel Phillips
2001-08-11 12:06 ` Pavel Machek
2001-08-16 21:57 ` Daniel Phillips
2001-08-04  6:50 ` Anton Blanchard
2001-08-04  5:55 ` Marcelo Tosatti
2001-08-04 17:17 ` Anton Blanchard
2001-08-06 22:58 ` Marcelo Tosatti
2001-08-07 17:18 ` Anton Blanchard
2001-08-07 21:02 ` Kernel 2.4.6 & 2.4.7 networking performance: seeing serious delays in TCP layer depending upon packet length Ron Flory
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox