* 2.4.8preX VM problems
@ 2001-08-01  3:05 Andrew Tridgell
  2001-08-01  2:26 ` Marcelo Tosatti
  0 siblings, 1 reply; 20+ messages in thread

From: Andrew Tridgell @ 2001-08-01 3:05 UTC (permalink / raw)
To: linux-kernel

I've been testing the 2.4.8preX kernels on machines with fairly large
amounts of memory (greater than 1G) and have found them to have
disastrously bad performance through the buffer cache. If the machine
has 900M or less then it performs well, but above that the performance
drops through the floor (by about a factor of 600).

To see the effect use this:

  ftp://ftp.samba.org/pub/unpacked/junkcode/readfiles.c

and this:

  ftp://ftp.samba.org/pub/unpacked/junkcode/trd/

then do this:

  insmod dummy_disk.o dummy_size=80000000
  mknod /dev/ddisk b 241 0
  readfiles /dev/ddisk

"dummy_disk" is a dummy disk device (in this case it's 80G). All IOs
to the device succeed, but don't actually do anything. This makes it
easy to test very large disks on a small machine, and also eliminates
interactions with particular block devices. It also allows you to
unload the disk, which means you can easily start again with a clear
buffer cache. You can see exactly the same effect with a real device
if you would prefer not to load the dummy disk driver.

You will see that the speed is good for the first 800M then drops off
dramatically after that. Meanwhile, kswapd and kreclaimd go mad
chewing lots of cpu.

If you boot the machine with "mem=900M" then the problem goes away,
with the performance staying high. If you boot with 950M or above
then the throughput plummets once you have read more than 800M.
Here is a sample run with 2.4.8pre3:

[root@fraud trd]# ~/readfiles /dev/ddisk
211 MB 211.754 MB/sec
404 MB 192.866 MB/sec
579 MB 175.188 MB/sec
742 MB 163.017 MB/sec
794 MB 49.5844 MB/sec
795 MB 0.971527 MB/sec
796 MB 0.94948 MB/sec
797 MB 1.35205 MB/sec
799 MB 1.30931 MB/sec
800 MB 1.16104 MB/sec
801 MB 1.30607 MB/sec
803 MB 1.67914 MB/sec
804 MB 1.1175 MB/sec
805 MB 0.645805 MB/sec
806 MB 0.749738 MB/sec
806 MB 0.555384 MB/sec
807 MB 0.330456 MB/sec
807 MB 0.320096 MB/sec
807 MB 0.320502 MB/sec
808 MB 0.33026 MB/sec

and on a real disk:

[root@fraud trd]# ~/readfiles /dev/rd/c0d1p2
37 MB 37.5002 MB/sec
76 MB 38.8103 MB/sec
115 MB 38.8753 MB/sec
153 MB 37.6465 MB/sec
191 MB 38.223 MB/sec
229 MB 38.276 MB/sec
267 MB 38.3151 MB/sec
305 MB 37.3374 MB/sec
343 MB 37.6915 MB/sec
380 MB 37.7198 MB/sec
418 MB 37.5222 MB/sec
455 MB 37.1729 MB/sec
492 MB 37.2008 MB/sec
529 MB 36.2474 MB/sec
565 MB 36.7173 MB/sec
602 MB 36.6197 MB/sec
639 MB 36.5568 MB/sec
675 MB 36.4935 MB/sec
711 MB 36.1575 MB/sec
747 MB 36.0858 MB/sec
784 MB 36.1972 MB/sec
799 MB 15.1778 MB/sec
803 MB 4.11846 MB/sec
804 MB 1.33881 MB/sec
805 MB 0.927079 MB/sec
806 MB 0.790508 MB/sec
807 MB 0.679455 MB/sec
807 MB 0.316194 MB/sec
808 MB 0.305104 MB/sec
808 MB 0.317431 MB/sec

Interestingly, the 800M barrier is the same no matter how much memory
is in the machine (i.e. it's the same barrier for a machine with 2G as
for one with 1G).

So, anyone have any ideas?

I was prompted to do these tests when I saw kswapd and kreclaimd going
mad in large SPECsfs runs on a machine with 2G of memory. I suspect
that what is happening is that the meta-data throughput plummets
during the runs when the buffer cache reaches 800M in size. SPECsfs is
very meta-data intensive. Typical runs will create millions of files.

Cheers, Tridge

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  3:05 2.4.8preX VM problems Andrew Tridgell
@ 2001-08-01  2:26 ` Marcelo Tosatti
  2001-08-01  4:37   ` Andrew Tridgell
  2001-08-01  6:09   ` Andrew Tridgell
  0 siblings, 2 replies; 20+ messages in thread

From: Marcelo Tosatti @ 2001-08-01 2:26 UTC (permalink / raw)
To: Andrew Tridgell; +Cc: lkml

Andrew,

Can you reproduce the problem with 2.4.7 ?

On Tue, 31 Jul 2001, Andrew Tridgell wrote:

> I've been testing the 2.4.8preX kernels on machines with fairly large
> amounts of memory (greater than 1G) and have found them to have
> disasterously bad performance through the buffer cache. If the machine
> has 900M or less then it performs well, but above that the performance
> drops through the floor (by about a factor of 600).
>
> To see the effect use this:
>
> ftp://ftp.samba.org/pub/unpacked/junkcode/readfiles.c
>
> and this:
>
> ftp://ftp.samba.org/pub/unpacked/junkcode/trd/
>
> then do this:
>
> insmod dummy_disk.o dummy_size=80000000
> mknod /dev/ddisk b 241 0
> readfile /dev/ddisk
>
> "dummy_disk" is a dummy disk device (in this case iits 80G). All IOs
> to the device succeed, but don't actually do anything. This makes it
> easy to test very large disks on a small machine, and also eliminates
> interactions with particular block devices. It also allows you to
> unload the disk, which means you can easily start again with a clear
> buffer cache. You can see exactly the same effect with a real device
> if you would prefer not to load the dummy disk driver.
>
> You will see that the speed is good for the first 800M then drops off
> dramatically after that. Meanwhile, kswapd and kreclaimd go mad
> chewing lots of cpu.
>
> If you boot the machine with "mem=900M" then the problem goes away,
> with the performance staying high. If you boot with 950M or above
> then the throughput plummets once you have read more than 800M.
>
> Here is a sample run with 2.4.8pre3:
>
> [root@fraud trd]# ~/readfiles /dev/ddisk
> 211 MB 211.754 MB/sec
> 404 MB 192.866 MB/sec
> 579 MB 175.188 MB/sec
> 742 MB 163.017 MB/sec
> 794 MB 49.5844 MB/sec
> 795 MB 0.971527 MB/sec
> 796 MB 0.94948 MB/sec
> 797 MB 1.35205 MB/sec
> 799 MB 1.30931 MB/sec
> 800 MB 1.16104 MB/sec
> 801 MB 1.30607 MB/sec
> 803 MB 1.67914 MB/sec
> 804 MB 1.1175 MB/sec
> 805 MB 0.645805 MB/sec
> 806 MB 0.749738 MB/sec
> 806 MB 0.555384 MB/sec
> 807 MB 0.330456 MB/sec
> 807 MB 0.320096 MB/sec
> 807 MB 0.320502 MB/sec
> 808 MB 0.33026 MB/sec
>
> and on a real disk:
>
> [root@fraud trd]# ~/readfiles /dev/rd/c0d1p2
> 37 MB 37.5002 MB/sec
> 76 MB 38.8103 MB/sec
> 115 MB 38.8753 MB/sec
> 153 MB 37.6465 MB/sec
> 191 MB 38.223 MB/sec
> 229 MB 38.276 MB/sec
> 267 MB 38.3151 MB/sec
> 305 MB 37.3374 MB/sec
> 343 MB 37.6915 MB/sec
> 380 MB 37.7198 MB/sec
> 418 MB 37.5222 MB/sec
> 455 MB 37.1729 MB/sec
> 492 MB 37.2008 MB/sec
> 529 MB 36.2474 MB/sec
> 565 MB 36.7173 MB/sec
> 602 MB 36.6197 MB/sec
> 639 MB 36.5568 MB/sec
> 675 MB 36.4935 MB/sec
> 711 MB 36.1575 MB/sec
> 747 MB 36.0858 MB/sec
> 784 MB 36.1972 MB/sec
> 799 MB 15.1778 MB/sec
> 803 MB 4.11846 MB/sec
> 804 MB 1.33881 MB/sec
> 805 MB 0.927079 MB/sec
> 806 MB 0.790508 MB/sec
> 807 MB 0.679455 MB/sec
> 807 MB 0.316194 MB/sec
> 808 MB 0.305104 MB/sec
> 808 MB 0.317431 MB/sec
>
> Interestingly, the 800M barrier is the same no matter how much memory
> is in the machine (ie. its the same barrier for a machine with 2G as
> 1G).
>
> So, anyone have any ideas?
>
> I was prompted to do these tests when I saw kswapd and kreclaimd going
> mad in large SPECsfs runs on a machine with 2G of memory. I suspect
> that what is happening is that the meta data throughput plummets
> during the runs when the buffer cache reaches 800M in size. SPECsfs is
> very meta-data intensive. Typical runs will create millions of files.
>
> Cheers, Tridge
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  2:26 ` Marcelo Tosatti
@ 2001-08-01  4:37   ` Andrew Tridgell
  2001-08-01  3:32     ` Marcelo Tosatti
  2001-08-01  6:09   ` Andrew Tridgell
  1 sibling, 1 reply; 20+ messages in thread

From: Andrew Tridgell @ 2001-08-01 4:37 UTC (permalink / raw)
To: marcelo; +Cc: linux-kernel

Marcelo wrote:
> Can you reproduce the problem with 2.4.7 ?

no, it started with 2.4.8pre1. I am currently narrowing it down by
reverting pieces of that patch and I have successfully narrowed it
down to the changes in mm/vmscan.c. I have a 2.4.8pre3 kernel with a
hacked version of the 2.4.7 vmscan.c that doesn't show the problem.

I'll try to narrow it down a bit more this afternoon.

Cheers, Tridge

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  4:37 ` Andrew Tridgell
@ 2001-08-01  3:32   ` Marcelo Tosatti
  2001-08-01  5:43     ` Andrew Tridgell
  0 siblings, 1 reply; 20+ messages in thread

From: Marcelo Tosatti @ 2001-08-01 3:32 UTC (permalink / raw)
To: Andrew Tridgell; +Cc: lkml

On Tue, 31 Jul 2001, Andrew Tridgell wrote:

> Marcelo wrote:
> > Can you reproduce the problem with 2.4.7 ?
>
> no, it started with 2.4.8pre1. I am currently narrowing it down by
> reverting pieces of that patch and I have successfully narrowed it
> down to the changes in mm/vmscan.c. I have a 2.4.8pre3 kernel with a
> hacked version of the 2.4.7 vmscan.c that doesn't show the
> problem.

Could you please apply

  http://bazar.conectiva.com.br/~marcelo/patches/v2.4/2.4.7pre9/zoned.patch

on top of 2.4.7 and try to reproduce the problem ?

> I'll try to narrow it down a bit more this afternoon.

There are two possibilities:

1) The zoned approach code introduced in 2.4.8pre1 (that's why I asked
   you to apply the patch alone on top of 2.4.7).

2) The used-once code, also introduced in 2.4.8pre1.

It seems the problem only happens when the highmem zone is active,
right ?

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  3:32 ` Marcelo Tosatti
@ 2001-08-01  5:43   ` Andrew Tridgell
  0 siblings, 0 replies; 20+ messages in thread

From: Andrew Tridgell @ 2001-08-01 5:43 UTC (permalink / raw)
To: marcelo; +Cc: linux-kernel

Marcelo wrote:
> Could you please apply
> http://bazar.conectiva.com.br/~marcelo/patches/v2.4/2.4.7pre9/zoned.patch
> on top of 2.4.7 and try to reproduce the problem ?

yep, that's the culprit. Running an original 2.4.7 with the zoned
patch applied showed the same slowdowns as 2.4.8preX.

Looks like the zoned patch has a problem when the buffer cache grows
beyond 800M.

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  2:26 ` Marcelo Tosatti
  2001-08-01  4:37   ` Andrew Tridgell
@ 2001-08-01  6:09   ` Andrew Tridgell
  2001-08-01  6:10     ` Marcelo Tosatti
  1 sibling, 1 reply; 20+ messages in thread

From: Andrew Tridgell @ 2001-08-01 6:09 UTC (permalink / raw)
To: marcelo; +Cc: linux-kernel

Marcelo,

I've narrowed it down some more. If I apply the whole zone patch
except for this bit:

+		/*
+		 * If we are doing zone-specific laundering,
+		 * avoid touching pages from zones which do
+		 * not have a free shortage.
+		 */
+		if (zone && !zone_free_shortage(page->zone)) {
+			list_del(page_lru);
+			list_add(page_lru, &inactive_dirty_list);
+			continue;
+		}
+

then the behaviour is much better:

[root@fraud trd]# ~/readfiles /dev/ddisk
202 MB 202.125 MB/sec
394 MB 192.525 MB/sec
580 MB 185.487 MB/sec
755 MB 175.319 MB/sec
804 MB 41.3387 MB/sec
986 MB 182.5 MB/sec
1115 MB 114.862 MB/sec
1297 MB 182.276 MB/sec
1426 MB 128.983 MB/sec
1603 MB 164.939 MB/sec
1686 MB 82.9556 MB/sec
1866 MB 179.861 MB/sec
1930 MB 63.959 MB/sec

Even given that, the performance isn't exactly stunning. The
"dummy_disk" driver doesn't even do a memset or memcpy, so it should
really run at the full memory bandwidth of the machine. We are only
getting a fraction of that (it is a dual PIII/800 server). If I get
time I'll try some profiling.

I also notice that the system peaks at a maximum of just under 750M in
the buffer cache. The system has 1.2G of completely unused memory,
which I really expected to be consumed by something that is just
reading from a never-ending block device.
For example:

CPU0 states:  0.0% user, 67.1% system,  0.0% nice, 32.3% idle
CPU1 states:  0.0% user, 65.3% system,  0.0% nice, 34.1% idle
Mem:  2059660K av,  842712K used, 1216948K free, 0K shrd, 740816K buff
Swap: 1052216K av,       0K used, 1052216K free            9496K cached

  PID USER  PRI  NI  SIZE  RSS SHARE LC STAT %CPU %MEM  TIME COMMAND
  615 root   14   0   452  452   328  1 R    99.9  0.0  3:52 readfiles
    5 root    9   0     0    0     0  1 SW   31.3  0.0  1:03 kswapd
    6 root    9   0     0    0     0  0 SW    0.5  0.0  0:04 kreclaimd

I know this is a *long* way from a real-world benchmark, but I think
it is perhaps indicative of our buffer cache system getting a bit too
complex again :)

Cheers, Tridge

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  6:09 ` Andrew Tridgell
@ 2001-08-01  6:10   ` Marcelo Tosatti
  2001-08-01  8:13     ` Andrew Tridgell
  0 siblings, 1 reply; 20+ messages in thread

From: Marcelo Tosatti @ 2001-08-01 6:10 UTC (permalink / raw)
To: Andrew Tridgell; +Cc: linux-kernel

On Tue, 31 Jul 2001, Andrew Tridgell wrote:

> Marcelo,
>
> I've narrowed it down some more. If I apply the whole zone patch
> except for this bit:
>
> +		/*
> +		 * If we are doing zone-specific laundering,
> +		 * avoid touching pages from zones which do
> +		 * not have a free shortage.
> +		 */
> +		if (zone && !zone_free_shortage(page->zone)) {
> +			list_del(page_lru);
> +			list_add(page_lru, &inactive_dirty_list);
> +			continue;
> +		}
> +
>
> then the behaviour is much better:
>
> [root@fraud trd]# ~/readfiles /dev/ddisk
> 202 MB 202.125 MB/sec
> 394 MB 192.525 MB/sec
> 580 MB 185.487 MB/sec
> 755 MB 175.319 MB/sec
> 804 MB 41.3387 MB/sec
> 986 MB 182.5 MB/sec
> 1115 MB 114.862 MB/sec
> 1297 MB 182.276 MB/sec
> 1426 MB 128.983 MB/sec
> 1603 MB 164.939 MB/sec
> 1686 MB 82.9556 MB/sec
> 1866 MB 179.861 MB/sec
> 1930 MB 63.959 MB/sec
>
> Even given that, the performance isn't exactly stunning. The
> "dummy_disk" driver doesn't even do a memset or memcpy so it should
> really run at the full memory bandwidth of the machine. We are only
> getting a fraction of that (it is a dual PIII/800 server). If I get
> time I'll try some profiling.
>
> I also notice that the system peaks at a maximum of just under 750M in
> the buffer cache. The system has 1.2G of completely unused memory
> which I really expected to be consumed by something that is just
> reading from a never-ending block device.

That's expected: we cannot allocate buffercache pages on highmem.
> For example:
>
> CPU0 states:  0.0% user, 67.1% system,  0.0% nice, 32.3% idle
> CPU1 states:  0.0% user, 65.3% system,  0.0% nice, 34.1% idle
> Mem:  2059660K av,  842712K used, 1216948K free, 0K shrd, 740816K buff
> Swap: 1052216K av,       0K used, 1052216K free           9496K cached
>
>   PID USER  PRI  NI  SIZE  RSS SHARE LC STAT %CPU %MEM  TIME COMMAND
>   615 root   14   0   452  452   328  1 R    99.9  0.0  3:52 readfiles
>     5 root    9   0     0    0     0  1 SW   31.3  0.0  1:03 kswapd
>     6 root    9   0     0    0     0  0 SW    0.5  0.0  0:04 kreclaimd
>
> I know this is a *long* way from a real world benchmark, but I think
> it is perhaps indicative of our buffer cache system getting a bit too
> complex again :)

do_page_launder() stops the laundering loop (which frees pages), in
case it freed a buffercache page, as soon as there is no more global
free shortage (in the global scan case), or as soon as there is no
more free shortage for the specific zone we're scanning.

That's wrong: we should keep laundering pages if there is _any_ zone
under shortage.

Could you please try the patch below ? (against 2.4.8pre3)

--- linux.orig/mm/vmscan.c	Wed Aug  1 04:26:36 2001
+++ linux/mm/vmscan.c	Wed Aug  1 04:33:22 2001
@@ -593,13 +593,9 @@
 			 * If we're freeing buffer cache pages, stop when
 			 * we've got enough free memory.
 			 */
-			if (freed_page) {
-				if (zone) {
-					if (!zone_free_shortage(zone))
-						break;
-				} else if (!free_shortage())
-					break;
-			}
+			if (freed_page && !total_free_shortage())
+				break;
+
 			continue;
 		} else if (page->mapping && !PageDirty(page)) {
 			/*

^ permalink raw reply [flat|nested] 20+ messages in thread
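The total_free_shortage() helper the patch relies on is not shown in the
hunk. Its intended semantics — report a shortage while _any_ zone is still
short of free pages — can be modeled in miniature as follows. This is a
hypothetical toy model, not the kernel's actual implementation; the
zone_shortage array and NR_ZONES constant are invented stand-ins for the
real per-zone accounting:

```c
#include <assert.h>

#define NR_ZONES 3

/* Toy model of per-zone free shortages; in the kernel these would be
 * derived from zone->free_pages against the zone watermarks. */
static int zone_shortage[NR_ZONES];

static int zone_free_shortage(int zone)
{
	return zone_shortage[zone];
}

/* Guessed shape of total_free_shortage(): the sum over all zones, so it
 * stays non-zero while any single zone is still short. This is what lets
 * the laundering loop keep going for a starved lowmem zone even when the
 * global numbers look healthy. */
static int total_free_shortage(void)
{
	int sum = 0, i;

	for (i = 0; i < NR_ZONES; i++)
		sum += zone_free_shortage(i);
	return sum;
}
```

The contrast with the replaced code is the point: checking only the global
free_shortage() (or only the one zone being scanned) lets the loop stop
while another zone is still starved.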
* Re: 2.4.8preX VM problems
  2001-08-01  6:10 ` Marcelo Tosatti
@ 2001-08-01  8:13   ` Andrew Tridgell
  2001-08-01  8:13     ` Marcelo Tosatti
  0 siblings, 1 reply; 20+ messages in thread

From: Andrew Tridgell @ 2001-08-01 8:13 UTC (permalink / raw)
To: marcelo; +Cc: linux-kernel

Marcelo,

I'm afraid that didn't help. I get:

[root@skurk /root]# ./readfiles /dev/ddisk
362 MB 181.145 MB/sec
695 MB 166.455 MB/sec
811 MB 57.6077 MB/sec
812 MB 0.439532 MB/sec
813 MB 0.463901 MB/sec
814 MB 0.416093 MB/sec
815 MB 0.409958 MB/sec
816 MB 0.410413 MB/sec

> Could you please try the patch below ? (against 2.4.8pre3)
>
> --- linux.orig/mm/vmscan.c	Wed Aug  1 04:26:36 2001
> +++ linux/mm/vmscan.c	Wed Aug  1 04:33:22 2001
> @@ -593,13 +593,9 @@
>  			 * If we're freeing buffer cache pages, stop when
>  			 * we've got enough free memory.
>  			 */
> -			if (freed_page) {
> -				if (zone) {
> -					if (!zone_free_shortage(zone))
> -						break;
> -				} else if (!free_shortage())
> -					break;
> -			}
> +			if (freed_page && !total_free_shortage())
> +				break;
> +
>  			continue;
>  		} else if (page->mapping && !PageDirty(page)) {
>  			/*

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  8:13 ` Andrew Tridgell
@ 2001-08-01  8:13   ` Marcelo Tosatti
  2001-08-01 10:54     ` Andrew Tridgell
  2001-08-04  6:50     ` Anton Blanchard
  0 siblings, 2 replies; 20+ messages in thread

From: Marcelo Tosatti @ 2001-08-01 8:13 UTC (permalink / raw)
To: Andrew Tridgell; +Cc: lkml, Rik van Riel

Andrew,

The problem is pretty nasty: if there is no global shortage and only a
given zone with shortage, we set the zone free target to freepages.min
(basically no tasks can make progress with that amount of free
memory).

The following patch sets the zone free target to freepages.high. Can
you test it ? (I tried here and got the expected results)

Maybe pages_high is _too_ high to set the free target. We may want to
use pages_low for page freeing and pages_high for page writeout, or
something like that. (this way we keep the necessary amount of pages
to reach pages_high being written out)

I'll keep looking into this tomorrow. Going home now.

--- linux.orig/mm/page_alloc.c	Mon Jul 30 17:06:49 2001
+++ linux/mm/page_alloc.c	Wed Aug  1 06:21:35 2001
@@ -630,8 +630,8 @@
 		goto ret;
 
 	if (zone->inactive_clean_pages + zone->free_pages
-			< zone->pages_min) {
-		sum += zone->pages_min;
+			< zone->pages_high) {
+		sum += zone->pages_high;
 		sum -= zone->free_pages;
 		sum -= zone->inactive_clean_pages;
 	}

On Wed, 1 Aug 2001, Andrew Tridgell wrote:

> Marcelo,
>
> I'm afraid that didn't help. I get:
>
> [root@skurk /root]# ./readfiles /dev/ddisk
> 362 MB 181.145 MB/sec
> 695 MB 166.455 MB/sec
> 811 MB 57.6077 MB/sec
> 812 MB 0.439532 MB/sec
> 813 MB 0.463901 MB/sec
> 814 MB 0.416093 MB/sec
> 815 MB 0.409958 MB/sec
> 816 MB 0.410413 MB/sec
>
> > Could you please try the patch below ? (against 2.4.8pre3)
> >
> > --- linux.orig/mm/vmscan.c	Wed Aug  1 04:26:36 2001
> > +++ linux/mm/vmscan.c	Wed Aug  1 04:33:22 2001
> > @@ -593,13 +593,9 @@
> >  			 * If we're freeing buffer cache pages, stop when
> >  			 * we've got enough free memory.
> >  			 */
> > -			if (freed_page) {
> > -				if (zone) {
> > -					if (!zone_free_shortage(zone))
> > -						break;
> > -				} else if (!free_shortage())
> > -					break;
> > -			}
> > +			if (freed_page && !total_free_shortage())
> > +				break;
> > +
> >  			continue;
> >  		} else if (page->mapping && !PageDirty(page)) {
> >  			/*

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  8:13 ` Marcelo Tosatti
@ 2001-08-01 10:54   ` Andrew Tridgell
  2001-08-01 11:51     ` Mike Black
  1 sibling, 1 reply; 20+ messages in thread

From: Andrew Tridgell @ 2001-08-01 10:54 UTC (permalink / raw)
To: marcelo; +Cc: linux-kernel, riel

Marcelo,

> The following patch sets the zone free target to freepages.high. Can you
> test it ? (I tried here and got the expected results)

Running just that patch against 2.4.8pre3 gives:

[root@fraud /root]# ~/readfiles /dev/ddisk
198 MB 198.084 MB/sec
386 MB 188.634 MB/sec
570 MB 183.827 MB/sec
743 MB 172.5 MB/sec
810 MB 67.0501 MB/sec
862 MB 52.1381 MB/sec
901 MB 37.9501 MB/sec
957 MB 55.8253 MB/sec
998 MB 41.1541 MB/sec
1046 MB 48.1661 MB/sec
1088 MB 40.3898 MB/sec
1140 MB 50.8782 MB/sec
1183 MB 42.5749 MB/sec
1229 MB 46.1378 MB/sec
1275 MB 44.8515 MB/sec
1319 MB 43.5389 MB/sec
1368 MB 47.5747 MB/sec
1411 MB 42.8134 MB/sec

which is much better, but is pretty poor performance for a null
device. Running with that latest patch plus the patch you sent
previously gives roughly the same result.

Also, kswapd chews lots of cpu during these runs:

CPU0 states:  0.0% user, 79.0% system,  0.0% nice, 20.4% idle
CPU1 states:  0.2% user, 77.1% system,  0.0% nice, 22.1% idle
Mem:  2059088K av,  892256K used, 1166832K free, 0K shrd, 784972K buff
Swap: 1052216K av,       0K used, 1052216K free          10072K cached

  PID USER  PRI  NI  SIZE  RSS SHARE LC STAT %CPU %MEM  TIME COMMAND
  608 root   19   0   452  452   328  1 R    95.2  0.0  1:23 readfiles
    5 root   14   0     0    0     0  1 SW   58.3  0.0  0:52 kswapd
    6 root    9   0     0    0     0  1 RW    2.1  0.0  0:01 kreclaimd

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01 10:54 ` Andrew Tridgell
@ 2001-08-01 11:51   ` Mike Black
  2001-08-01 18:39     ` Daniel Phillips
  0 siblings, 1 reply; 20+ messages in thread

From: Mike Black @ 2001-08-01 11:51 UTC (permalink / raw)
To: tridge, marcelo; +Cc: linux-kernel, riel, Andrew Morton

This sounds a lot like the problem I've been having with ext3 and
raid. A one-thread tiobench performs just great. A two-thread
tiobench starts having lots of kswapd action when free memory gets
down to ~5Meg. ext3 exacerbates the problem. kswapd kicks up its
heels and starts grinding away (and NEVER swaps anything out). I've
been working this with Andrew Morton (the ext3 guy).

I have come to the opinion that kswapd needs to be a little smarter --
if it doesn't find anything to swap shouldn't it go to sleep a little
longer before trying again? That way it could gracefully degrade
itself when it's not making any progress.

In my testing (on a dual 1Ghz/2G machine) the machine "locks up" for
long periods of time while kswapd runs around trying to do its thing.
If I could disable kswapd I would, just to test this. I tried to
figure out how to lengthen the sleep time of kswapd but didn't have
time to chase it down (it wasn't intuitively obvious :-)

________________________________________
Michael D. Black  Principal Engineer
mblack@csihq.com  321-676-2923,x203
http://www.csihq.com  Computer Science Innovations
http://www.csihq.com/~mike  My home page
FAX 321-676-2355

----- Original Message -----
From: "Andrew Tridgell" <tridge@valinux.com>
To: <marcelo@conectiva.com.br>
Cc: <linux-kernel@vger.kernel.org>; <riel@conectiva.com.br>
Sent: Wednesday, August 01, 2001 6:54 AM
Subject: Re: 2.4.8preX VM problems

Marcelo,

> The following patch sets the zone free target to freepages.high. Can you
> test it ? (I tried here and got the expected results)

Running just that patch against 2.4.8pre3 gives:

[root@fraud /root]# ~/readfiles /dev/ddisk
198 MB 198.084 MB/sec
386 MB 188.634 MB/sec
570 MB 183.827 MB/sec
743 MB 172.5 MB/sec
810 MB 67.0501 MB/sec
862 MB 52.1381 MB/sec
901 MB 37.9501 MB/sec
957 MB 55.8253 MB/sec
998 MB 41.1541 MB/sec
1046 MB 48.1661 MB/sec
1088 MB 40.3898 MB/sec
1140 MB 50.8782 MB/sec
1183 MB 42.5749 MB/sec
1229 MB 46.1378 MB/sec
1275 MB 44.8515 MB/sec
1319 MB 43.5389 MB/sec
1368 MB 47.5747 MB/sec
1411 MB 42.8134 MB/sec

which is much better, but is pretty poor performance for a null
device. Running with that latest patch plus the patch you sent
previously gives roughly the same result.

Also, kswapd chews lots of cpu during these runs:

CPU0 states:  0.0% user, 79.0% system,  0.0% nice, 20.4% idle
CPU1 states:  0.2% user, 77.1% system,  0.0% nice, 22.1% idle
Mem:  2059088K av,  892256K used, 1166832K free, 0K shrd, 784972K buff
Swap: 1052216K av,       0K used, 1052216K free          10072K cached

  PID USER  PRI  NI  SIZE  RSS SHARE LC STAT %CPU %MEM  TIME COMMAND
  608 root   19   0   452  452   328  1 R    95.2  0.0  1:23 readfiles
    5 root   14   0     0    0     0  1 SW   58.3  0.0  0:52 kswapd
    6 root    9   0     0    0     0  1 RW    2.1  0.0  0:01 kreclaimd

^ permalink raw reply [flat|nested] 20+ messages in thread
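A back-off along the lines Mike suggests — sleep longer each time a scan
makes no progress, and reset on success — could be sketched as below. This
is a hypothetical illustration of the idea, not actual kernel code; the
next_sleep() helper and the tick values are invented:

```c
#include <assert.h>

#define MIN_SLEEP_TICKS 10	/* roughly HZ/10 on a 100 Hz kernel */
#define MAX_SLEEP_TICKS 100	/* cap the back-off at about a second */

/* Compute kswapd's next sleep interval: reset to the minimum whenever
 * the last scan freed pages, otherwise double the interval up to the
 * cap, so an unproductive kswapd backs off instead of spinning. */
static int next_sleep(int current_ticks, int pages_freed)
{
	if (pages_freed > 0)
		return MIN_SLEEP_TICKS;
	current_ticks *= 2;
	return current_ticks > MAX_SLEEP_TICKS ? MAX_SLEEP_TICKS
					       : current_ticks;
}
```

With this, a kswapd that repeatedly finds nothing reclaimable settles at
the one-second cap rather than burning cpu, which is the graceful
degradation Mike is asking for.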
* Re: 2.4.8preX VM problems
  2001-08-01 11:51 ` Mike Black
@ 2001-08-01 18:39   ` Daniel Phillips
  2001-08-11 12:06     ` Pavel Machek
  0 siblings, 1 reply; 20+ messages in thread

From: Daniel Phillips @ 2001-08-01 18:39 UTC (permalink / raw)
To: Mike Black, tridge, marcelo; +Cc: linux-kernel, riel, Andrew Morton

On Wednesday 01 August 2001 13:51, Mike Black wrote:
> I have come to the opinion that kswapd needs to be a little smarter
> -- if it doesn't find anything to swap shouldn't it go to sleep a
> little longer before trying again? That way it could gracefully
> degrade itself when it's not making any progress.
>
> In my testing (on a dual 1Ghz/2G machine) the machine "locks up" for
> long periods of time while kswapd runs around trying to do it's
> thing. If I could disable kswapd I would just to test this.

Your wish is my command. This patch provides a crude-but-effective
method of disabling kswapd, using:

  echo 1 >/proc/sys/kernel/disable_kswapd

I tested this with dbench and found it runs about half as fast, but
runs. This is reassuring because kswapd is supposed to be doing
something useful.
To apply:

  cd /usr/src/your.2.4.7.tree
  patch -p0 <this.patch

--- ../2.4.7.clean/include/linux/swap.h	Fri Jul 20 21:52:18 2001
+++ ./include/linux/swap.h	Wed Aug  1 19:35:27 2001
@@ -78,6 +78,7 @@
 	int next;			/* next entry on swap list */
 };
 
+extern int disable_kswapd;
 extern int nr_swap_pages;
 extern unsigned int nr_free_pages(void);
 extern unsigned int nr_inactive_clean_pages(void);
--- ../2.4.7.clean/include/linux/sysctl.h	Fri Jul 20 21:52:18 2001
+++ ./include/linux/sysctl.h	Wed Aug  1 19:35:28 2001
@@ -118,7 +118,8 @@
 	KERN_SHMPATH=48,	/* string: path to shm fs */
 	KERN_HOTPLUG=49,	/* string: path to hotplug policy agent */
 	KERN_IEEE_EMULATION_WARNINGS=50, /* int: unimplemented ieee instructions */
-	KERN_S390_USER_DEBUG_LOGGING=51  /* int: dumps of user faults */
+	KERN_S390_USER_DEBUG_LOGGING=51, /* int: dumps of user faults */
+	KERN_DISABLE_KSWAPD=52,	/* int: disable kswapd for testing */
 };
--- ../2.4.7.clean/kernel/sysctl.c	Thu Apr 12 21:20:31 2001
+++ ./kernel/sysctl.c	Wed Aug  1 19:35:28 2001
@@ -249,6 +249,8 @@
 	{KERN_S390_USER_DEBUG_LOGGING,"userprocess_debug",
 	 &sysctl_userprocess_debug,sizeof(int),0644,NULL,&proc_dointvec},
 #endif
+	{KERN_DISABLE_KSWAPD, "disable_kswapd", &disable_kswapd, sizeof (int),
+	 0644, NULL, &proc_dointvec},
 	{0}
 };
--- ../2.4.7.clean/mm/vmscan.c	Mon Jul  9 19:18:50 2001
+++ ./mm/vmscan.c	Wed Aug  1 19:35:28 2001
@@ -875,6 +875,8 @@
 DECLARE_WAIT_QUEUE_HEAD(kswapd_wait);
 DECLARE_WAIT_QUEUE_HEAD(kswapd_done);
 
+int disable_kswapd /* = 0 */;
+
 /*
  * The background pageout daemon, started as a kernel thread
  * from the init process.
@@ -915,6 +917,9 @@
 	 */
 	for (;;) {
 		static long recalc = 0;
+
+		while (disable_kswapd)
+			interruptible_sleep_on_timeout(&kswapd_wait, HZ/10);
 
 		/* If needed, try to free some memory. */
 		if (inactive_shortage() || free_shortage())

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01 18:39 ` Daniel Phillips
@ 2001-08-11 12:06   ` Pavel Machek
  2001-08-16 21:57     ` Daniel Phillips
  0 siblings, 1 reply; 20+ messages in thread

From: Pavel Machek @ 2001-08-11 12:06 UTC (permalink / raw)
To: Daniel Phillips
Cc: Mike Black, tridge, marcelo, linux-kernel, riel, Andrew Morton

Hi!

> > I have come to the opinion that kswapd needs to be a little smarter
> > -- if it doesn't find anything to swap shouldn't it go to sleep a
> > little longer before trying again? That way it could gracefully
> > degrade itself when it's not making any progress.
> >
> > In my testing (on a dual 1Ghz/2G machine) the machine "locks up" for
> > long periods of time while kswapd runs around trying to do it's
> > thing. If I could disable kswapd I would just to test this.
>
> Your wish is my command. This patch provides a crude-but-effective
> method of disabling kswapd, using:
>
>   echo 1 >/proc/sys/kernel/disable_kswapd
>
> I tested this with dbench and found it runs about half as fast, but
> runs. This is reassuring because kswapd is supposed to be doing
> something useful.

Why not just killall -STOP kswapd?

What is the expected state of the system without kswapd, BTW? Without
kflushd, I give up guaranteed time to get data safely to disk [and
it's useful for spindown]. What happens without kswapd?

								Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-11 12:06 ` Pavel Machek
@ 2001-08-16 21:57   ` Daniel Phillips
  0 siblings, 0 replies; 20+ messages in thread

From: Daniel Phillips @ 2001-08-16 21:57 UTC (permalink / raw)
To: Pavel Machek
Cc: Mike Black, tridge, marcelo, linux-kernel, riel, Andrew Morton

On August 11, 2001 02:06 pm, Pavel Machek wrote:
> > > I have come to the opinion that kswapd needs to be a little smarter
> > > -- if it doesn't find anything to swap shouldn't it go to sleep a
> > > little longer before trying again? That way it could gracefully
> > > degrade itself when it's not making any progress.
> > >
> > > In my testing (on a dual 1Ghz/2G machine) the machine "locks up" for
> > > long periods of time while kswapd runs around trying to do it's
> > > thing. If I could disable kswapd I would just to test this.
> >
> > Your wish is my command. This patch provides a crude-but-effective
> > method of disabling kswapd, using:
> >
> >   echo 1 >/proc/sys/kernel/disable_kswapd
> >
> > I tested this with dbench and found it runs about half as fast, but
> > runs. This is reassuring because kswapd is supposed to be doing
> > something useful.
>
> Why not just killall -STOP kswapd?

Because I didn't think of it, and I wanted some code for myself to do
real-time experimental tuning of the VM behaviour.

> What is expected state of system without kswapd, BTW? Without kflushd,
> I give up guaranteed time to get data safely to disk [and its usefull
> for spindown]. What happens without kswapd?

Without kswapd you lose much of the system's 'clean-ahead' performance
and it ends up reacting to try_to_free_pages calls initiated through
__alloc_pages when processes run out of free pages. This means lots
more synchronous waiting on page_launder and friends, but the system
still runs. It's a nice way to check how well the system's attempt to
anticipate demand is really working.

--
Daniel

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems
  2001-08-01  8:13 ` Marcelo Tosatti
  2001-08-01 10:54   ` Andrew Tridgell
@ 2001-08-04  6:50   ` Anton Blanchard
  2001-08-04  5:55     ` Marcelo Tosatti
  1 sibling, 1 reply; 20+ messages in thread

From: Anton Blanchard @ 2001-08-04 6:50 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Andrew Tridgell, lkml, Rik van Riel

Hi Marcelo,

> The problem is pretty nasty: if there is no global shortage and only a
> given zone with shortage, we set the zone free target to freepages.min
> (basically no tasks can make progress with that amount of free memory).

Paulus and I were seeing the same problem on a ppc with 2.4.8-pre3. We
were doing cat > /dev/null of about 5G of data; when we had close to
3G of page cache, kswapd chewed up all the cpu. Our guess was that
there was a shortage of lowmem pages (everything above 512M is highmem
on the ppc32 kernel, so there isn't much lowmem).

The patch below allowed us to get close to 4G of page cache before
things slowed down again and kswapd took over.

Anton

> --- linux.orig/mm/page_alloc.c	Mon Jul 30 17:06:49 2001
> +++ linux/mm/page_alloc.c	Wed Aug  1 06:21:35 2001
> @@ -630,8 +630,8 @@
>  		goto ret;
>
>  	if (zone->inactive_clean_pages + zone->free_pages
> -			< zone->pages_min) {
> -		sum += zone->pages_min;
> +			< zone->pages_high) {
> +		sum += zone->pages_high;
>  		sum -= zone->free_pages;
>  		sum -= zone->inactive_clean_pages;
>  	}

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems 2001-08-04 6:50 ` Anton Blanchard @ 2001-08-04 5:55 ` Marcelo Tosatti 2001-08-04 17:17 ` Anton Blanchard 0 siblings, 1 reply; 20+ messages in thread From: Marcelo Tosatti @ 2001-08-04 5:55 UTC (permalink / raw) To: Anton Blanchard; +Cc: Andrew Tridgell, lkml, Rik van Riel On Sat, 4 Aug 2001, Anton Blanchard wrote: > > Hi Marcelo, > > > The problem is pretty nasty: if there is no global shortage and only a > > given zone with shortage, we set the zone free target to freepages.min > > (basically no tasks can make progress with that amount of free memory). > > Paulus and I were seeing the same problem on a ppc with 2.4.8-pre3. We > were doing cat > /dev/null of about 5G of data, when we had close to 3G of > page cache kswapd chewed up all the cpu. Our guess was that there was a > shortage of lowmem pages (everything above 512M is highmem on the ppc32 > kernel so there isnt much lowmem). > > The patch below allowed us to get close to 4G of page cache before > things slowed down again and kswapd took over. How much memory do you have on the box ? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems 2001-08-04 5:55 ` Marcelo Tosatti @ 2001-08-04 17:17 ` Anton Blanchard 2001-08-06 22:58 ` Marcelo Tosatti 0 siblings, 1 reply; 20+ messages in thread From: Anton Blanchard @ 2001-08-04 17:17 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Andrew Tridgell, lkml, Rik van Riel > > The patch below allowed us to get close to 4G of page cache before > > things slowed down again and kswapd took over. > > How much memory do you have on the box ? It has 15G, so 512M of lowmem and 14.5G of highmem. Anton ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems 2001-08-04 17:17 ` Anton Blanchard @ 2001-08-06 22:58 ` Marcelo Tosatti 2001-08-07 17:18 ` Anton Blanchard 0 siblings, 1 reply; 20+ messages in thread From: Marcelo Tosatti @ 2001-08-06 22:58 UTC (permalink / raw) To: Anton Blanchard; +Cc: Andrew Tridgell, lkml, Rik van Riel On Sat, 4 Aug 2001, Anton Blanchard wrote: > > > > The patch below allowed us to get close to 4G of page cache before > > > things slowed down again and kswapd took over. > > > > How much memory do you have on the box ? > > It has 15G, so 512M of lowmem and 14.5G of highmem. Can you please use readprofile to find out where kswapd is spending its time when you reach 4G of pagecache ? I've never seen kswapd burn CPU time except cases where a lot of memory is anonymous and there is a need for lots of swap space allocations. (scan_swap_map() is where kswapd spends "all" of its time in such workloads) ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 2.4.8preX VM problems 2001-08-06 22:58 ` Marcelo Tosatti @ 2001-08-07 17:18 ` Anton Blanchard 2001-08-07 21:02 ` Kernel 2.4.6 & 2.4.7 networking performance: seeing serious delays in TCP layer depending upon packet length Ron Flory 0 siblings, 1 reply; 20+ messages in thread From: Anton Blanchard @ 2001-08-07 17:18 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Andrew Tridgell, lkml, Rik van Riel Hi Marcelo, > Can you please use readprofile to find out where kswapd is spending its > time when you reach 4G of pagecache ? > > I've never seen kswapd burn CPU time except cases where a lot of memory is > anonymous and there is a need for lots of swap space allocations. > (scan_swap_map() is where kswapd spends "all" of its time in such > workloads) I was doing a run with 512M lowmem and 2.5G highmem and found this:

__alloc_pages: 1-order allocation failed.
__alloc_pages: 1-order allocation failed.
__alloc_pages: 1-order allocation failed.

# cat /proc/meminfo
        total:      used:     free:  shared: buffers:    cached:
Mem:  3077513216 3067564032  9949184       0 13807616 2311172096
Swap: 2098176000          0 2098176000
MemTotal:      3005384 kB
MemFree:          9716 kB
MemShared:           0 kB
Buffers:         13484 kB
Cached:        2257004 kB
SwapCached:          0 kB
Active:         888916 kB
Inact_dirty:   1335552 kB
Inact_clean:     46020 kB
Inact_target:      316 kB
HighTotal:     2621440 kB
HighFree:         7528 kB
LowTotal:       383944 kB
LowFree:          2188 kB
SwapTotal:     2049000 kB
SwapFree:      2049000 kB

# readprofile | sort -nr | less
11967239 total                        7.3285
 7417874 idled                    45230.9390
  363813 do_page_launder            119.3612
  236764 ppc_irq_dispatch_handler   332.5337

I can split out the do_page_launder usage if you want. I had a quick look at the raw profile information and it appears that we are just looping a lot. Paulus and I moved the ppc32 kernel to load at 2G so we have 1.75G of lowmem. This has stopped the kswapd problem, but I thought the above information might be useful to you anyway. Anton ^ permalink raw reply [flat|nested] 20+ messages in thread
* Kernel 2.4.6 & 2.4.7 networking performance: seeing serious delays in TCP layer depending upon packet length 2001-08-07 17:18 ` Anton Blanchard @ 2001-08-07 21:02 ` Ron Flory 0 siblings, 0 replies; 20+ messages in thread From: Ron Flory @ 2001-08-07 21:02 UTC (permalink / raw) Cc: lkml, ron.flory This email contains 99 columns of information:

[1.] One line summary of the problem: Depending upon the length of TCP socket write requests, I'm seeing extremely long TCP delays, limiting throughput to only 25 datagrams per second. Data blocks above a 'magic' length have a much higher throughput than shorter blocks, which is counter-intuitive.

[2.] Full description of the problem/report: Compared to a normal 35 microseconds, I am seeing TCP stack delays on the order of 35 millisec when socket write lengths are BELOW a certain value. Depending upon whether I use an actual Ethernet or loopback device, this 'slow' packet length threshold is at 1447 bytes (Ethernet) or 16383 bytes (loopback). For some reason, the sending stack is delaying and/or waiting for an ACK between a 'short' write and the 'longer' TCP datagram that follows it. For all packets >= these magic lengths (1448/16384 respectively), the TCP stack does not wait for this ACK, and immediately sends the next datagram. For packets differing by only one byte in length, I see the packet rate jump from only 25 per second to several thousand per second. In the case of the loopback device, simply changing the packet length from 16383 to 16384 bytes results in a max transfer rate of 25 vs. 4250 packets per second. I think Alan will be interested in looking into this...

--------------- more (too much) background ----------------

To test various IP/ATM/FrameRelay network systems and interfaces, I wrote a small client/server app to pass as much network traffic as possible across a simulated customer network in our lab. The sequence is relatively simple:
- create server.
- create client (client contacts server, transfers data).
*Server:
    wait for client to request our max packet length (8 bytes)
    respond with our max packet length (8 bytes).
    while (1) {
      1. wait for client request packet payload (8 bytes).
      2. respond with ACK message indicating length (8 bytes).
      3. send all requested data to client (1..128 Kbytes).
    }

*Client:
    contact server, request max packet length (8 bytes)
    read response from server (8 bytes).
    while (1) {
      1. request payload packet from server (8 bytes).
      2. accept ACK packet (with length) from server (8 bytes).
      3. read all requested data from server (1..128 Kbytes).
    }

The test application can be provided if you don't have alternative (independent) test tools available.

[3.] Keywords (i.e., modules, networking, kernel): networking, sockets, performance, benchmark, kernel, TCP, IP

[4.] Kernel version (from /proc/version): Both Linux x86 2.4.6 and 2.4.7, non-patched sources downloaded from kernel.org, compiled locally.

[5.] Output of Oops.. message with symbolic information resolved (see Kernel Mailing List FAQ, Section 1.5): No oops, just delay...

[6.] A small shell script or example program which triggers the problem (if possible): application tar-archive available upon request. Not too small.

[7.] Environment: Local network, on a network switch, both Linux machines running RedHat 7.1 (x86). One machine is SCSI-based, the other is all-IDE. Disk subsystem not relevant.

[7.1.] Software (add the output of the ver_linux script here) [script is hosed]:
gcc 2.96 ('updated' versions from RH).
binutils 2.10.91.0.2
Linux C Lib 2.2.so ('updated' versions from RH).
libc.so.6 => /lib/i686/libc.so.6 (0x40021000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

[7.2.] Processor information (from /proc/cpuinfo): problem seen on several SMP/uniprocessor systems.

[7.3.] Module information (from /proc/modules): One machine uses Tulip network modules, the other uses 8139too. This problem is present in the loopback device as well, so physical drivers are probably not an issue.
SMC-Ultra is eth1 (not used in this case).

Module                  Size  Used by
smc-ultra               4992   1 (autoclean)
8390                    6928   0 (autoclean) [smc-ultra]
tulip                  39520   1 (autoclean)
st                     26640   0 (unused)

[7.4.] SCSI information (from /proc/scsi/scsi): N/A

[7.5.] Other information that might be relevant to the problem (please look in /proc and include all information that you think to be relevant):

[X.] Other notes, patches, fixes, workarounds: Protocol is TCP (SOCK_STREAM). NOTE: machine 172.22.112.101 (shown here as .101) is the server, .100 is the client.

Ethereal log of short/slow/bad packets (1447 bytes): (time is inter-packet)

 # Time      Src   Dst   Port Info
 1 0.000000  .100  .101  41359 > 4180 [SYN] Seq=4186558034 Ack=0 Win=5840 Len=0
 2 0.000083  .101  .100  4180 > 41359 [SYN, ACK] Seq=124141118 Ack=4186558035 Win=5792 Len=0
 3 0.000117  .100  .101  41359 > 4180 [ACK] Seq=4186558035 Ack=124141119 Win=5840 Len=0
 4 0.000061  .100  .101  41359 > 4180 [PSH, ACK] Seq=4186558035 Ack=124141119 Win=5840 Len=8
 5 0.000061  .101  .100  4180 > 41359 [ACK] Seq=124141119 Ack=4186558043 Win=5792 Len=0
 6 0.002295  .101  .100  4180 > 41359 [PSH, ACK] Seq=124141119 Ack=4186558043 Win=5792 Len=8
 7 0.000098  .100  .101  41359 > 4180 [ACK] Seq=4186558043 Ack=124141127 Win=5840 Len=0
 8 0.000111  .100  .101  41359 > 4180 [PSH, ACK] Seq=4186558043 Ack=124141127 Win=5840 Len=8
 9 0.000062  .101  .100  4180 > 41359 [PSH, ACK] Seq=124141127 Ack=4186558051 Win=5792 Len=8
10 0.036399  .100  .101  41359 > 4180 [ACK] Seq=4186558051 Ack=124141135 Win=5840 Len=0  <- DELAY
11 0.000074  .101  .100  4180 > 41359 [PSH, ACK] Seq=124141135 Ack=4186558051 Win=5792 Len=1447
12 0.000362  .100  .101  41359 > 4180 [ACK] Seq=4186558051 Ack=124142582 Win=8682 Len=0
13 0.000279  .100  .101  41359 > 4180 [FIN, ACK] Seq=4186558051 Ack=124142582 Win=8682 Len=0
14 0.000073  .101  .100  4180 > 41359 [FIN, ACK] Seq=124142582 Ack=4186558052 Win=5792 Len=0
15 0.000099  .100  .101  41359 > 4180 [ACK] Seq=4186558052 Ack=124142583 Win=8682 Len=0

Ethereal log of long/fast/good packets (1448 bytes):
(time is inter-packet)

 # Time      Src   Dst   Port Info
 1 0.000000  .100  .101  41363 > 4180 [SYN] Seq=4268751129 Ack=0 Win=5840 Len=0
 2 0.000077  .101  .100  4180 > 41363 [SYN, ACK] Seq=189815001 Ack=4268751130 Win=5792 Len=0
 3 0.000116  .100  .101  41363 > 4180 [ACK] Seq=4268751130 Ack=189815002 Win=5840 Len=0
 4 0.000060  .100  .101  41363 > 4180 [PSH, ACK] Seq=4268751130 Ack=189815002 Win=5840 Len=8
 5 0.000038  .101  .100  4180 > 41363 [ACK] Seq=189815002 Ack=4268751138 Win=5792 Len=0
 6 0.002467  .101  .100  4180 > 41363 [PSH, ACK] Seq=189815002 Ack=4268751138 Win=5792 Len=8
 7 0.000109  .100  .101  41363 > 4180 [ACK] Seq=4268751138 Ack=189815010 Win=5840 Len=0
 8 0.000122  .100  .101  41363 > 4180 [PSH, ACK] Seq=4268751138 Ack=189815010 Win=5840 Len=8
 9 0.000061  .101  .100  4180 > 41363 [PSH, ACK] Seq=189815010 Ack=4268751146 Win=5792 Len=8
10 0.000034  .101  .100  4180 > 41363 [ACK] Seq=189815018 Ack=4268751146 Win=5792 Len=1448
11 0.000351  .100  .101  41363 > 4180 [ACK] Seq=4268751146 Ack=189816466 Win=8688 Len=0
12 0.000520  .100  .101  41363 > 4180 [FIN, ACK] Seq=4268751146 Ack=189816466 Win=8688 Len=0
13 0.000065  .101  .100  4180 > 41363 [FIN, ACK] Seq=189816466 Ack=4268751147 Win=5792 Len=0
14 0.000095  .100  .101  41363 > 4180 [ACK] Seq=4268751147 Ack=189816467 Win=8688 Len=0

Line #9 in both logs above represents the server (.101) sending an 8-byte descriptor block back to the client (.100), informing it how much data will follow in the next 'write' (payload block). The payload 'write' immediately follows (no other delays or processing present). Of particular interest is line #10 of each log. For some reason, in the case of the 'short' packet, the server machine waits for a TCP ACK from the client machine (delaying 36 millisec), which does not happen if the packet is just 1 byte longer, in which case the server immediately follows the 8-byte block with the 1448-byte payload.
I realize there is probably some relationship with the 1500-byte MTU of Ethernet, but I do not see the relation with the 16k slow/fast boundary seen with the loopback device, or when the server/client are on the same machine. In this case, all packets < 16384 bytes wait approx 36 millisec for the client to ACK before sending the subsequent payload data block. Thanks - I hope this isn't a well-known issue that 'just has to be done this way'... ron ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2001-08-16 21:51 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-08-01  3:05 2.4.8preX VM problems Andrew Tridgell
2001-08-01  2:26 ` Marcelo Tosatti
2001-08-01  4:37 ` Andrew Tridgell
2001-08-01  3:32 ` Marcelo Tosatti
2001-08-01  5:43 ` Andrew Tridgell
2001-08-01  6:09 ` Andrew Tridgell
2001-08-01  6:10 ` Marcelo Tosatti
2001-08-01  8:13 ` Andrew Tridgell
2001-08-01  8:13 ` Marcelo Tosatti
2001-08-01 10:54 ` Andrew Tridgell
2001-08-01 11:51 ` Mike Black
2001-08-01 18:39 ` Daniel Phillips
2001-08-11 12:06 ` Pavel Machek
2001-08-16 21:57 ` Daniel Phillips
2001-08-04  6:50 ` Anton Blanchard
2001-08-04  5:55 ` Marcelo Tosatti
2001-08-04 17:17 ` Anton Blanchard
2001-08-06 22:58 ` Marcelo Tosatti
2001-08-07 17:18 ` Anton Blanchard
2001-08-07 21:02 ` Kernel 2.4.6 & 2.4.7 networking performance: seeing serious delays in TCP layer depending upon packet length Ron Flory
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox