* 2.6 vm/elevator loading down disks where 2.4 does not
From: Clint Byrum @ 2004-06-08 19:51 UTC
  To: linux-kernel

Sorry for the long email. I pruned it as much as possible...

The problem:

When we upgraded one of our production boxes (details below) to 2.6.6,
we noticed an immediate loss of 5-15 percent in efficiency. While these
boxes usually had less than 0.5% variation throughout the day, this box
was consistently doing 10% fewer searches than the others.

Upon investigation, we saw that the 2.6 box was reading from the disk
about 5 times as much as 2.4. In 2.4 we can almost completely saturate
the CPUs; they'll get to 90% on the real CPUs, and 15% on the virtual
CPUs. With 2.6, they never get above 60/10 because they are constantly
in io-wait state (which, under 2.4, is reported as idle, IIRC). I have
not done extensive testing of the anticipatory elevator, but it did
appear slower than deadline in early tests.

The vmstat runs at the bottom of this email were done in parallel on two
machines, receiving mostly identical amounts of real traffic. Traffic is
load balanced by mod_backhand to these machines, and is directly
responsive to system load with a granularity of 2 seconds, so the
2.6.6 box was actually getting somewhat *less* traffic. Notice how
much higher the 'bi' numbers are, for blocks in. As I said before,
expected variation is less than 0.5%.

This behavior is consistent and has been observed for over a month in
production now. I'm just looking for reasons why it's happening, and for
what needs to be profiled/tuned/fixed in order to track it down.

If you're still interested by this point, here is the background:

We have a bunch of identical dual P4 Xeon 2.8GHz machines. Each has an
Intel i865 chipset, 1GB of DDR RAM, and 2 x WDC 40GB 8MB-cache drives. All
are running RedHat 8.0 with security patches from fedoralegacy.org. All
are running vanilla kernel 2.4.23 or later, except one, which runs 2.6.6.
The 2.6.6 kernel was built directly from one of the other kernels'
.config with 'make oldconfig'. Two new options were selected:
CONFIG_4KSTACKS, and (on the kernel command line) elevator=deadline. The
two disks are set up in software RAID 1. There is no swap configured.
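
(For reference, the elevator is chosen at boot time; an illustrative
grub entry follows. The kernel image path and root device below are
placeholders, not our exact config.)

  title Linux 2.6.6
          kernel /boot/vmlinuz-2.6.6 ro root=/dev/md0 elevator=deadline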
 
These machines run text searches from web requests using a proprietary
file-based database (called Texis, http://www.thunderstone.com). The
data access patterns are generally "search through index files in a
tree-walking type of manner, then seek to data records in data files."
The index files total less than 300MB and are constantly accessed. The data
files total 3GB, but the data being read is very small... at most 40kB
at a time.
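
If anyone wants to approximate the workload without Texis, a crude
stand-in is a loop of random ~40kB reads scattered across a 3GB file
(data.db below is a hypothetical name; sizes are per the numbers above):

  # random 40kB reads from a 3GB file; 3GB / 40kB is ~78643 blocks
  for i in $(seq 1 1000); do
    off=$(( (RANDOM * 32768 + RANDOM) % 78643 ))
    dd if=data.db of=/dev/null bs=40k skip=$off count=1 2>/dev/null
  done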


And now for the real details:

---------------------VMSTAT 2.4.23-------------------------

$ free -m ; uptime ; vmstat 5 5
             total       used       free     shared    buffers     cached
Mem:          1007        983         24          0         12        812
-/+ buffers/cache:        157        850
Swap:            0          0          0
 14:51:06 up 113 days, 22:22,  1 user,  load average: 0.70, 0.59, 0.64
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0  25284  13284 832500    0    0     1     1    1     1 12  2 86  0
 1  0      0  20748  13312 832688    0    0    37    39  139    92  3  0 97  0
 3  0      0  16720  13348 833020    0    0    63    80  189   222 12  2 86  0
 0  0      0  23672  13376 833184    0    0    31    57  166   142  7  1 92  0
 1  0      0  16572  13412 833288    0    0    20    51  155   137  5  1 93  0

------------------END VMSTAT 2.4.23-------------------------

---------------------VMSTAT 2.6.6-------------------------

$ free -m ; uptime ; vmstat 5 5
             total       used       free     shared    buffers     cached
Mem:          1010        990         20          0         27        867
-/+ buffers/cache:         95        915
Swap:            0          0          0
 14:51:05 up 7 days,  1:17,  1 user,  load average: 0.59, 0.66, 0.76
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  1      0  20732  28220 888556    0    0    11    12   11    14  9  2 88  2
 0  0      0  27552  28260 883416    0    0   223   107  198   232 19  3 76  2
 0  0      0  26452  28276 884420    0    0   226    33  192   217  6  1 87  6
 0  0      0  26388  28308 884660    0    0    36    56  154   136  4  0 95  0
 0  0      0  25536  28344 885236    0    0   114    62  173   186  8  2 89  1

------------------END VMSTAT 2.6.6-------------------------



* Re: 2.6 vm/elevator loading down disks where 2.4 does not
From: Andrew Morton @ 2004-06-09 23:25 UTC
  To: Clint Byrum; +Cc: linux-kernel

Clint Byrum <cbyrum@spamaps.org> wrote:
>
> When we upgraded one of our production boxes (details below) to 2.6.6,
> we noticed an immediate loss of 5-15 percent in efficiency. While these
> boxes usually had less than 0.5% variation throughout the day, this box
> was consistently doing 10% fewer searches than the others.
> 
> Upon investigation, we saw that the 2.6 box was reading from the disk
> about 5 times as much as 2.4. In 2.4 we can almost completely saturate
> the CPUs; they'll get to 90% on the real CPUs, and 15% on the virtual
> CPUs. With 2.6, they never get above 60/10 because they are constantly
> in io-wait state (which, under 2.4, is reported as idle, IIRC).

Possibly a memory zone problem.  Could you try booting with "mem=896m" on
the kernel command line, see how that affects things?
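
After rebooting, something like this will confirm the clamp took effect
(MemTotal should read a bit below 896MB once the kernel's own
reservations are subtracted):

  $ grep MemTotal /proc/meminfo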


* Re: 2.6 vm/elevator loading down disks where 2.4 does not
From: Ray Lee @ 2004-06-10  1:03 UTC
  To: Clint Byrum; +Cc: Linux Kernel

> Upon investigation, we saw that the 2.6 box was reading from the disk
> about 5 times as much as 2.4.

I don't think this will account for the entire change in disk activity,
but 2.6.7-pre? contains a fix for the readahead code to prevent it from
reading extra sectors when they aren't needed.

The fix applies to 'seeky database type loads' which...

> The data access patterns are generally "search through index files in
> a tree-walking type of manner, then seek to data records in data
> files."

...sounds like it may apply to you.

So, if you and your server have some time, you might try 2.6.7-rc3 and
see if it changes any of the numbers.

Ray Lee



* Re: 2.6 vm/elevator loading down disks where 2.4 does not
From: Clint Byrum @ 2004-06-10  6:03 UTC
  To: Ray Lee; +Cc: Linux Kernel


On Wednesday, June 9, 2004, at 06:03 PM, Ray Lee wrote:

>> Upon investigation, we saw that the 2.6 box was reading from the disk
>> about 5 times as much as 2.4.
>
> I don't think this will account for the entire change in disk activity,
> but 2.6.7-pre? contains a fix for the readahead code to prevent it from
> reading extra sectors when they aren't needed.
>
> The fix applies to 'seeky database type loads' which...
>
>> The data access patterns are generally "search through index files in
>> a tree-walking type of manner, then seek to data records in data
>> files."
>
> ...sounds like it may apply to you.
>
> So, if you and your server have some time, you might try 2.6.7-rc3 and
> see if it changes any of the numbers.
>

I updated my 2.6 box to 2.6.7-rc3 a few hours ago. While it's not really
fair to pass judgement during such low-traffic times, things look about
the same. Where I see the 2.4 box not even touching the disks for a
minute or longer, the 2.6 box will hit them every 10 seconds at least,
and with 40 blocks or more each time. Three hours should be plenty of
time to get the indexes and data files mostly cached.

It almost seems to me like 2.6 is caching too much with each read, and 
therefore having to free other pages that really should have been left 
alone. This might even explain why 2.4 is maintaining a lot more free 
memory (75-80MB versus 2.6 having only 15-20MB free) and less cache. 
Oddly enough, I noticed that the blockdev command reports 256 as the 
readahead in 2.6.x, but 1024 in 2.4.x. I tried mucking with that value 
but it didn't make a whole lot of difference.
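
For anyone repeating the experiment, the knob in question looks like
this (md0 is our RAID1 array; the value is in 512-byte sectors, and
--setra needs root):

  $ blockdev --getra /dev/md0
  256
  $ blockdev --setra 1024 /dev/md0   # try the 2.4-style value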

Might it help to have some swap available, if for nothing else but to 
make the algorithms work better? Really the box should never use it; 
there are no daemons that go unused for longer than 10 minutes ...  and 
when I have turned on swap in the past, it used less than 2MB of it.
This machine, for all intents and purposes, should spend most of its
time searching an index and database that are already cached; 90% of
the searches are on similar terms.
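
If it ever seems worth testing, a small swap file is cheap to add (as
root; the path and the 512MB size here are arbitrary):

  dd if=/dev/zero of=/swapfile bs=1M count=512
  mkswap /swapfile
  swapon /swapfile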

That brings up an interesting point... is there a system-wide stat that
tells me how effective the file cache is? I guess majfaults/s fits that 
bill to some degree.



* Re: 2.6 vm/elevator loading down disks where 2.4 does not
From: William Lee Irwin III @ 2004-06-10  6:10 UTC
  To: Clint Byrum; +Cc: Ray Lee, Linux Kernel

On Wed, Jun 09, 2004 at 11:03:38PM -0700, Clint Byrum wrote:
> That brings up an interesting point... is there a system-wide stat that
> tells me how effective the file cache is? I guess majfaults/s fits that 
> bill to some degree.

/proc/vmstat should log global major/minor fault counters (actually
per-CPU counters summed on the fly). I fixed those to properly report
major and minor faults for 2.6. The analogous numbers where they are
present are completely and utterly meaningless gibberish in 2.4.
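
A quick way to watch the major-fault rate from those counters (a rough
sketch; the 5-second window is arbitrary):

  # system-wide major-fault rate over a 5-second sample
  a=$(awk '/^pgmajfault/ {print $2}' /proc/vmstat); sleep 5
  b=$(awk '/^pgmajfault/ {print $2}' /proc/vmstat)
  echo "major faults/s: $(( (b - a) / 5 ))"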


-- wli


* Re: 2.6 vm/elevator loading down disks where 2.4 does not
From: Clint Byrum @ 2004-06-11 16:45 UTC
  To: Andrew Morton; +Cc: linux-kernel

On Wed, 2004-06-09 at 16:25, Andrew Morton wrote:
> Possibly a memory zone problem.  Could you try booting with "mem=896m" on
> the kernel command line, see how that affects things?

This took a while longer to try, as I didn't want to unfairly test it
against a box with 1G of RAM. So I rebooted my 2.4.23 and 2.6.7-rc3 test
boxes with mem=896m. No change in the 5:1 ratio when comparing 2.6's
disk reads to 2.4's. Of course, both boxes ended up reading from the
disk more often, as they had less RAM for cache. I was unable to run as
long a test as before, but I'm confident that the 3 hours of tests I did
run show this isn't a memory zone problem.

I still think this behavior is happening because useful pages are being
removed from the page cache too soon. Maybe that is due to excessive
readahead?

-cb


