public inbox for linux-kernel@vger.kernel.org
* IA64 Linux VM performance woes.
@ 2004-04-13 18:53 Michael E. Thomadakis
  0 siblings, 0 replies; 3+ messages in thread
From: Michael E. Thomadakis @ 2004-04-13 18:53 UTC (permalink / raw)
  To: linux-kernel

Hello all.

We are trying to deploy a 128-PE SGI Altix 3700 running Linux, with 265GB of
main memory and 10TB of RAID disk (TP9500):

# cat /etc/redhat-release
Red Hat Linux Advanced Server release 2.1AS (Derry)

# cat /etc/sgi-release
SGI ProPack 2.4 for Linux, Build 240rp04032500_10054-0403250031

# uname -a
Linux c 2.4.21-sgi240rp04032500_10054 #1 SMP Thu Mar 25 00:45:27
PST 2004 ia64 unknown

We have been experiencing bad performance and downright bad behavior when we
are trying to read or write large files (10-100GB).

File Throughput Issues
----------------------
At first, the throughput we get without file cache bypass peaks at around
440MB/sec. This specific file system has LUNs whose primary FC paths go
over all four 2Gb/sec FC channels, so the maximum throughput should have been
close to 800MB/sec.

I've also noticed that the FC adapter driver threads run at 100% CPU
utilization when they are pumping data to the RAID for long periods. Is there
any data copying taking place in the drivers? The HBAs are from QLogic.


VM Untoward Behavior
--------------------
A more disturbing issue is that the system does NOT clean up the file cache,
and eventually all memory gets occupied by FS pages. Then the system simply
hangs.

We tried enabling / removing bootCPUsets, bcfree, and everything else available
to us. The crashes just keep coming. Recently we also started experiencing a
lot of 'Cannot do kernel page out at address' messages from the bdflush and
kupdated threads. This complicates any attempt to tune the FS to maximize
throughput and, ultimately, to set up sub-volumes on the RAID so that
different FS performance objectives can be attained.


Tuning bdflush/kupdated Behavior
--------------------------------

One of the main objectives at our center is to maximize file throughput for our
systems. We are a medium-size Supercomputing Center where compute- and I/O-
intensive numerical computation code runs in batch sub-systems. Several
programs expect and often generate very large files, on the order of 10-70GB.
Minimizing file access time is important in a batch environment, since
processors remain allocated and idle while data is shuttled back and forth
from the file system.

Another common problem is the competition between file cache and computation
pages. We definitely do NOT want file cache pages being kept while
computation pages are reclaimed.

As far as I know, the only place in Linux where the VM / file cache behavior
can be tuned is the 'bdflush/kupdated' settings. We need a good way to
tune the 'bdflush' parameters. I have been trying very hard to find in-depth
documentation on this.

Unfortunately, I have only gleaned some general and abstract advice on the
bdflush parameters, mainly from the kernel source documentation tree
(/usr/src/kernel/Documentation/).

For instance, what is a 'buffer'? Is it a fixed-size block (e.g., a VM page)
or can it be of any size? This matters because bdflush works with counts and
percentages of dirty buffers: a given count of large buffers implies much
more data transferred to the disks than the same count of small
buffers.
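For concreteness, the knobs in question are the nine integers in
/proc/sys/vm/bdflush (field order as given in Documentation/sysctl/vm.txt for
2.4 kernels; the exact fields vary a little between 2.4 releases, and the
values below are illustrative only, not recommendations):

```shell
# Show the current settings: nine integers on 2.4.x kernels.
cat /proc/sys/vm/bdflush

# Field order per Documentation/sysctl/vm.txt (2.4 kernels):
#   nfract ndirty dummy dummy interval age_buffer nfract_sync
#   nfract_stop_bdflush dummy
# nfract / nfract_sync / nfract_stop_bdflush are percentages of the
# buffer cache that may be dirty; ndirty caps how many buffers bdflush
# writes per pass; interval and age_buffer are in jiffies.

# Illustrative only: wake bdflush at 20% dirty, write up to 1000 buffers
# per pass, age buffers out after 1000 jiffies (~10 s at HZ=100), and
# sync-throttle writers at 60% dirty.
echo 20 1000 0 0 500 1000 60 20 0 > /proc/sys/vm/bdflush
```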

Controls that are Needed
------------------------
Ideally we need to:

1. Set an upper bound on the number of memory pages ever caching FS blocks.

2. Control the amount of data flushed out to disk in set time periods; that
is, we need to be able to match the long-term flushing rate with the service
rate the I/O subsystem is capable of delivering, tolerating possible transient
spikes. We also need to be able to control the amount of read-ahead and
write-behind, or even hint that data is only being streamed through, never to
be reused again.

3. Specify different parameters for 2., above, per file system: we have file
systems that are meant to transfer wide stripes of sequential data, vs. file
systems that need to perform well with smaller-block random I/O, vs. ones
that need to provide access to numerous smaller files. Per-file-system cache
percentages would also be useful.

4. Specify, if all else fails, what parts of the FS cache should be flushed in
the near future.

5. Provide in-depth technical documentation on the internal workings of the
file system cache, its interaction with the VM and the interaction of XFS/LVM
with the VM.

6. We operate IRIX Origins and IBM Regatta SMPs where all these issues have
been addressed to a far more satisfying degree than on Linux. Is the IRIX file
system cache going to be ported to Altix Linux? There is already a LOT of
experience in IRIX with these types of matters that should NOT remain
unleveraged.


Any information, hints, or pointers to in-depth discussion of the bugs and
tuning of the VM/FS and I/O subsystems, or other relevant topics, would be
GREATLY appreciated!

We are willing to share our experience with anyone who is interested in
improving any of the above kernel sub-systems and provide feedback with
experimental results and insights.

Thanks

Michael Thomadakis

Supercomputing Center
Texas A&M University

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: IA64 Linux VM performance woes
@ 2004-04-21 12:39 Satoshi Oshima
  2004-04-21 21:00 ` Marcelo Tosatti
  0 siblings, 1 reply; 3+ messages in thread
From: Satoshi Oshima @ 2004-04-21 12:39 UTC (permalink / raw)
  To: linux-kernel

Hello, Michael and all.

We have observed the same kind of performance issue.
In our case it is not a huge-scale IA64 system but an IA32
server system.

In our experiments, we see file I/O throughput decline on
servers with more than 8GB of memory. The kernel versions
we use are 2.6.0 and Red Hat AS3. Our experiment is described below.

Below are our hardware configuration and test bench.

CPUs: Xeon 1.6GHz - 4-way
Memory: 12GB
Storage: ATA 120GB

The file I/O workload generator consists of 1024 processes and
generates file writes of 100KB to 5MB. Using the "mem=" boot
option, we vary the recognized memory from 2GB to 12GB.

Below are the results (unit: MB/sec).

      2GB  4GB  8GB  12GB
2.6.0 13.1 18.5 18.4 16.1
AS3   11.0 11.3 10.3 8.92

The results show that the throughput decline occurs when the server
has more than 8GB of memory.

We agree that your proposal is a good idea. Setting an upper
bound on the number of cache memory pages reduces the cost of
reclaiming cache memory.

In general it is very difficult to build one system that
handles various types of workload well, so we hope Linux
will gain a kernel-parameter tuning interface.

We would be very happy to share information on managing
large-scale memory.

Thank you.

Satoshi Oshima,
Systems Development Laboratory
Hitachi Ltd.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: IA64 Linux VM performance woes
  2004-04-21 12:39 IA64 Linux VM performance woes Satoshi Oshima
@ 2004-04-21 21:00 ` Marcelo Tosatti
  0 siblings, 0 replies; 3+ messages in thread
From: Marcelo Tosatti @ 2004-04-21 21:00 UTC (permalink / raw)
  To: Satoshi Oshima; +Cc: linux-kernel

On Wed, Apr 21, 2004 at 09:39:23PM +0900, Satoshi Oshima wrote:
> Hello, Michael and all.
> 
> We have observed the same kind of performance issue.
> In our case it is not a huge-scale IA64 system but an IA32
> server system.
> 
> In our experiments, we see file I/O throughput decline on
> servers with more than 8GB of memory. The kernel versions
> we use are 2.6.0 and Red Hat AS3. Our experiment is described below.
> 
> Below are our hardware configuration and test bench.
> 
> CPUs: Xeon 1.6GHz - 4-way
> Memory: 12GB
> Storage: ATA 120GB
> 
> The file I/O workload generator consists of 1024 processes and
> generates file writes of 100KB to 5MB. Using the "mem=" boot
> option, we vary the recognized memory from 2GB to 12GB.
> 
> Below are the results (unit: MB/sec).
> 
>       2GB  4GB  8GB  12GB
> 2.6.0 13.1 18.5 18.4 16.1
> AS3   11.0 11.3 10.3 8.92
> 
> The results show that the throughput decline occurs when the
> server has more than 8GB of memory.

Can you share the tests with us? It would be great. 

> We agree that your proposal is a good idea. Setting an upper
> bound on the number of cache memory pages reduces the cost of
> reclaiming cache memory.

I'm not exactly sure of the problem (others (Andrea, Andrew, etc.) probably
are). Still, one useful thing would be to rerun the benchmarks on a recent
kernel (2.6.6-rc2, which contains a lot of VM rewrite and tuning). It would
be interesting to know the results.

> In general it is very difficult to build one system that
> handles various types of workload well, so we hope Linux
> will gain a kernel-parameter tuning interface.
> 
> We would be very happy to share information on managing
> large-scale memory.

^ permalink raw reply	[flat|nested] 3+ messages in thread
