public inbox for linux-kernel@vger.kernel.org
* Re: Reducing inode cache usage on 2.4?
  2004-12-17 17:26 Reducing inode cache usage on 2.4? James Pearson
@ 2004-12-17 15:12 ` Marcelo Tosatti
  2004-12-17 21:52   ` Willy Tarreau
  2004-12-18  0:32   ` James Pearson
  0 siblings, 2 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2004-12-17 15:12 UTC (permalink / raw)
  To: James Pearson; +Cc: linux-kernel


Hi James,

On Fri, Dec 17, 2004 at 05:26:20PM +0000, James Pearson wrote:
> I have an NFS server with 1Gb RAM running a 2.4.26 kernel with 2 XFS 
> file systems with about 2 million files in total.
> 
> Occasionally I get reports that the server is 'sticky' (slow 
> read/writes) and the inode cache appears to consume most of the 
> available memory and doesn't appear to reduce - a typical /proc/slabinfo 
> output is below.
> 
> If I run a simple application that grabs memory on the server, the inode 
> and other caches are reduced and the server becomes more responsive 
> (i.e. data rates to/from the server are restored to 'normal').
> 
> Is there any way I can purge the cached inode data, or any kernel 
> parameters I can tweak to limit the inode cache or flush it more frequently?
> 
> Or am I looking in completely the wrong place i.e. the inode cache is 
> not the problem?

No, in your case the extreme inode/dcache sizes indeed seem to be a problem. 

The default kernel shrinking ratio can be tuned for enhanced reclaim efficiency.

> xfs_inode         931428 931428    408 103492 103492    1 :  124   62
> dentry_cache      499222 518850    128 17295 17295    1 :  252  126

vm_vfs_scan_ratio:
------------------
is what proportion of the VFS queues we will scan in one go.
A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the
unused-inode, dentry and dquot caches will be freed during a
normal aging round.
Big fileservers (NFS, SMB etc.) probably want to set this
value to 3 or 2.

The default value is 6.
=============================================================

Tune /proc/sys/vm/vm_vfs_scan_ratio increasing the value to 10 and so on and 
examine the results.
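To make the documentation excerpt above concrete, here is the arithmetic it describes as a small standalone sketch (the helper is illustrative, not a kernel function): one aging round scans 1/vm_vfs_scan_ratio of the unused entries.

```c
#include <assert.h>

/* Illustrative helper (not a kernel function): how many unused
 * inode/dentry/dquot entries one aging round will scan, given the
 * vm_vfs_scan_ratio semantics quoted above (1/ratio per round). */
static unsigned long vfs_entries_scanned(unsigned long unused_entries,
                                         unsigned int scan_ratio)
{
    if (scan_ratio == 0)
        return 0;       /* guard; the sysctl is always >= 1 in practice */
    return unused_entries / scan_ratio;
}
```

With the ~931,428 cached XFS inodes shown above, the default ratio of 6 scans about 155,000 entries per round, while a ratio of 2 scans about 465,000 - so a smaller ratio means more aggressive shrinking.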


* Reducing inode cache usage on 2.4?
@ 2004-12-17 17:26 James Pearson
  2004-12-17 15:12 ` Marcelo Tosatti
  0 siblings, 1 reply; 18+ messages in thread
From: James Pearson @ 2004-12-17 17:26 UTC (permalink / raw)
  To: linux-kernel

I have an NFS server with 1Gb RAM running a 2.4.26 kernel with 2 XFS 
file systems with about 2 million files in total.

Occasionally I get reports that the server is 'sticky' (slow 
read/writes) and the inode cache appears to consume most of the 
available memory and doesn't appear to reduce - a typical /proc/slabinfo 
output is below.

If I run a simple application that grabs memory on the server, the inode 
and other caches are reduced and the server becomes more responsive 
(i.e. data rates to/from the server are restored to 'normal').

Is there any way I can purge the cached inode data, or any kernel 
parameters I can tweak to limit the inode cache or flush it more frequently?

Or am I looking in completely the wrong place i.e. the inode cache is 
not the problem?

Thanks

James Pearson

/proc/slabinfo:

slabinfo - version: 1.1 (SMP)
kmem_cache           104    104    148    4    4    1 :  252  126
nfs_write_data         0      0    352    0    0    1 :  124   62
nfs_read_data          0      0    352    0    0    1 :  124   62
nfs_page               0      0     96    0    0    1 :  252  126
ip_fib_hash           10    226     32    2    2    1 :  252  126
clip_arp_cache         0      0    128    0    0    1 :  252  126
ip_mrt_cache           0      0     96    0    0    1 :  252  126
tcp_tw_bucket         40     40     96    1    1    1 :  252  126
tcp_bind_bucket      143    226     32    2    2    1 :  252  126
tcp_open_request      59     59     64    1    1    1 :  252  126
inet_peer_cache       55    236     64    4    4    1 :  252  126
ip_dst_cache         520    520    192   26   26    1 :  252  126
arp_cache             47    210    128    7    7    1 :  252  126
blkdev_requests     5120   5160     96  129  129    1 :  252  126
xfs_chashlist      35838  40560     20  240  240    1 :  252  126
xfs_ili             6664   8652    140  309  309    1 :  252  126
xfs_ifork              0      0     56    0    0    1 :  252  126
xfs_efi_item          15     15    260    1    1    1 :  124   62
xfs_efd_item          15     15    260    1    1    1 :  124   62
xfs_buf_item         130    130    148    5    5    1 :  252  126
xfs_dabuf            202    202     16    1    1    1 :  252  126
xfs_da_state           0      0    336    0    0    1 :  124   62
xfs_trans             81    143    596    9   11    2 :  124   62
xfs_inode         931428 931428    408 103492 103492    1 :  124   62
xfs_btree_cur         58     58    132    2    2    1 :  252  126
xfs_bmap_free_item    252    253     12    1    1    1 :  252  126
page_buf_t           200    200    192   10   10    1 :  252  126
linvfs_icache     931425 931425    352 84675 84675    1 :  124   62
dnotify_cache          0      0     20    0    0    1 :  252  126
file_lock_cache       80     80     96    2    2    1 :  252  126
fasync_cache           0      0     16    0    0    1 :  252  126
uid_cache              8    113     32    1    1    1 :  252  126
skbuff_head_cache    673    680    192   34   34    1 :  252  126
sock                  75     75   1216   25   25    1 :   60   30
sigqueue              58     58    132    2    2    1 :  252  126
kiobuf                 0      0     64    0    0    1 :  252  126
cdev_cache            11    177     64    3    3    1 :  252  126
bdev_cache             5    118     64    2    2    1 :  252  126
mnt_cache             19    177     64    3    3    1 :  252  126
inode_cache          217    217    512   31   31    1 :  124   62
dentry_cache      499222 518850    128 17295 17295    1 :  252  126
dquot                  0      0    128    0    0    1 :  252  126
filp                 486    600    128   20   20    1 :  252  126
names_cache            3      3   4096    3    3    1 :   60   30
buffer_head        31305  34400     96  860  860    1 :  252  126
mm_struct            120    120    160    5    5    1 :  252  126
vm_area_struct       861    880     96   22   22    1 :  252  126
fs_cache             177    177     64    3    3    1 :  252  126
files_cache           63     63    416    7    7    1 :  124   62
signal_act            72     72   1312   24   24    1 :   60   30
size-131072(DMA)       0      0 131072    0    0   32 :    0    0
size-131072            0      0 131072    0    0   32 :    0    0
size-65536(DMA)        0      0  65536    0    0   16 :    0    0
size-65536             0      0  65536    0    0   16 :    0    0
size-32768(DMA)        0      0  32768    0    0    8 :    0    0
size-32768            24     24  32768   24   24    8 :    0    0
size-16384(DMA)        0      0  16384    0    0    4 :    0    0
size-16384            16     18  16384   16   18    4 :    0    0
size-8192(DMA)         0      0   8192    0    0    2 :    0    0
size-8192              7      8   8192    7    8    2 :    0    0
size-4096(DMA)         0      0   4096    0    0    1 :   60   30
size-4096            385    385   4096  385  385    1 :   60   30
size-2048(DMA)         0      0   2048    0    0    1 :   60   30
size-2048           1952   1952   2048  976  976    1 :   60   30
size-1024(DMA)         0      0   1024    0    0    1 :  124   62
size-1024            476    476   1024  119  119    1 :  124   62
size-512(DMA)          0      0    512    0    0    1 :  124   62
size-512             344    344    512   43   43    1 :  124   62
size-256(DMA)          0      0    256    0    0    1 :  252  126
size-256             892   1335    256   89   89    1 :  252  126
size-128(DMA)          0      0    128    0    0    1 :  252  126
size-128            4087   8130    128  271  271    1 :  252  126
size-64(DMA)           0      0     64    0    0    1 :  252  126
size-64            65813  90683     64 1537 1537    1 :  252  126
size-32(DMA)           0      0     32    0    0    1 :  252  126
size-32           421038 421038     32 3726 3726    1 :  252  126

/proc/meminfo:

         total:    used:    free:  shared: buffers:  cached:
Mem:  1057779712 1034821632 22958080        0    36864 136249344
Swap: 2147459072  2015232 2145443840
MemTotal:      1032988 kB
MemFree:         22420 kB
MemShared:           0 kB
Buffers:            36 kB
Cached:         132032 kB
SwapCached:       1024 kB
Active:          29204 kB
Inactive:       113520 kB
HighTotal:      131072 kB
HighFree:         7864 kB
LowTotal:       901916 kB
LowFree:         14556 kB
SwapTotal:     2097128 kB
SwapFree:      2095160 kB
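The slabinfo dump above can be cross-checked against meminfo: in the 2.4 slabinfo v1.1 format, multiplying a cache's total-slabs column by its pages-per-slab column gives its memory footprint. A sketch of that arithmetic (helper name is mine, assuming 4 KiB i386 pages):

```c
#include <assert.h>

/* Footprint of one slab cache, from the total-slabs and pages-per-slab
 * columns of a 2.4 /proc/slabinfo (v1.1) line, assuming 4 KiB pages.
 * Helper name is mine, for illustration. */
static unsigned long long slab_cache_bytes(unsigned long total_slabs,
                                           unsigned long pages_per_slab)
{
    return (unsigned long long)total_slabs * pages_per_slab * 4096;
}
```

xfs_inode (103492 single-page slabs) works out to roughly 404 MiB and linvfs_icache (84675 slabs) to roughly 330 MiB; together with dentry_cache's ~68 MiB that is on the order of 800 MiB of this 1 GB box, matching the report that the inode cache consumes most of the available memory.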



* Re: Reducing inode cache usage on 2.4?
  2004-12-17 15:12 ` Marcelo Tosatti
@ 2004-12-17 21:52   ` Willy Tarreau
  2004-12-18  0:32   ` James Pearson
  1 sibling, 0 replies; 18+ messages in thread
From: Willy Tarreau @ 2004-12-17 21:52 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: James Pearson, linux-kernel

Hi Marcelo,

On Fri, Dec 17, 2004 at 01:12:28PM -0200, Marcelo Tosatti wrote:
(...)
> The default kernel shrinking ratio can be tuned for enhanced reclaim efficiency.

Thanks for having explained this. Up to now, after several series of find or
other FS-intensive tasks, I often launched a simple home-made program to which
I tell how much memory I want it to allocate (and touch), then it exits freeing
this amount of memory. A bit dangerous but really effective indeed!
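A minimal sketch of such a memory-grabbing flusher (my own reconstruction from the description above, not Willy's actual program): allocate the requested amount, touch every page so it is really backed by RAM, then release it and exit.

```c
#include <stdlib.h>
#include <string.h>

/* Allocate and touch `mb` megabytes, forcing the VM to reclaim from
 * the page/inode/dentry caches, then release it.  Returns 0 on
 * success, -1 if the allocation failed. */
static int grab_and_touch(size_t mb)
{
    size_t len = mb << 20;
    char *p = malloc(len);

    if (p == NULL)
        return -1;
    memset(p, 0xaa, len);   /* write every page so it is really committed */
    free(p);
    return 0;
}
```

As Willy notes, this is a blunt instrument: overshooting the request can push the box into swap or wake the OOM killer.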

I too will try to play with vm_vfs_scan_ratio, it seems appealing.

Cheers,
Willy



* Re: Reducing inode cache usage on 2.4?
  2004-12-17 15:12 ` Marcelo Tosatti
  2004-12-17 21:52   ` Willy Tarreau
@ 2004-12-18  0:32   ` James Pearson
  2004-12-18  1:21     ` Andrew Morton
  2004-12-18 15:02     ` Marcelo Tosatti
  1 sibling, 2 replies; 18+ messages in thread
From: James Pearson @ 2004-12-18  0:32 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

Marcelo Tosatti wrote:
> 
>>Or am I looking in completely the wrong place i.e. the inode cache is 
>>not the problem?
> 
> 
> No, in your case the extreme inode/dcache sizes indeed seem to be a problem. 
> 
> The default kernel shrinking ratio can be tuned for enhanced reclaim efficiency.
> 
> 
>>xfs_inode         931428 931428    408 103492 103492    1 :  124   62
>>dentry_cache      499222 518850    128 17295 17295    1 :  252  126
> 
> 
> vm_vfs_scan_ratio:
> ------------------
> is what proportion of the VFS queues we will scan in one go.
> A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the
> unused-inode, dentry and dquot caches will be freed during a
> normal aging round.
> Big fileservers (NFS, SMB etc.) probably want to set this
> value to 3 or 2.
> 
> The default value is 6.
> =============================================================
> 
> Tune /proc/sys/vm/vm_vfs_scan_ratio increasing the value to 10 and so on and 
> examine the results.

Thanks for the info - but doesn't increasing the value of 
vm_vfs_scan_ratio mean that less of the caches will be freed?

Doing a few tests (on another test file system with 2 million or so 
files and 1Gb of memory) running 'find $disk -type f', with 
vm_vfs_scan_ratio set to 6 (or 10), the first two column values for 
xfs_inode, linvfs_icache and dentry_cache in /proc/slabinfo reach about 
900000 and stay around that value. With vm_vfs_scan_ratio set to 1, 
each value still reaches 900000, but then falls to a few thousand, 
climbs back up to 900000, drops away again, and repeats.

This still happens when I cat many large files (100Mb) to /dev/null at 
the same time as running the find i.e. the inode caches can still reach 
90% of the memory before being reclaimed (with vm_vfs_scan_ratio set to 1).

If I stop the find process when the inode caches reach about 90% of the 
memory, and then start cat'ing the large files, it appears the inode 
caches are never reclaimed (or at least not in the time it takes to cat 
100Gb of data to /dev/null) - is this expected behaviour?

It seems the inode cache has priority over cached file data.

What triggers the 'normal ageing round'? Is it possible to trigger this 
earlier (at a lower memory usage), or give a higher priority to cached data?

Thanks

James Pearson



* Re: Reducing inode cache usage on 2.4?
  2004-12-18  0:32   ` James Pearson
@ 2004-12-18  1:21     ` Andrew Morton
  2004-12-18 11:02       ` Marcelo Tosatti
  2004-12-20 19:20       ` Andrea Arcangeli
  2004-12-18 15:02     ` Marcelo Tosatti
  1 sibling, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2004-12-18  1:21 UTC (permalink / raw)
  To: James Pearson; +Cc: marcelo.tosatti, linux-kernel

James Pearson <james-p@moving-picture.com> wrote:
>
> It seems the inode cache has priority over cached file data.

It does.  If the machine is full of unmapped clean pagecache pages the
kernel won't even try to reclaim inodes.  This should help a bit:

--- 24/mm/vmscan.c~a	2004-12-17 17:18:31.660254712 -0800
+++ 24-akpm/mm/vmscan.c	2004-12-17 17:18:41.821709936 -0800
@@ -659,13 +659,13 @@ int fastcall try_to_free_pages_zone(zone
 
 		do {
 			nr_pages = shrink_caches(classzone, gfp_mask, nr_pages, &failed_swapout);
-			if (nr_pages <= 0)
-				return 1;
 			shrink_dcache_memory(vm_vfs_scan_ratio, gfp_mask);
 			shrink_icache_memory(vm_vfs_scan_ratio, gfp_mask);
 #ifdef CONFIG_QUOTA
 			shrink_dqcache_memory(vm_vfs_scan_ratio, gfp_mask);
 #endif
+			if (nr_pages <= 0)
+				return 1;
 			if (!failed_swapout)
 				failed_swapout = !swap_out(classzone);
 		} while (--tries);
_


>  What triggers the 'normal ageing round'? Is it possible to trigger this 
>  earlier (at a lower memory usage), or give a higher priority to cached data?

You could also try lowering /proc/sys/vm/vm_mapped_ratio.  That will cause
inodes to be reaped more easily, but will also cause more swapout.
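Tunables like vm_mapped_ratio and vm_vfs_scan_ratio are plain integer files under /proc/sys, so they can be set from a tiny helper as well as from the shell. A sketch (the call shown in the comment is hypothetical; the right value is workload-dependent, and the paths only exist on 2.4 trees carrying these sysctls):

```c
#include <stdio.h>

/* Write an integer to a /proc/sys-style sysctl file, e.g. the
 * hypothetical call write_sysctl("/proc/sys/vm/vm_mapped_ratio", 50).
 * Returns 0 on success, -1 on error (no such file, no permission). */
static int write_sysctl(const char *path, int value)
{
    FILE *f = fopen(path, "w");
    int rc = -1;

    if (f == NULL)
        return -1;
    if (fprintf(f, "%d\n", value) > 0)
        rc = 0;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}
```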


* Re: Reducing inode cache usage on 2.4?
  2004-12-18  1:21     ` Andrew Morton
@ 2004-12-18 11:02       ` Marcelo Tosatti
  2004-12-20 13:47         ` James Pearson
  2004-12-20 19:20       ` Andrea Arcangeli
  1 sibling, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2004-12-18 11:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: James Pearson, linux-kernel


James,

Can you apply Andrew's patch and examine the results?

I've merged it to mainline because it looks sensible.

Thanks Andrew!

On Fri, Dec 17, 2004 at 05:21:04PM -0800, Andrew Morton wrote:
> James Pearson <james-p@moving-picture.com> wrote:
> >
> > It seems the inode cache has priority over cached file data.
> 
> It does.  If the machine is full of unmapped clean pagecache pages the
> kernel won't even try to reclaim inodes.  This should help a bit:
> 
> --- 24/mm/vmscan.c~a	2004-12-17 17:18:31.660254712 -0800
> +++ 24-akpm/mm/vmscan.c	2004-12-17 17:18:41.821709936 -0800
> @@ -659,13 +659,13 @@ int fastcall try_to_free_pages_zone(zone
>  
>  		do {
>  			nr_pages = shrink_caches(classzone, gfp_mask, nr_pages, &failed_swapout);
> -			if (nr_pages <= 0)
> -				return 1;
>  			shrink_dcache_memory(vm_vfs_scan_ratio, gfp_mask);
>  			shrink_icache_memory(vm_vfs_scan_ratio, gfp_mask);
>  #ifdef CONFIG_QUOTA
>  			shrink_dqcache_memory(vm_vfs_scan_ratio, gfp_mask);
>  #endif
> +			if (nr_pages <= 0)
> +				return 1;
>  			if (!failed_swapout)
>  				failed_swapout = !swap_out(classzone);
>  		} while (--tries);
> _
> 
> 
> >  What triggers the 'normal ageing round'? Is it possible to trigger this 
> >  earlier (at a lower memory usage), or give a higher priority to cached data?
> 
> You could also try lowering /proc/sys/vm/vm_mapped_ratio.  That will cause
> inodes to be reaped more easily, but will also cause more swapout.


* Re: Reducing inode cache usage on 2.4?
  2004-12-18  0:32   ` James Pearson
  2004-12-18  1:21     ` Andrew Morton
@ 2004-12-18 15:02     ` Marcelo Tosatti
  1 sibling, 0 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2004-12-18 15:02 UTC (permalink / raw)
  To: James Pearson; +Cc: linux-kernel, Andrew Morton

On Sat, Dec 18, 2004 at 12:32:54AM +0000, James Pearson wrote:
> Marcelo Tosatti wrote:
> >
> >>Or am I looking in completely the wrong place i.e. the inode cache is 
> >>not the problem?
> >
> >
> >No, in your case the extreme inode/dcache sizes indeed seem to be a 
> >problem. 
> >The default kernel shrinking ratio can be tuned for enhanced reclaim 
> >efficiency.
> >
> >
> >>xfs_inode         931428 931428    408 103492 103492    1 :  124   62
> >>dentry_cache      499222 518850    128 17295 17295    1 :  252  126
> >
> >
> >vm_vfs_scan_ratio:
> >------------------
> >is what proportion of the VFS queues we will scan in one go.
> >A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the
> >unused-inode, dentry and dquot caches will be freed during a
> >normal aging round.
> >Big fileservers (NFS, SMB etc.) probably want to set this
> >value to 3 or 2.
> >
> >The default value is 6.
> >=============================================================
> >
> >Tune /proc/sys/vm/vm_vfs_scan_ratio increasing the value to 10 and so on 
> >and examine the results.
> 
> Thanks for the info - but doesn't increasing the value of 
> vm_vfs_scan_ratio mean that less of the caches will be freed?

Right - what I said was wrong - it's the other way around:
decreasing the value increases the proportion of the VFS caches scanned at each "aging pass".

Now Andrew has changed the aging round behaviour. 

Quoting him: "If the machine is full of unmapped clean pagecache pages the kernel 
won't even try to reclaim inodes".

vm_vfs_scan_ratio is now more meaningful. 

kswapd is awakened as soon as a zone's low watermark is reached, and will
work to free pages until it reaches the zone's high watermark.

There are three zones: DMA (1), Normal (2) and Highmem (3).

 * On machines where it is needed (eg PCs) we divide physical memory
 * into multiple physical zones. On a PC we have 3 zones:
 *
 * ZONE_DMA       < 16 MB       ISA DMA capable memory
 * ZONE_NORMAL  16-896 MB       direct mapped by the kernel
 * ZONE_HIGHMEM  > 896 MB       only page cache and user processes

So these thresholds are used to calculate each zone's min, low and high
watermarks using the following calculation (mm/page_alloc.c):

	mask = (realsize / zone_balance_ratio[j]);
	if (mask < zone_balance_min[j])
		mask = zone_balance_min[j];
	else if (mask > zone_balance_max[j])
		mask = zone_balance_max[j];
	zone->watermarks[j].min = mask;
	zone->watermarks[j].low = mask*2;
	zone->watermarks[j].high = mask*3;

To trigger the normal aging round earlier the "low" watermark has to be
increased, but it is better to increase the "high" watermark, which makes
kswapd keep working until that higher free-page watermark is reached; one
can try, for example:

	zone->watermarks[j].high = mask*4;

But hopefully you won't need such a modification with Andrew's change (it
would be nice if these were all boot-configurable, BTW).
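Plugging numbers into the calculation Marcelo quotes makes the watermarks concrete. A standalone sketch, with the clamp constants hard-coded to my reading of the stock 2.4 defaults (zone_balance_ratio 128, bounds 20 and 255 pages - verify against your tree's zone_balance_* arrays):

```c
#include <assert.h>

struct zone_watermarks {
    unsigned long min, low, high;   /* in pages */
};

/* 2.4-style watermark calculation for one zone, with the assumed
 * defaults hard-coded: ratio 128, clamp to [20, 255] pages. */
static struct zone_watermarks calc_watermarks(unsigned long realsize_pages)
{
    unsigned long mask = realsize_pages / 128;  /* zone_balance_ratio */

    if (mask < 20)                              /* zone_balance_min */
        mask = 20;
    else if (mask > 255)                        /* zone_balance_max */
        mask = 255;

    struct zone_watermarks w = { mask, mask * 2, mask * 3 };
    return w;
}
```

For the ~880 MB Normal zone on James's box (~225,280 pages) the mask clamps to 255, giving min/low/high of 255/510/765 pages, i.e. only about 1-3 MB; a 16 MB DMA zone (4096 pages) gets 32/64/96. Raising high to mask*4 as suggested would move the Normal zone's target to 1020 pages.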


* Re: Reducing inode cache usage on 2.4?
  2004-12-20 13:47         ` James Pearson
@ 2004-12-20 12:46           ` Marcelo Tosatti
  2004-12-20 15:10             ` Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2004-12-20 12:46 UTC (permalink / raw)
  To: James Pearson; +Cc: Andrew Morton, linux-kernel, andrea

On Mon, Dec 20, 2004 at 01:47:46PM +0000, James Pearson wrote:
> I've tested the patch on my test setup - running a 'find $disk -type f' 
> and a cat of large files to /dev/null at the same time does indeed 
> reduce the size of the inode and dentry caches considerably - the first 
> column numbers for xfs_inode, linvfs_icache and dentry_cache in 
> /proc/slabinfo hover at about 400-600 (over 900000 previously).
> 
> However, is this going a bit too far the other way? When I boot the 
> machine with 4Gb RAM, the inode and dentry caches are squeezed to the 
> same amounts, but it may be the case that it would be more beneficial to 
> have more in the inode and dentry caches? i.e. I guess some sort of 
> tunable factor that limits the minimum size of the inode and dentry 
> caches in this case?

One can increase vm_vfs_scan_ratio if required, but hopefully this change
will benefit all workloads.

Andrew, Andrea, can you think of any workloads which might be hurt by this change?

> But saying that, I notice my 'find $disk -type f' (with about 2 million 
> files) runs a lot faster with the smaller inode/dentry caches - about 1 
> or 2 minutes with the patched kernel compared with about 5 to 7 minutes 
> with the unpatched kernel - I guess it was taking longer to search the 
> inode/dentry cache than reading direct from disk.

Wonderful.

> 
> James Pearson
> 
> Marcelo Tosatti wrote:
> >James,
> >
> >Can you apply Andrew's patch and examine the results?
> >
> >I've merged it to mainline because it looks sensible.
> >
> >Thanks Andrew!
> >
> >On Fri, Dec 17, 2004 at 05:21:04PM -0800, Andrew Morton wrote:
> >
> >>James Pearson <james-p@moving-picture.com> wrote:
> >>
> >>>It seems the inode cache has priority over cached file data.
> >>
> >>It does.  If the machine is full of unmapped clean pagecache pages the
> >>kernel won't even try to reclaim inodes.  This should help a bit:
> >>
> >>--- 24/mm/vmscan.c~a	2004-12-17 17:18:31.660254712 -0800
> >>+++ 24-akpm/mm/vmscan.c	2004-12-17 17:18:41.821709936 -0800
> >>@@ -659,13 +659,13 @@ int fastcall try_to_free_pages_zone(zone
> >>
> >>		do {
> >>			nr_pages = shrink_caches(classzone, gfp_mask, 
> >>			nr_pages, &failed_swapout);
> >>-			if (nr_pages <= 0)
> >>-				return 1;
> >>			shrink_dcache_memory(vm_vfs_scan_ratio, gfp_mask);
> >>			shrink_icache_memory(vm_vfs_scan_ratio, gfp_mask);
> >>#ifdef CONFIG_QUOTA
> >>			shrink_dqcache_memory(vm_vfs_scan_ratio, gfp_mask);
> >>#endif
> >>+			if (nr_pages <= 0)
> >>+				return 1;
> >>			if (!failed_swapout)
> >>				failed_swapout = !swap_out(classzone);
> >>		} while (--tries);
> >>_
> >>
> >>
> >>
> >>>What triggers the 'normal ageing round'? Is it possible to trigger this 
> >>>earlier (at a lower memory usage), or give a higher priority to cached 
> >>>data?
> >>
> >>You could also try lowering /proc/sys/vm/vm_mapped_ratio.  That will cause
> >>inodes to be reaped more easily, but will also cause more swapout.
> >
> >


* Re: Reducing inode cache usage on 2.4?
  2004-12-18 11:02       ` Marcelo Tosatti
@ 2004-12-20 13:47         ` James Pearson
  2004-12-20 12:46           ` Marcelo Tosatti
  0 siblings, 1 reply; 18+ messages in thread
From: James Pearson @ 2004-12-20 13:47 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrew Morton, linux-kernel

I've tested the patch on my test setup - running a 'find $disk -type f' 
and a cat of large files to /dev/null at the same time does indeed 
reduce the size of the inode and dentry caches considerably - the first 
column numbers for xfs_inode, linvfs_icache and dentry_cache in 
/proc/slabinfo hover at about 400-600 (over 900000 previously).

However, is this going a bit too far the other way? When I boot the 
machine with 4Gb RAM, the inode and dentry caches are squeezed to the 
same amounts, but it may be the case that it would be more beneficial to 
have more in the inode and dentry caches? i.e. I guess some sort of 
tunable factor that limits the minimum size of the inode and dentry 
caches in this case?

But saying that, I notice my 'find $disk -type f' (with about 2 million 
files) runs a lot faster with the smaller inode/dentry caches - about 1 
or 2 minutes with the patched kernel compared with about 5 to 7 minutes 
with the unpatched kernel - I guess it was taking longer to search the 
inode/dentry cache than reading direct from disk.

James Pearson

Marcelo Tosatti wrote:
> James,
> 
> Can you apply Andrew's patch and examine the results?
> 
> I've merged it to mainline because it looks sensible.
> 
> Thanks Andrew!
> 
> On Fri, Dec 17, 2004 at 05:21:04PM -0800, Andrew Morton wrote:
> 
>>James Pearson <james-p@moving-picture.com> wrote:
>>
>>>It seems the inode cache has priority over cached file data.
>>
>>It does.  If the machine is full of unmapped clean pagecache pages the
>>kernel won't even try to reclaim inodes.  This should help a bit:
>>
>>--- 24/mm/vmscan.c~a	2004-12-17 17:18:31.660254712 -0800
>>+++ 24-akpm/mm/vmscan.c	2004-12-17 17:18:41.821709936 -0800
>>@@ -659,13 +659,13 @@ int fastcall try_to_free_pages_zone(zone
>> 
>> 		do {
>> 			nr_pages = shrink_caches(classzone, gfp_mask, nr_pages, &failed_swapout);
>>-			if (nr_pages <= 0)
>>-				return 1;
>> 			shrink_dcache_memory(vm_vfs_scan_ratio, gfp_mask);
>> 			shrink_icache_memory(vm_vfs_scan_ratio, gfp_mask);
>> #ifdef CONFIG_QUOTA
>> 			shrink_dqcache_memory(vm_vfs_scan_ratio, gfp_mask);
>> #endif
>>+			if (nr_pages <= 0)
>>+				return 1;
>> 			if (!failed_swapout)
>> 				failed_swapout = !swap_out(classzone);
>> 		} while (--tries);
>>_
>>
>>
>>
>>> What triggers the 'normal ageing round'? Is it possible to trigger this 
>>> earlier (at a lower memory usage), or give a higher priority to cached data?
>>
>>You could also try lowering /proc/sys/vm/vm_mapped_ratio.  That will cause
>>inodes to be reaped more easily, but will also cause more swapout.
> 
> 



* Re: Reducing inode cache usage on 2.4?
  2004-12-20 15:10             ` Andrea Arcangeli
@ 2004-12-20 15:06               ` Marcelo Tosatti
  2004-12-20 17:54                 ` Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2004-12-20 15:06 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: James Pearson, Andrew Morton, linux-kernel

On Mon, Dec 20, 2004 at 04:10:45PM +0100, Andrea Arcangeli wrote:
> On Mon, Dec 20, 2004 at 10:46:04AM -0200, Marcelo Tosatti wrote:
> > On Mon, Dec 20, 2004 at 01:47:46PM +0000, James Pearson wrote:
> > > I've tested the patch on my test setup - running a 'find $disk -type f' 
> > > and a cat of large files to /dev/null at the same time does indeed 
> > > reduce the size of the inode and dentry caches considerably - the first 
> > > column numbers for xfs_inode, linvfs_icache and dentry_cache in 
> > > /proc/slabinfo hover at about 400-600 (over 900000 previously).
> > > 
> > > However, is this going a bit too far the other way? When I boot the 
> > > machine with 4Gb RAM, the inode and dentry caches are squeezed to the 
> > > same amounts, but it may be the case that it would be more beneficial to 
> > > have more in the inode and dentry caches? i.e. I guess some sort of 
> > > tunable factor that limits the minimum size of the inode and dentry 
> > > caches in this case?
> > 
> > One can increase vm_vfs_scan_ratio if required, but hopefully this change
> > will benefit all workloads.
> > 
> > Andrew, Andrea, do you think of any workloads which might be hurt by this change?
> 
> I wouldn't touch the defaults, but the sysctl is there so if you've a
> strange workload you can tune for it.
> 
> There's nothing wrong with dcache/icache growing a lot.

The thing is, right now we don't try to reclaim from icache/dcache _at all_ 
if enough clean pagecache pages are found and reclaimed.

It sounds unfair to me.

> A cat of a large file is polluting the cache, so that's not a workload that should shrink
> the dcache/icache. 

Why not? If we have a lot of them they will probably be hurting performance, which seems
to be the case now.

> I'd prefer a feedback based on a real useful workload
> before even considering touching the defaults at this time.

Following this logic, any workload which generates pagecache and happens,
most of the time, to have enough clean pagecache to reclaim would never
shrink the i/dcaches. Which is not right.

But yes, feedback from other workloads is required. I'm hoping people will test 
the next 2.4.29-pre3 and report back.

So I'll probably revert the patch if any considerable regression is found. 




* Re: Reducing inode cache usage on 2.4?
  2004-12-20 12:46           ` Marcelo Tosatti
@ 2004-12-20 15:10             ` Andrea Arcangeli
  2004-12-20 15:06               ` Marcelo Tosatti
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2004-12-20 15:10 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: James Pearson, Andrew Morton, linux-kernel

On Mon, Dec 20, 2004 at 10:46:04AM -0200, Marcelo Tosatti wrote:
> On Mon, Dec 20, 2004 at 01:47:46PM +0000, James Pearson wrote:
> > I've tested the patch on my test setup - running a 'find $disk -type f' 
> > and a cat of large files to /dev/null at the same time does indeed 
> > reduce the size of the inode and dentry caches considerably - the first 
> > > column numbers for xfs_inode, linvfs_icache and dentry_cache in 
> > /proc/slabinfo hover at about 400-600 (over 900000 previously).
> > 
> > > However, is this going a bit too far the other way? When I boot the 
> > machine with 4Gb RAM, the inode and dentry caches are squeezed to the 
> > same amounts, but it may be the case that it would be more beneficial to 
> > have more in the inode and dentry caches? i.e. I guess some sort of 
> > tunable factor that limits the minimum size of the inode and dentry 
> > caches in this case?
> 
> One can increase vm_vfs_scan_ratio if required, but hopefully this change
> will benefit all workloads.
> 
> Andrew, Andrea, do you think of any workloads which might be hurt by this change?

I wouldn't touch the defaults, but the sysctl is there so if you've a
strange workload you can tune for it.

There's nothing wrong with dcache/icache growing a lot. A cat of a large
file is polluting the cache, so that's not a workload that should shrink
the dcache/icache. I'd prefer a feedback based on a real useful workload
before even considering touching the defaults at this time.


* Re: Reducing inode cache usage on 2.4?
  2004-12-20 17:54                 ` Andrea Arcangeli
@ 2004-12-20 15:43                   ` Marcelo Tosatti
  0 siblings, 0 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2004-12-20 15:43 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: James Pearson, Andrew Morton, linux-kernel

On Mon, Dec 20, 2004 at 06:54:09PM +0100, Andrea Arcangeli wrote:
> On Mon, Dec 20, 2004 at 01:06:34PM -0200, Marcelo Tosatti wrote:
> > The thing is, right now we don't try to reclaim from icache/dcache _at all_ 
> > if enough clean pagecache pages are found and reclaimed.
> > 
> > It sounds unfair to me.
> 
> If most ram is in pagecache there's not much point to shrink the dcache.
> The more ram goes into dcache/icache, the less ram will be in pagecache,
> and the more likely we'll start shrinking dcache/icache. Also keep in
> mind in a highmem machine the pagecache will be in highmemory and the
> dcache/icache in lowmemory (on very very big boxes the lowmem_reserve
> algorithm practically splits the two into non-overlapping zones), so
> especially on a big highmem machine shrinking dcache/icache during a
> pagecache allocation (because this is what the workload is doing: only
> pagecache allocations) is a worthless effort.
> 
> This is the best solution we have right now, but there have been several
> discussions in the past on how to shrink dcache/icache. But if we want
> to talk on how to change this, we should talk about 2.6/2.7 only IMHO.
> 
> > Why not? If we have a lot of them they will probably be hurting performance, which seems
> > to be the case now.
> 
> The slowdown could be because the icache/dcache hash size is too small.
> It signals collisions in the dcache/icache hashtable. 2.6 with bootmem
> allocated hashes should be better. Optimizing 2.4 for performance is not
> worth the risk IMHO. I would suggest checking if you can reproduce it in
> 2.6, and fix it there, if it's still there.
> 
> > Following this logic, any workload which generates pagecache and happens
> > to, most times, have enough pagecache clean to be reclaimed should not
> > reclaim the i/dcache's.  Which is not right.
> 
> This mostly happens for cache-polluting-workloads like in this testcase.
> If the cache were activated, there would be fewer pages in the
> inactive list and you would have a better chance of invoking the dcache/icache
> shrinking.

OK I buy your arguments I'll revert Andrew's patch.


* Re: Reducing inode cache usage on 2.4?
  2004-12-20 15:06               ` Marcelo Tosatti
@ 2004-12-20 17:54                 ` Andrea Arcangeli
  2004-12-20 15:43                   ` Marcelo Tosatti
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2004-12-20 17:54 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: James Pearson, Andrew Morton, linux-kernel

On Mon, Dec 20, 2004 at 01:06:34PM -0200, Marcelo Tosatti wrote:
> The thing is right now we don't try to reclaim from icache/dcache _at all_ 
> if enough clean pagecache pages are found and reclaimed.
> 
> It sounds unfair to me.

If most ram is in pagecache there's not much point to shrink the dcache.
The more ram goes into dcache/icache, the less ram will be in pagecache,
and the more likely we'll start shrinking dcache/icache. Also keep in
mind in a highmem machine the pagecache will be in highmemory and the
dcache/icache in lowmemory (on very very big boxes the lowmem_reserve
algorithm practically splits the two in non-overlapping zones), so
especially on a big highmem machine shrinking dcache/icache during a
pagecache allocation (because this is what the workload is doing: only
pagecache allocations) is a worthless effort.

This is the best solution we have right now, but there have been several
discussions in the past on how to shrink dcache/icache. But if we want
to talk on how to change this, we should talk about 2.6/2.7 only IMHO.

> Why not? If we have a lot of them they will probably be hurting performance, which seems
> to be the case now.

The slowdown could be because the icache/dcache hash size is too small.
It signals collisions in the dcache/icache hashtable. 2.6 with bootmem
allocated hashes should be better. Optimizing 2.4 for performance is not
worth the risk IMHO. I would suggest checking if you can reproduce it in
2.6, and fix it there, if it's still there.

> Following this logic, any workload which generates pagecache and happens
> to, most times, have enough pagecache clean to be reclaimed should not
> reclaim the i/dcache's.  Which is not right.

This mostly happens for cache-polluting-workloads like in this testcase.
If the cache were activated, there would be fewer pages in the
inactive list and you would have a better chance of invoking the dcache/icache
shrinking.
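
The hash-collision argument above can be made concrete with a quick
back-of-the-envelope check. The object count below comes from the
/proc/slabinfo output posted at the start of this thread; the bucket
count is a purely hypothetical value for illustration, since the real
inode hash size was chosen by the kernel at boot and is not given in
the thread:

```shell
# Rough load factor for the 2.4 inode hash, using the xfs_inode count
# from the /proc/slabinfo output earlier in this thread.
inodes=931428        # xfs_inode objects reported in slabinfo
buckets=131072       # assumed hash size (2^17 buckets) - illustration only
echo $((inodes / buckets))   # average chain length; anything >1 means
                             # every lookup walks a collision chain
```

With these assumed numbers the average chain is 7 entries deep, which
is the kind of collision cost Andrea is describing.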


* Re: Reducing inode cache usage on 2.4?
  2004-12-18  1:21     ` Andrew Morton
  2004-12-18 11:02       ` Marcelo Tosatti
@ 2004-12-20 19:20       ` Andrea Arcangeli
  2004-12-21 11:33         ` James Pearson
  1 sibling, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2004-12-20 19:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: James Pearson, marcelo.tosatti, linux-kernel

On Fri, Dec 17, 2004 at 05:21:04PM -0800, Andrew Morton wrote:
> James Pearson <james-p@moving-picture.com> wrote:
> >
> > It seems the inode cache has priority over cached file data.
> 
> It does.  If the machine is full of unmapped clean pagecache pages the
> kernel won't even try to reclaim inodes.  This should help a bit:
> 
> --- 24/mm/vmscan.c~a	2004-12-17 17:18:31.660254712 -0800
> +++ 24-akpm/mm/vmscan.c	2004-12-17 17:18:41.821709936 -0800
> @@ -659,13 +659,13 @@ int fastcall try_to_free_pages_zone(zone
>  
>  		do {
>  			nr_pages = shrink_caches(classzone, gfp_mask, nr_pages, &failed_swapout);
> -			if (nr_pages <= 0)
> -				return 1;
>  			shrink_dcache_memory(vm_vfs_scan_ratio, gfp_mask);
>  			shrink_icache_memory(vm_vfs_scan_ratio, gfp_mask);
>  #ifdef CONFIG_QUOTA
>  			shrink_dqcache_memory(vm_vfs_scan_ratio, gfp_mask);
>  #endif
> +			if (nr_pages <= 0)
> +				return 1;
>  			if (!failed_swapout)
>  				failed_swapout = !swap_out(classzone);
>  		} while (--tries);


I'm worried this is too aggressive by default and it may hurt stuff. The
real bug is that we don't do anything when too many collisions happen
in the hashtables. That is the thing to work on. We should free
colliding entries in the background after a 'touch' timeout. That should
work pretty well to age the dcache properly too. But the above will
just shrink everything all the time and it's going to break stuff.
For 2.6 we can talk about the background shrink based on timeout.

My only suggestion for 2.4 is to try with vm_cache_scan_ratio = 20 or
higher (or alternatively vm_mapped_ratio = 50 or = 20).  There's a
reason why everything is tunable by sysctl.
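
On a 2.4-aa kernel the suggestion translates to something like the
following sketch; the values are the ones mentioned above, but treat
them as starting points (this assumes the aa-VM sysctls are present
under /proc/sys/vm, and requires root):

```shell
# Hypothetical tuning session for the 2.4-aa VM sysctls discussed above.
# Values are starting points from the suggestion, not recommendations.
echo 20 > /proc/sys/vm/vm_cache_scan_ratio   # or higher
echo 50 > /proc/sys/vm/vm_mapped_ratio       # alternative knob (50 or 20)

# Then watch whether the caches actually shrink under load:
grep -E 'xfs_inode|dentry_cache' /proc/slabinfo
```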

I don't think vm_lru_balance_ratio is the one he's interested
in. vm_lru_balance_ratio controls how much work is being done at
every dcache/icache shrinking.

His real objective is to invoke the dcache/icache shrinking more
frequently; how much work is being done at each pass is a secondary
issue. If we don't invoke it, nothing will be shrunk, no matter what is
the value of vm_lru_balance_ratio.

Hope this helps finding an optimal tuning for the workload.


* Re: Reducing inode cache usage on 2.4?
  2004-12-20 19:20       ` Andrea Arcangeli
@ 2004-12-21 11:33         ` James Pearson
  2004-12-21 13:22           ` Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: James Pearson @ 2004-12-21 11:33 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, marcelo.tosatti, linux-kernel

Andrea Arcangeli wrote:
> 
> My only suggestion for 2.4 is to try with vm_cache_scan_ratio = 20 or
> higher (or alternatively vm_mapped_ratio = 50 or = 20).  There's a
> reason why everything is tunable by sysctl.
> 
> I don't think the vm_lru_balance_ratio is the one he's interested
> about. vm_lru_balance_ratio controls how much work is being done at
> every dcache/icache shrinking.
> 
> His real objective is to invoke the dcache/icache shrinking more
> frequently, how much work is being done at each pass is a secondary
> issue. If we don't invoke it, nothing will be shrunk, no matter what is
> the value of vm_lru_balance_ratio.
> 
> Hope this helps funding an optimal tuning for the workload.

Setting vm_mapped_ratio to 20 seems to give a 'better' memory usage 
using my very contrived test - running a find will result in about 900Mb 
of dcache/icache, but then running a cat to /dev/null will shrink the 
dcache/icache down to between 100-300Mb - running the find and cat at 
the same time results in about the same dcache/icache usage.
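
The contrived test described above can be sketched in a self-contained,
much scaled-down form; the paths are placeholders, and on a tiny tree
like this the cache effects are of course negligible -- it only
illustrates the two workloads being combined:

```shell
# Scaled-down sketch of the "find, then cat" experiment: find walks the
# tree (populating dcache/icache entries), cat streams file data
# (populating the pagecache only).
dir=$(mktemp -d)
for i in 1 2 3 4 5; do echo "data$i" > "$dir/f$i"; done
find "$dir" -type f | wc -l       # prints 5: entries walked by find
cat "$dir"/f* > /dev/null         # pagecache-only workload
rm -rf "$dir"
```

In the real test, each command would run over millions of files and a
multi-hundred-MB file while watching /proc/slabinfo.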

I'll give this a go on the production NFS server and I'll see if it 
improves things.

Thanks

James Pearson


* Re: Reducing inode cache usage on 2.4?
  2004-12-21 11:33         ` James Pearson
@ 2004-12-21 13:22           ` Andrea Arcangeli
  2004-12-21 13:59             ` James Pearson
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2004-12-21 13:22 UTC (permalink / raw)
  To: James Pearson; +Cc: Andrew Morton, marcelo.tosatti, linux-kernel

On Tue, Dec 21, 2004 at 11:33:24AM +0000, James Pearson wrote:
> Setting vm_mapped_ratio to 20 seems to give a 'better' memory usage 
> using my very contrived test - running a find will result in about 900Mb 
> of dcache/icache, but then running a cat to /dev/null will shrink the 
> dcache/icache down to between 100-300Mb - running the find and cat at 
> the same time results in about the same dcache/icache usage.
> 
> I'll give this a go on the production NFS server and I'll see if it 
> improves things.

Ok great. If 20 isn't enough, set it to 40; just be careful that if
you set it too high, the system may swap a bit too early.

Overall this is still a workaround; the real fix would be background
scanning of the icache/dcache collisions in the hash buckets, but
that's not for 2.4 ;).


* Re: Reducing inode cache usage on 2.4?
  2004-12-21 13:22           ` Andrea Arcangeli
@ 2004-12-21 13:59             ` James Pearson
  2004-12-21 14:39               ` Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: James Pearson @ 2004-12-21 13:59 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, marcelo.tosatti, linux-kernel

Andrea Arcangeli wrote:
> On Tue, Dec 21, 2004 at 11:33:24AM +0000, James Pearson wrote:
> 
>>Setting vm_mapped_ratio to 20 seems to give a 'better' memory usage 
>>using my very contrived test - running a find will result in about 900Mb 
>>of dcache/icache, but then running a cat to /dev/null will shrink the 
>>dcache/icache down to between 100-300Mb - running the find and cat at 
>>the same time results in about the same dcache/icache usage.
>>
>>I'll give this a go on the production NFS server and I'll see if it 
>>improves things.
> 
> 
> Ok great. If 20 isn't enough just set it to 40, just be careful that if
> you set it too high the system may swap a bit too early.

I've changed the value of vm_mapped_ratio to 20 - which has a default 
value of 100 - I guess you're talking about vm_cache_scan_ratio?

I've tried changing just vm_cache_scan_ratio to 20, but it doesn't seem 
to make any difference - I thought a higher vm_cache_scan_ratio value 
meant less is scanned?

James Pearson


* Re: Reducing inode cache usage on 2.4?
  2004-12-21 13:59             ` James Pearson
@ 2004-12-21 14:39               ` Andrea Arcangeli
  0 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2004-12-21 14:39 UTC (permalink / raw)
  To: James Pearson; +Cc: Andrew Morton, marcelo.tosatti, linux-kernel

On Tue, Dec 21, 2004 at 01:59:06PM +0000, James Pearson wrote:
> I've changed the value of vm_mapped_ratio to 20 - which has a default 
> value of 100 - I guess you're talking about vm_cache_scan_ratio?

yes, I was talking about vm_cache_scan_ratio, you can combine the two
sysctl together just fine.

> I've tried changing just vm_cache_scan_ratio to 20, but it doesn't seem 
> to make any difference - I though a higher vm_cache_scan_ratio value 
> meant less is scanned?

The fewer pages are scanned, the more likely you won't free enough
pagecache, and the more likely you'll shrink the dcache/icache.

I see why vm_mapped_ratio makes most of the difference, though, and
it's probably the easier fix for your problem (though increasing
vm_cache_scan_ratio sure won't make things worse).


end of thread, other threads:[~2004-12-21 14:40 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-12-17 17:26 Reducing inode cache usage on 2.4? James Pearson
2004-12-17 15:12 ` Marcelo Tosatti
2004-12-17 21:52   ` Willy Tarreau
2004-12-18  0:32   ` James Pearson
2004-12-18  1:21     ` Andrew Morton
2004-12-18 11:02       ` Marcelo Tosatti
2004-12-20 13:47         ` James Pearson
2004-12-20 12:46           ` Marcelo Tosatti
2004-12-20 15:10             ` Andrea Arcangeli
2004-12-20 15:06               ` Marcelo Tosatti
2004-12-20 17:54                 ` Andrea Arcangeli
2004-12-20 15:43                   ` Marcelo Tosatti
2004-12-20 19:20       ` Andrea Arcangeli
2004-12-21 11:33         ` James Pearson
2004-12-21 13:22           ` Andrea Arcangeli
2004-12-21 13:59             ` James Pearson
2004-12-21 14:39               ` Andrea Arcangeli
2004-12-18 15:02     ` Marcelo Tosatti

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox