* Re: Reducing inode cache usage on 2.4?
2004-12-17 17:26 Reducing inode cache usage on 2.4? James Pearson
@ 2004-12-17 15:12 ` Marcelo Tosatti
2004-12-17 21:52 ` Willy Tarreau
2004-12-18 0:32 ` James Pearson
0 siblings, 2 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2004-12-17 15:12 UTC (permalink / raw)
To: James Pearson; +Cc: linux-kernel
Hi James,
On Fri, Dec 17, 2004 at 05:26:20PM +0000, James Pearson wrote:
> I have an NFS server with 1Gb RAM running a 2.4.26 kernel with 2 XFS
> file systems with about 2 million files in total.
>
> Occasionally I get reports that the server is 'sticky' (slow
> read/writes) and the inode cache appears to consume most of the
> available memory and doesn't appear to reduce - a typical /proc/slabinfo
> output is below.
>
> If I run a simple application that grabs memory on the server, the inode
> and other caches are reduced and the server becomes more responsive
> (i.e. data rates to/from the server are restored to 'normal').
>
> Is there any way I can purge the cached inode data, or any kernel
> parameters I can tweak to limit the inode cache or flush it more frequently?
>
> Or am I looking in completely the wrong place i.e. the inode cache is
> not the problem?
No, in your case the extreme inode/dcache sizes indeed seem to be a problem.
The default kernel shrinking ratio can be tuned for enhanced reclaim efficiency.
> xfs_inode 931428 931428 408 103492 103492 1 : 124 62
> dentry_cache 499222 518850 128 17295 17295 1 : 252 126
vm_vfs_scan_ratio:
------------------
is what proportion of the VFS queues we will scan in one go.
A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the
unused-inode, dentry and dquot caches will be freed during a
normal aging round.
Big fileservers (NFS, SMB etc.) probably want to set this
value to 3 or 2.
The default value is 6.
=============================================================
Tune /proc/sys/vm/vm_vfs_scan_ratio increasing the value to 10 and so on and
examine the results.
* Reducing inode cache usage on 2.4?
@ 2004-12-17 17:26 James Pearson
2004-12-17 15:12 ` Marcelo Tosatti
0 siblings, 1 reply; 18+ messages in thread
From: James Pearson @ 2004-12-17 17:26 UTC (permalink / raw)
To: linux-kernel
I have an NFS server with 1Gb RAM running a 2.4.26 kernel with 2 XFS
file systems with about 2 million files in total.
Occasionally I get reports that the server is 'sticky' (slow
read/writes) and the inode cache appears to consume most of the
available memory and doesn't appear to reduce - a typical /proc/slabinfo
output is below.
If I run a simple application that grabs memory on the server, the inode
and other caches are reduced and the server becomes more responsive
(i.e. data rates to/from the server are restored to 'normal').
Is there any way I can purge the cached inode data, or any kernel
parameters I can tweak to limit the inode cache or flush it more frequently?
Or am I looking in completely the wrong place i.e. the inode cache is
not the problem?
Thanks
James Pearson
/proc/slabinfo:
slabinfo - version: 1.1 (SMP)
kmem_cache 104 104 148 4 4 1 : 252 126
nfs_write_data 0 0 352 0 0 1 : 124 62
nfs_read_data 0 0 352 0 0 1 : 124 62
nfs_page 0 0 96 0 0 1 : 252 126
ip_fib_hash 10 226 32 2 2 1 : 252 126
clip_arp_cache 0 0 128 0 0 1 : 252 126
ip_mrt_cache 0 0 96 0 0 1 : 252 126
tcp_tw_bucket 40 40 96 1 1 1 : 252 126
tcp_bind_bucket 143 226 32 2 2 1 : 252 126
tcp_open_request 59 59 64 1 1 1 : 252 126
inet_peer_cache 55 236 64 4 4 1 : 252 126
ip_dst_cache 520 520 192 26 26 1 : 252 126
arp_cache 47 210 128 7 7 1 : 252 126
blkdev_requests 5120 5160 96 129 129 1 : 252 126
xfs_chashlist 35838 40560 20 240 240 1 : 252 126
xfs_ili 6664 8652 140 309 309 1 : 252 126
xfs_ifork 0 0 56 0 0 1 : 252 126
xfs_efi_item 15 15 260 1 1 1 : 124 62
xfs_efd_item 15 15 260 1 1 1 : 124 62
xfs_buf_item 130 130 148 5 5 1 : 252 126
xfs_dabuf 202 202 16 1 1 1 : 252 126
xfs_da_state 0 0 336 0 0 1 : 124 62
xfs_trans 81 143 596 9 11 2 : 124 62
xfs_inode 931428 931428 408 103492 103492 1 : 124 62
xfs_btree_cur 58 58 132 2 2 1 : 252 126
xfs_bmap_free_item 252 253 12 1 1 1 : 252 126
page_buf_t 200 200 192 10 10 1 : 252 126
linvfs_icache 931425 931425 352 84675 84675 1 : 124 62
dnotify_cache 0 0 20 0 0 1 : 252 126
file_lock_cache 80 80 96 2 2 1 : 252 126
fasync_cache 0 0 16 0 0 1 : 252 126
uid_cache 8 113 32 1 1 1 : 252 126
skbuff_head_cache 673 680 192 34 34 1 : 252 126
sock 75 75 1216 25 25 1 : 60 30
sigqueue 58 58 132 2 2 1 : 252 126
kiobuf 0 0 64 0 0 1 : 252 126
cdev_cache 11 177 64 3 3 1 : 252 126
bdev_cache 5 118 64 2 2 1 : 252 126
mnt_cache 19 177 64 3 3 1 : 252 126
inode_cache 217 217 512 31 31 1 : 124 62
dentry_cache 499222 518850 128 17295 17295 1 : 252 126
dquot 0 0 128 0 0 1 : 252 126
filp 486 600 128 20 20 1 : 252 126
names_cache 3 3 4096 3 3 1 : 60 30
buffer_head 31305 34400 96 860 860 1 : 252 126
mm_struct 120 120 160 5 5 1 : 252 126
vm_area_struct 861 880 96 22 22 1 : 252 126
fs_cache 177 177 64 3 3 1 : 252 126
files_cache 63 63 416 7 7 1 : 124 62
signal_act 72 72 1312 24 24 1 : 60 30
size-131072(DMA) 0 0 131072 0 0 32 : 0 0
size-131072 0 0 131072 0 0 32 : 0 0
size-65536(DMA) 0 0 65536 0 0 16 : 0 0
size-65536 0 0 65536 0 0 16 : 0 0
size-32768(DMA) 0 0 32768 0 0 8 : 0 0
size-32768 24 24 32768 24 24 8 : 0 0
size-16384(DMA) 0 0 16384 0 0 4 : 0 0
size-16384 16 18 16384 16 18 4 : 0 0
size-8192(DMA) 0 0 8192 0 0 2 : 0 0
size-8192 7 8 8192 7 8 2 : 0 0
size-4096(DMA) 0 0 4096 0 0 1 : 60 30
size-4096 385 385 4096 385 385 1 : 60 30
size-2048(DMA) 0 0 2048 0 0 1 : 60 30
size-2048 1952 1952 2048 976 976 1 : 60 30
size-1024(DMA) 0 0 1024 0 0 1 : 124 62
size-1024 476 476 1024 119 119 1 : 124 62
size-512(DMA) 0 0 512 0 0 1 : 124 62
size-512 344 344 512 43 43 1 : 124 62
size-256(DMA) 0 0 256 0 0 1 : 252 126
size-256 892 1335 256 89 89 1 : 252 126
size-128(DMA) 0 0 128 0 0 1 : 252 126
size-128 4087 8130 128 271 271 1 : 252 126
size-64(DMA) 0 0 64 0 0 1 : 252 126
size-64 65813 90683 64 1537 1537 1 : 252 126
size-32(DMA) 0 0 32 0 0 1 : 252 126
size-32 421038 421038 32 3726 3726 1 : 252 126
/proc/meminfo:
total: used: free: shared: buffers: cached:
Mem: 1057779712 1034821632 22958080 0 36864 136249344
Swap: 2147459072 2015232 2145443840
MemTotal: 1032988 kB
MemFree: 22420 kB
MemShared: 0 kB
Buffers: 36 kB
Cached: 132032 kB
SwapCached: 1024 kB
Active: 29204 kB
Inactive: 113520 kB
HighTotal: 131072 kB
HighFree: 7864 kB
LowTotal: 901916 kB
LowFree: 14556 kB
SwapTotal: 2097128 kB
SwapFree: 2095160 kB
* Re: Reducing inode cache usage on 2.4?
2004-12-17 15:12 ` Marcelo Tosatti
@ 2004-12-17 21:52 ` Willy Tarreau
2004-12-18 0:32 ` James Pearson
1 sibling, 0 replies; 18+ messages in thread
From: Willy Tarreau @ 2004-12-17 21:52 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: James Pearson, linux-kernel
Hi Marcelo,
On Fri, Dec 17, 2004 at 01:12:28PM -0200, Marcelo Tosatti wrote:
(...)
> The default kernel shrinking ratio can be tuned for enhanced reclaim efficiency.
Thanks for explaining this. Up to now, after several runs of find or
other FS-intensive tasks, I have often launched a simple home-made program
which I tell how much memory to allocate (and touch); it then exits, freeing
that amount of memory. A bit dangerous, but really effective indeed!
I too will try playing with vm_vfs_scan_ratio; it seems appealing.
Cheers,
Willy
* Re: Reducing inode cache usage on 2.4?
2004-12-17 15:12 ` Marcelo Tosatti
2004-12-17 21:52 ` Willy Tarreau
@ 2004-12-18 0:32 ` James Pearson
2004-12-18 1:21 ` Andrew Morton
2004-12-18 15:02 ` Marcelo Tosatti
1 sibling, 2 replies; 18+ messages in thread
From: James Pearson @ 2004-12-18 0:32 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: linux-kernel
Marcelo Tosatti wrote:
>
>>Or am I looking in completely the wrong place i.e. the inode cache is
>>not the problem?
>
>
> No, in your case the extreme inode/dcache sizes indeed seem to be a problem.
>
> The default kernel shrinking ratio can be tuned for enhanced reclaim efficiency.
>
>
>>xfs_inode 931428 931428 408 103492 103492 1 : 124 62
>>dentry_cache 499222 518850 128 17295 17295 1 : 252 126
>
>
> vm_vfs_scan_ratio:
> ------------------
> is what proportion of the VFS queues we will scan in one go.
> A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the
> unused-inode, dentry and dquot caches will be freed during a
> normal aging round.
> Big fileservers (NFS, SMB etc.) probably want to set this
> value to 3 or 2.
>
> The default value is 6.
> =============================================================
>
> Tune /proc/sys/vm/vm_vfs_scan_ratio increasing the value to 10 and so on and
> examine the results.
Thanks for the info - but doesn't increasing the value of
vm_vfs_scan_ratio mean that less of the caches will be freed?
Doing a few tests (on another test file system with 2 million or so
files and 1Gb of memory) running 'find $disk -type f', with
vm_vfs_scan_ratio set to 6 (or 10), the first two column values for
xfs_inode, linvfs_icache and dentry_cache in /proc/slabinfo reach about
900000 and stay around that value; but with vm_vfs_scan_ratio set to 1,
each value still reaches 900000, then falls to a few thousand, climbs
back up to 900000, drops away again, and repeats.
This still happens when I cat many large files (100Mb) to /dev/null at
the same time as running the find, i.e. the inode caches can still reach
90% of the memory before being reclaimed (with vm_vfs_scan_ratio set to 1).
If I stop the find process when the inode caches reach about 90% of the
memory, and then start cat'ing the large files, it appears the inode
caches are never reclaimed (or at least not within the time it takes to
cat 100Gb of data to /dev/null) - is this expected behaviour?
It seems the inode cache has priority over cached file data.
What triggers the 'normal ageing round'? Is it possible to trigger this
earlier (at a lower memory usage), or give a higher priority to cached data?
Thanks
James Pearson
* Re: Reducing inode cache usage on 2.4?
2004-12-18 0:32 ` James Pearson
@ 2004-12-18 1:21 ` Andrew Morton
2004-12-18 11:02 ` Marcelo Tosatti
2004-12-20 19:20 ` Andrea Arcangeli
2004-12-18 15:02 ` Marcelo Tosatti
1 sibling, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2004-12-18 1:21 UTC (permalink / raw)
To: James Pearson; +Cc: marcelo.tosatti, linux-kernel
James Pearson <james-p@moving-picture.com> wrote:
>
> It seems the inode cache has priority over cached file data.
It does. If the machine is full of unmapped clean pagecache pages the
kernel won't even try to reclaim inodes. This should help a bit:
--- 24/mm/vmscan.c~a 2004-12-17 17:18:31.660254712 -0800
+++ 24-akpm/mm/vmscan.c 2004-12-17 17:18:41.821709936 -0800
@@ -659,13 +659,13 @@ int fastcall try_to_free_pages_zone(zone
do {
nr_pages = shrink_caches(classzone, gfp_mask, nr_pages, &failed_swapout);
- if (nr_pages <= 0)
- return 1;
shrink_dcache_memory(vm_vfs_scan_ratio, gfp_mask);
shrink_icache_memory(vm_vfs_scan_ratio, gfp_mask);
#ifdef CONFIG_QUOTA
shrink_dqcache_memory(vm_vfs_scan_ratio, gfp_mask);
#endif
+ if (nr_pages <= 0)
+ return 1;
if (!failed_swapout)
failed_swapout = !swap_out(classzone);
} while (--tries);
_
> What triggers the 'normal ageing round'? Is it possible to trigger this
> earlier (at a lower memory usage), or give a higher priority to cached data?
You could also try lowering /proc/sys/vm/vm_mapped_ratio. That will cause
inodes to be reaped more easily, but will also cause more swapout.
* Re: Reducing inode cache usage on 2.4?
2004-12-18 1:21 ` Andrew Morton
@ 2004-12-18 11:02 ` Marcelo Tosatti
2004-12-20 13:47 ` James Pearson
2004-12-20 19:20 ` Andrea Arcangeli
1 sibling, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2004-12-18 11:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: James Pearson, linux-kernel
James,
Can you apply Andrew's patch and examine the results?
I've merged it to mainline because it looks sensible.
Thanks Andrew!
On Fri, Dec 17, 2004 at 05:21:04PM -0800, Andrew Morton wrote:
> James Pearson <james-p@moving-picture.com> wrote:
> >
> > It seems the inode cache has priority over cached file data.
>
> It does. If the machine is full of unmapped clean pagecache pages the
> kernel won't even try to reclaim inodes. This should help a bit:
>
> --- 24/mm/vmscan.c~a 2004-12-17 17:18:31.660254712 -0800
> +++ 24-akpm/mm/vmscan.c 2004-12-17 17:18:41.821709936 -0800
> @@ -659,13 +659,13 @@ int fastcall try_to_free_pages_zone(zone
>
> do {
> nr_pages = shrink_caches(classzone, gfp_mask, nr_pages, &failed_swapout);
> - if (nr_pages <= 0)
> - return 1;
> shrink_dcache_memory(vm_vfs_scan_ratio, gfp_mask);
> shrink_icache_memory(vm_vfs_scan_ratio, gfp_mask);
> #ifdef CONFIG_QUOTA
> shrink_dqcache_memory(vm_vfs_scan_ratio, gfp_mask);
> #endif
> + if (nr_pages <= 0)
> + return 1;
> if (!failed_swapout)
> failed_swapout = !swap_out(classzone);
> } while (--tries);
> _
>
>
> > What triggers the 'normal ageing round'? Is it possible to trigger this
> > earlier (at a lower memory usage), or give a higher priority to cached data?
>
> You could also try lowering /proc/sys/vm/vm_mapped_ratio. That will cause
> inodes to be reaped more easily, but will also cause more swapout.
* Re: Reducing inode cache usage on 2.4?
2004-12-18 0:32 ` James Pearson
2004-12-18 1:21 ` Andrew Morton
@ 2004-12-18 15:02 ` Marcelo Tosatti
1 sibling, 0 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2004-12-18 15:02 UTC (permalink / raw)
To: James Pearson; +Cc: linux-kernel, Andrew Morton
On Sat, Dec 18, 2004 at 12:32:54AM +0000, James Pearson wrote:
> Marcelo Tosatti wrote:
> >
> >>Or am I looking in completely the wrong place i.e. the inode cache is
> >>not the problem?
> >
> >
> >No, in your case the extreme inode/dcache sizes indeed seem to be a
> >problem.
> >The default kernel shrinking ratio can be tuned for enhanced reclaim
> >efficiency.
> >
> >
> >>xfs_inode 931428 931428 408 103492 103492 1 : 124 62
> >>dentry_cache 499222 518850 128 17295 17295 1 : 252 126
> >
> >
> >vm_vfs_scan_ratio:
> >------------------
> >is what proportion of the VFS queues we will scan in one go.
> >A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the
> >unused-inode, dentry and dquot caches will be freed during a
> >normal aging round.
> >Big fileservers (NFS, SMB etc.) probably want to set this
> >value to 3 or 2.
> >
> >The default value is 6.
> >=============================================================
> >
> >Tune /proc/sys/vm/vm_vfs_scan_ratio increasing the value to 10 and so on
> >and examine the results.
>
> Thanks for the info - but doesn't increasing the value of
> vm_vfs_scan_ratio mean that less of the caches will be freed?
Right - what I said was wrong; it's the other way around:
Decreasing the value increases the percentage of the VFS caches scanned at each "aging pass".
Now Andrew has changed the aging round itself.
Quoting him: "If the machine is full of unmapped clean pagecache pages the kernel
won't even try to reclaim inodes".
vm_vfs_scan_ratio is now more meaningful.
kswapd is awakened as soon as a zone's low watermark is reached, and will
work to free pages until it reaches the zone's high watermark.
There are three zones: DMA (1) , Normal (2) and Highmem (3).
* On machines where it is needed (eg PCs) we divide physical memory
* into multiple physical zones. On a PC we have 3 zones:
*
* ZONE_DMA < 16 MB ISA DMA capable memory
* ZONE_NORMAL 16-896 MB direct mapped by the kernel
* ZONE_HIGHMEM > 896 MB only page cache and user processes
So these thresholds are used to calculate each zone's min, low and high
watermarks with the following calculation (mm/page_alloc.c):
mask = (realsize / zone_balance_ratio[j]);
if (mask < zone_balance_min[j])
mask = zone_balance_min[j];
else if (mask > zone_balance_max[j])
mask = zone_balance_max[j];
zone->watermarks[j].min = mask;
zone->watermarks[j].low = mask*2;
zone->watermarks[j].high = mask*3;
To trigger the normal aging round earlier, the "low" watermark would have to be
increased; but you had better increase the "high" watermark, which makes kswapd
work longer, until that higher free-page watermark is reached. One can try, for example:
zone->watermarks[j].high = mask*4
But hopefully you won't need such a modification (it would be nice if these were
all boot-configurable, BTW) with Andrew's change.
* Re: Reducing inode cache usage on 2.4?
2004-12-20 13:47 ` James Pearson
@ 2004-12-20 12:46 ` Marcelo Tosatti
2004-12-20 15:10 ` Andrea Arcangeli
0 siblings, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2004-12-20 12:46 UTC (permalink / raw)
To: James Pearson; +Cc: Andrew Morton, linux-kernel, andrea
On Mon, Dec 20, 2004 at 01:47:46PM +0000, James Pearson wrote:
> I've tested the patch on my test setup - running a 'find $disk -type f'
> and a cat of large files to /dev/null at the same time does indeed
> reduce the size of the inode and dentry caches considerably - the first
> column numbers for xfs_inode, linvfs_icache and dentry_cache in
> /proc/slabinfo hover at about 400-600 (over 900000 previously).
>
> However, is this going a bit too far the other way? When I boot the
> machine with 4Gb RAM, the inode and dentry caches are squeezed to the
> same amounts, but it may be the case that it would be more beneficial to
> have more in the inode and dentry caches? i.e. I guess some sort of
> tunable factor that limits the minimum size of the inode and dentry
> caches in this case?
One can increase vm_vfs_scan_ratio if required, but hopefully this change
will benefit all workloads.
Andrew, Andrea, can you think of any workloads which might be hurt by this change?
> But saying that, I notice my 'find $disk -type f' (with about 2 million
> files) runs a lot faster with the smaller inode/dentry caches - about 1
> or 2 minutes with the patched kernel compared with about 5 to 7 minutes
> with the unpatched kernel - I guess it was taking longer to search the
> inode/dentry cache than reading direct from disk.
Wonderful.
>
> James Pearson
>
> Marcelo Tosatti wrote:
> >James,
> >
> >Can you apply Andrew's patch and examine the results?
> >
> >I've merged it to mainline because it looks sensible.
> >
> >Thanks Andrew!
> >
> >On Fri, Dec 17, 2004 at 05:21:04PM -0800, Andrew Morton wrote:
> >
> >>James Pearson <james-p@moving-picture.com> wrote:
> >>
> >>>It seems the inode cache has priority over cached file data.
> >>
> >>It does. If the machine is full of unmapped clean pagecache pages the
> >>kernel won't even try to reclaim inodes. This should help a bit:
> >>
> >>--- 24/mm/vmscan.c~a 2004-12-17 17:18:31.660254712 -0800
> >>+++ 24-akpm/mm/vmscan.c 2004-12-17 17:18:41.821709936 -0800
> >>@@ -659,13 +659,13 @@ int fastcall try_to_free_pages_zone(zone
> >>
> >> do {
> >> nr_pages = shrink_caches(classzone, gfp_mask,
> >> nr_pages, &failed_swapout);
> >>- if (nr_pages <= 0)
> >>- return 1;
> >> shrink_dcache_memory(vm_vfs_scan_ratio, gfp_mask);
> >> shrink_icache_memory(vm_vfs_scan_ratio, gfp_mask);
> >>#ifdef CONFIG_QUOTA
> >> shrink_dqcache_memory(vm_vfs_scan_ratio, gfp_mask);
> >>#endif
> >>+ if (nr_pages <= 0)
> >>+ return 1;
> >> if (!failed_swapout)
> >> failed_swapout = !swap_out(classzone);
> >> } while (--tries);
> >>_
> >>
> >>
> >>
> >>>What triggers the 'normal ageing round'? Is it possible to trigger this
> >>>earlier (at a lower memory usage), or give a higher priority to cached
> >>>data?
> >>
> >>You could also try lowering /proc/sys/vm/vm_mapped_ratio. That will cause
> >>inodes to be reaped more easily, but will also cause more swapout.
> >
> >
* Re: Reducing inode cache usage on 2.4?
2004-12-18 11:02 ` Marcelo Tosatti
@ 2004-12-20 13:47 ` James Pearson
2004-12-20 12:46 ` Marcelo Tosatti
0 siblings, 1 reply; 18+ messages in thread
From: James Pearson @ 2004-12-20 13:47 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Andrew Morton, linux-kernel
I've tested the patch on my test setup - running a 'find $disk -type f'
and a cat of large files to /dev/null at the same time does indeed
reduce the size of the inode and dentry caches considerably - the first
column numbers for xfs_inode, linvfs_icache and dentry_cache in
/proc/slabinfo hover at about 400-600 (over 900000 previously).
However, is this going a bit too far the other way? When I boot the
machine with 4Gb RAM, the inode and dentry caches are squeezed to the
same amounts, but it may be the case that it would be more beneficial to
have more in the inode and dentry caches? i.e. I guess some sort of
tunable factor that limits the minimum size of the inode and dentry
caches in this case?
But saying that, I notice my 'find $disk -type f' (with about 2 million
files) runs a lot faster with the smaller inode/dentry caches - about 1
or 2 minutes with the patched kernel compared with about 5 to 7 minutes
with the unpatched kernel - I guess it was taking longer to search the
inode/dentry cache than reading direct from disk.
James Pearson
Marcelo Tosatti wrote:
> James,
>
> Can you apply Andrew's patch and examine the results?
>
> I've merged it to mainline because it looks sensible.
>
> Thanks Andrew!
>
> On Fri, Dec 17, 2004 at 05:21:04PM -0800, Andrew Morton wrote:
>
>>James Pearson <james-p@moving-picture.com> wrote:
>>
>>>It seems the inode cache has priority over cached file data.
>>
>>It does. If the machine is full of unmapped clean pagecache pages the
>>kernel won't even try to reclaim inodes. This should help a bit:
>>
>>--- 24/mm/vmscan.c~a 2004-12-17 17:18:31.660254712 -0800
>>+++ 24-akpm/mm/vmscan.c 2004-12-17 17:18:41.821709936 -0800
>>@@ -659,13 +659,13 @@ int fastcall try_to_free_pages_zone(zone
>>
>> do {
>> nr_pages = shrink_caches(classzone, gfp_mask, nr_pages, &failed_swapout);
>>- if (nr_pages <= 0)
>>- return 1;
>> shrink_dcache_memory(vm_vfs_scan_ratio, gfp_mask);
>> shrink_icache_memory(vm_vfs_scan_ratio, gfp_mask);
>> #ifdef CONFIG_QUOTA
>> shrink_dqcache_memory(vm_vfs_scan_ratio, gfp_mask);
>> #endif
>>+ if (nr_pages <= 0)
>>+ return 1;
>> if (!failed_swapout)
>> failed_swapout = !swap_out(classzone);
>> } while (--tries);
>>_
>>
>>
>>
>>> What triggers the 'normal ageing round'? Is it possible to trigger this
>>> earlier (at a lower memory usage), or give a higher priority to cached data?
>>
>>You could also try lowering /proc/sys/vm/vm_mapped_ratio. That will cause
>>inodes to be reaped more easily, but will also cause more swapout.
>
>
* Re: Reducing inode cache usage on 2.4?
2004-12-20 15:10 ` Andrea Arcangeli
@ 2004-12-20 15:06 ` Marcelo Tosatti
2004-12-20 17:54 ` Andrea Arcangeli
0 siblings, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2004-12-20 15:06 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: James Pearson, Andrew Morton, linux-kernel
On Mon, Dec 20, 2004 at 04:10:45PM +0100, Andrea Arcangeli wrote:
> On Mon, Dec 20, 2004 at 10:46:04AM -0200, Marcelo Tosatti wrote:
> > On Mon, Dec 20, 2004 at 01:47:46PM +0000, James Pearson wrote:
> > > I've tested the patch on my test setup - running a 'find $disk -type f'
> > > and a cat of large files to /dev/null at the same time does indeed
> > > reduce the size of the inode and dentry caches considerably - the first
> > > column numbers for xfs_inode, linvfs_icache and dentry_cache in
> > > /proc/slabinfo hover at about 400-600 (over 900000 previously).
> > >
> > > However, is this going a bit too far the other way? When I boot the
> > > machine with 4Gb RAM, the inode and dentry caches are squeezed to the
> > > same amounts, but it may be the case that it would be more beneficial to
> > > have more in the inode and dentry caches? i.e. I guess some sort of
> > > tunable factor that limits the minimum size of the inode and dentry
> > > caches in this case?
> >
> > One can increase vm_vfs_scan_ratio if required, but hopefully this change
> > will benefit all workloads.
> >
> > Andrew, Andrea, can you think of any workloads which might be hurt by this change?
>
> I wouldn't touch the defaults, but the sysctl is there so if you've a
> strange workload you can tune for it.
>
> There's nothing wrong with dcache/icache growing a lot.
The thing is, right now we don't try to reclaim from the icache/dcache _at all_
if enough clean pagecache pages are found and reclaimed.
It sounds unfair to me.
> A cat of a large file is polluting the cache, so that's not a workload that should shrink
> the dcache/icache.
Why not? If we have a lot of them they will probably be hurting performance, which seems
to be the case now.
> I'd prefer a feedback based on a real useful workload
> before even considering touching the defaults at this time.
Following this logic, any workload which generates pagecache and happens, most of
the time, to have enough clean pagecache to be reclaimed would never reclaim the
i/dcaches. Which is not right.
But yes, feedback based on other workloads is required. I'm hoping people will test
the upcoming 2.4.29-pre3 and send feedback.
So I'll probably revert the patch if any considerable regression is found.
* Re: Reducing inode cache usage on 2.4?
2004-12-20 12:46 ` Marcelo Tosatti
@ 2004-12-20 15:10 ` Andrea Arcangeli
2004-12-20 15:06 ` Marcelo Tosatti
0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2004-12-20 15:10 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: James Pearson, Andrew Morton, linux-kernel
On Mon, Dec 20, 2004 at 10:46:04AM -0200, Marcelo Tosatti wrote:
> On Mon, Dec 20, 2004 at 01:47:46PM +0000, James Pearson wrote:
> > I've tested the patch on my test setup - running a 'find $disk -type f'
> > and a cat of large files to /dev/null at the same time does indeed
> > reduce the size of the inode and dentry caches considerably - the first
> > column numbers for xfs_inode, linvfs_icache and dentry_cache in
> > /proc/slabinfo hover at about 400-600 (over 900000 previously).
> >
> > However, is this going a bit too far the other way? When I boot the
> > machine with 4Gb RAM, the inode and dentry caches are squeezed to the
> > same amounts, but it may be the case that it would be more beneficial to
> > have more in the inode and dentry caches? i.e. I guess some sort of
> > tunable factor that limits the minimum size of the inode and dentry
> > caches in this case?
>
> One can increase vm_vfs_scan_ratio if required, but hopefully this change
> will benefit all workloads.
>
> Andrew, Andrea, can you think of any workloads which might be hurt by this change?
I wouldn't touch the defaults, but the sysctl is there so if you've a
strange workload you can tune for it.
There's nothing wrong with dcache/icache growing a lot. A cat of a large
file is polluting the cache, so that's not a workload that should shrink
the dcache/icache. I'd prefer a feedback based on a real useful workload
before even considering touching the defaults at this time.
* Re: Reducing inode cache usage on 2.4?
2004-12-20 17:54 ` Andrea Arcangeli
@ 2004-12-20 15:43 ` Marcelo Tosatti
0 siblings, 0 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2004-12-20 15:43 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: James Pearson, Andrew Morton, linux-kernel
On Mon, Dec 20, 2004 at 06:54:09PM +0100, Andrea Arcangeli wrote:
> On Mon, Dec 20, 2004 at 01:06:34PM -0200, Marcelo Tosatti wrote:
> > The thing is, right now we don't try to reclaim from the icache/dcache _at all_
> > if enough clean pagecache pages are found and reclaimed.
> >
> > It sounds unfair to me.
>
> If most ram is in pagecache there's not much point to shrink the dcache.
> The more ram goes into dcache/icache, the less ram will be in pagecache,
> and the more likely we'll start shrinking dcache/icache. Also keep in
> mind in a highmem machine the pagecache will be in highmemory and the
> dcache/icache in lowmemory (on very very big boxes the lowmem_reserve
> algorithm practically splits the two into non-overlapping zones), so
> especially on a big highmem machine shrinking dcache/icache during a
> pagecache allocation (because this is what the workload is doing: only
> pagecache allocations) is a worthless effort.
>
> This is the best solution we have right now, but there have been several
> discussions in the past on how to shrink dcache/icache. But if we want
> to talk on how to change this, we should talk about 2.6/2.7 only IMHO.
>
> > Why not? If we have a lot of them they will probably be hurting performance, which seems
> > to be the case now.
>
> The slowdown could be because the icache/dcache hash size is too small.
> It signals collisions in the dcache/icache hashtable. 2.6 with bootmem
> allocated hashes should be better. Optimizing 2.4 for performance is not
> worth the risk IMHO. I would suggest checking whether you can reproduce this in
> 2.6, and fixing it there if it's still present.
>
> > Following this logic, any workload which generates pagecache and happens,
> > most of the time, to have enough clean pagecache to be reclaimed would never
> > reclaim the i/dcaches. Which is not right.
>
> This mostly happens with cache-polluting workloads like this test case.
> If the cache pages were activated, there would be fewer pages on the
> inactive list and a better chance of invoking the dcache/icache
> shrinking.
OK I buy your arguments I'll revert Andrew's patch.
* Re: Reducing inode cache usage on 2.4?
2004-12-20 15:06 ` Marcelo Tosatti
@ 2004-12-20 17:54 ` Andrea Arcangeli
2004-12-20 15:43 ` Marcelo Tosatti
0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2004-12-20 17:54 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: James Pearson, Andrew Morton, linux-kernel
On Mon, Dec 20, 2004 at 01:06:34PM -0200, Marcelo Tosatti wrote:
> The thing is, right now we don't try to reclaim from the icache/dcache _at all_
> if enough clean pagecache pages are found and reclaimed.
>
> It sounds unfair to me.
If most ram is in pagecache there's not much point to shrink the dcache.
The more ram goes into dcache/icache, the less ram will be in pagecache,
and the more likely we'll start shrinking dcache/icache. Also keep in
mind in a highmem machine the pagecache will be in highmemory and the
dcache/icache in lowmemory (on very very big boxes the lowmem_reserve
algorithm practically splits the two into non-overlapping zones), so
especially on a big highmem machine shrinking dcache/icache during a
pagecache allocation (because this is what the workload is doing: only
pagecache allocations) is a worthless effort.
This is the best solution we have right now, but there have been several
discussions in the past on how to shrink dcache/icache. But if we want
to talk on how to change this, we should talk about 2.6/2.7 only IMHO.
> Why not? If we have a lot of them they will probably be hurting performance, which seems
> to be the case now.
The slowdown could be because the icache/dcache hash size is too small.
It signals collisions in the dcache/icache hashtable. 2.6 with bootmem
allocated hashes should be better. Optimizing 2.4 for performance is not
worth the risk IMHO. I would suggest checking whether you can reproduce this in
2.6, and fixing it there if it's still present.
> Following this logic, any workload which generates pagecache and happens,
> most of the time, to have enough clean pagecache to be reclaimed would never
> reclaim the i/dcaches. Which is not right.
This mostly happens with cache-polluting workloads like this test case.
If the cache pages were activated, there would be fewer pages on the
inactive list and a better chance of invoking the dcache/icache
shrinking.
* Re: Reducing inode cache usage on 2.4?
2004-12-18 1:21 ` Andrew Morton
2004-12-18 11:02 ` Marcelo Tosatti
@ 2004-12-20 19:20 ` Andrea Arcangeli
2004-12-21 11:33 ` James Pearson
1 sibling, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2004-12-20 19:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: James Pearson, marcelo.tosatti, linux-kernel
On Fri, Dec 17, 2004 at 05:21:04PM -0800, Andrew Morton wrote:
> James Pearson <james-p@moving-picture.com> wrote:
> >
> > It seems the inode cache has priority over cached file data.
>
> It does. If the machine is full of unmapped clean pagecache pages the
> kernel won't even try to reclaim inodes. This should help a bit:
>
> --- 24/mm/vmscan.c~a 2004-12-17 17:18:31.660254712 -0800
> +++ 24-akpm/mm/vmscan.c 2004-12-17 17:18:41.821709936 -0800
> @@ -659,13 +659,13 @@ int fastcall try_to_free_pages_zone(zone
>
> do {
> nr_pages = shrink_caches(classzone, gfp_mask, nr_pages, &failed_swapout);
> - if (nr_pages <= 0)
> - return 1;
> shrink_dcache_memory(vm_vfs_scan_ratio, gfp_mask);
> shrink_icache_memory(vm_vfs_scan_ratio, gfp_mask);
> #ifdef CONFIG_QUOTA
> shrink_dqcache_memory(vm_vfs_scan_ratio, gfp_mask);
> #endif
> + if (nr_pages <= 0)
> + return 1;
> if (!failed_swapout)
> failed_swapout = !swap_out(classzone);
> } while (--tries);
I'm worried this is too aggressive by default and it may hurt stuff. The
real bug is that we don't do anything when too many collisions happen
in the hashtables. That is the thing to work on. We should free
colliding entries in the background after a 'touch' timeout. That should
work pretty well to age the dcache properly too. But the above will
just shrink everything all the time and it's going to break stuff.
For 2.6 we can talk about the background shrink based on timeout.
My only suggestion for 2.4 is to try with vm_cache_scan_ratio = 20 or
higher (or alternatively vm_mapped_ratio = 50 or = 20). There's a
reason why everything is tunable by sysctl.
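As a concrete sketch of the tuning being suggested (sysctl names as in the 2.4-aa VM; the values are the ones discussed in this thread, and sensible settings depend on the workload):

```shell
# Scan a smaller fraction of the inactive list per pass, so a
# pagecache-only scan is less likely to satisfy the allocation
# on its own and the VFS shrink path gets invoked more often:
echo 20 > /proc/sys/vm/vm_cache_scan_ratio

# Lower from the default of 100 to go after the caches sooner
# (the thread found 20 effective; too low risks early swapping):
echo 20 > /proc/sys/vm/vm_mapped_ratio

# Make each VFS shrink pass free a larger fraction (1/ratio) of
# the unused-inode, dentry and dquot caches:
echo 2 > /proc/sys/vm/vm_vfs_scan_ratio
```

These take effect immediately but do not survive a reboot; persist them via the distribution's sysctl configuration if they help.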
I don't think vm_lru_balance_ratio is the one he's interested
in. vm_lru_balance_ratio controls how much work is done at
every dcache/icache shrinking pass.
His real objective is to invoke the dcache/icache shrinking more
frequently; how much work is done at each pass is a secondary
issue. If we don't invoke it, nothing will be shrunk, no matter the
value of vm_lru_balance_ratio.
Hope this helps in finding an optimal tuning for the workload.
* Re: Reducing inode cache usage on 2.4?
2004-12-20 19:20 ` Andrea Arcangeli
@ 2004-12-21 11:33 ` James Pearson
2004-12-21 13:22 ` Andrea Arcangeli
0 siblings, 1 reply; 18+ messages in thread
From: James Pearson @ 2004-12-21 11:33 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, marcelo.tosatti, linux-kernel
Andrea Arcangeli wrote:
>
> My only suggestion for 2.4 is to try with vm_cache_scan_ratio = 20 or
> higher (or alternatively vm_mapped_ratio = 50 or = 20). There's a
> reason why everything is tunable by sysctl.
>
> I don't think vm_lru_balance_ratio is the one he's interested
> in. vm_lru_balance_ratio controls how much work is done at
> every dcache/icache shrinking pass.
>
> His real objective is to invoke the dcache/icache shrinking more
> frequently, how much work is being done at each pass is a secondary
> issue. If we don't invoke it, nothing will be shrunk, no matter what is
> the value of vm_lru_balance_ratio.
>
> Hope this helps in finding an optimal tuning for the workload.
Setting vm_mapped_ratio to 20 seems to give a 'better' memory usage
using my very contrived test - running a find will result in about 900Mb
of dcache/icache, but then running a cat to /dev/null will shrink the
dcache/icache down to between 100-300Mb - running the find and cat at
the same time results in about the same dcache/icache usage.
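For anyone wanting to reproduce this contrived test, it amounts to the following (paths are placeholders for the exported filesystems):

```shell
# 1. Populate the dcache/icache by walking every file on the volume:
find /path/to/exported/fs -xdev > /dev/null

# Watch the caches grow:
grep -E 'inode|dentry' /proc/slabinfo

# 2. Generate pagecache pressure with streaming reads; with
#    vm_mapped_ratio lowered, reclaim should now also shrink the
#    dcache/icache rather than only recycling clean pagecache:
cat /path/to/some/large/files/* > /dev/null

# Compare the object counts after the reads:
grep -E 'inode|dentry' /proc/slabinfo
```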
I'll give this a go on the production NFS server and I'll see if it
improves things.
Thanks
James Pearson
* Re: Reducing inode cache usage on 2.4?
2004-12-21 11:33 ` James Pearson
@ 2004-12-21 13:22 ` Andrea Arcangeli
2004-12-21 13:59 ` James Pearson
0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2004-12-21 13:22 UTC (permalink / raw)
To: James Pearson; +Cc: Andrew Morton, marcelo.tosatti, linux-kernel
On Tue, Dec 21, 2004 at 11:33:24AM +0000, James Pearson wrote:
> Setting vm_mapped_ratio to 20 seems to give a 'better' memory usage
> using my very contrived test - running a find will result in about 900Mb
> of dcache/icache, but then running a cat to /dev/null will shrink the
> dcache/icache down to between 100-300Mb - running the find and cat at
> the same time results in about the same dcache/icache usage.
>
> I'll give this a go on the production NFS server and I'll see if it
> improves things.
Ok great. If 20 isn't enough just set it to 40, but be careful: if
you set it too high the system may start swapping a bit too early.
Overall this is still a workaround; the real fix would be background
scanning of the icache/dcache collisions in the hash buckets, but that's
not for 2.4 ;).
* Re: Reducing inode cache usage on 2.4?
2004-12-21 13:22 ` Andrea Arcangeli
@ 2004-12-21 13:59 ` James Pearson
2004-12-21 14:39 ` Andrea Arcangeli
0 siblings, 1 reply; 18+ messages in thread
From: James Pearson @ 2004-12-21 13:59 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, marcelo.tosatti, linux-kernel
Andrea Arcangeli wrote:
> On Tue, Dec 21, 2004 at 11:33:24AM +0000, James Pearson wrote:
>
>>Setting vm_mapped_ratio to 20 seems to give a 'better' memory usage
>>using my very contrived test - running a find will result in about 900Mb
>>of dcache/icache, but then running a cat to /dev/null will shrink the
>>dcache/icache down to between 100-300Mb - running the find and cat at
>>the same time results in about the same dcache/icache usage.
>>
>>I'll give this a go on the production NFS server and I'll see if it
>>improves things.
>
>
> Ok great. If 20 isn't enough just set it to 40, just be careful that if
> you set it too high the system may swap a bit too early.
I've changed the value of vm_mapped_ratio to 20 - which has a default
value of 100 - I guess you're talking about vm_cache_scan_ratio?
I've tried changing just vm_cache_scan_ratio to 20, but it doesn't seem
to make any difference - I thought a higher vm_cache_scan_ratio value
meant less is scanned?
James Pearson
* Re: Reducing inode cache usage on 2.4?
2004-12-21 13:59 ` James Pearson
@ 2004-12-21 14:39 ` Andrea Arcangeli
0 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2004-12-21 14:39 UTC (permalink / raw)
To: James Pearson; +Cc: Andrew Morton, marcelo.tosatti, linux-kernel
On Tue, Dec 21, 2004 at 01:59:06PM +0000, James Pearson wrote:
> I've changed the value of vm_mapped_ratio to 20 - which has a default
> value of 100 - I guess you're talking about vm_cache_scan_ratio?
yes, I was talking about vm_cache_scan_ratio; you can combine the two
sysctls just fine.
> I've tried changing just vm_cache_scan_ratio to 20, but it doesn't seem
> to make any difference - I though a higher vm_cache_scan_ratio value
> meant less is scanned?
The fewer pages are scanned, the more likely you won't free enough
pagecache, and the more likely you'll shrink the dcache/icache.
I see why vm_mapped_ratio makes most of the difference, though, and it's
probably the easier fix for your problem (though increasing
vm_cache_scan_ratio certainly won't make things worse).
end of thread, other threads:[~2004-12-21 14:40 UTC | newest]
Thread overview: 18+ messages
2004-12-17 17:26 Reducing inode cache usage on 2.4? James Pearson
2004-12-17 15:12 ` Marcelo Tosatti
2004-12-17 21:52 ` Willy Tarreau
2004-12-18 0:32 ` James Pearson
2004-12-18 1:21 ` Andrew Morton
2004-12-18 11:02 ` Marcelo Tosatti
2004-12-20 13:47 ` James Pearson
2004-12-20 12:46 ` Marcelo Tosatti
2004-12-20 15:10 ` Andrea Arcangeli
2004-12-20 15:06 ` Marcelo Tosatti
2004-12-20 17:54 ` Andrea Arcangeli
2004-12-20 15:43 ` Marcelo Tosatti
2004-12-20 19:20 ` Andrea Arcangeli
2004-12-21 11:33 ` James Pearson
2004-12-21 13:22 ` Andrea Arcangeli
2004-12-21 13:59 ` James Pearson
2004-12-21 14:39 ` Andrea Arcangeli
2004-12-18 15:02 ` Marcelo Tosatti