From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261995AbVF1JUM (ORCPT ); Tue, 28 Jun 2005 05:20:12 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S262031AbVF1JUB (ORCPT ); Tue, 28 Jun 2005 05:20:01 -0400 Received: from Nazgul.esiway.net ([193.194.16.154]:26551 "EHLO Nazgul.esiway.net") by vger.kernel.org with ESMTP id S261870AbVF1JPj (ORCPT ); Tue, 28 Jun 2005 05:15:39 -0400 Subject: Re: Tracking down a memory leak From: Marco Colombo To: linux-kernel@vger.kernel.org In-Reply-To: <1119263592.31049.19.camel@Frodo.esi> References: <1119263592.31049.19.camel@Frodo.esi> Content-Type: text/plain Organization: ESI srl Date: Tue, 28 Jun 2005 11:18:02 +0200 Message-Id: <1119950283.26948.271.camel@Frodo.esi> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-4) Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 2005-06-20 at 12:33 +0200, Marco Colombo wrote: > Hi, > today I've found a server in a OOM condition, the funny thing is that > after some investigation I've found no process that has mem allocated > to. I even switched to single user, here's what I've found: [...] > total used free shared buffers cached > Mem: 1035812 898524 137288 0 3588 16732 > -/+ buffers/cache: 878204 157608 > Swap: 1049248 788 1048460 > sh-2.05b# uptime > 12:13:28 up 35 days, 1:48, 0 users, load average: 0.00, 0.59, 16.13 > sh-2.05b# uname -a > Linux xxxx.example.org 2.6.10-1.12_FC2.marco #1 Mon Feb 7 14:53:42 CET 2005 > i686 athlon i386 GNU/Linux > > I know this is an old Fedora Core 2 kernel, eventually I'll bring the > issue on thier lists. An upgrade has already been scheduled for this > host, so I'm not really pressed in tracking this specific bug (unless it > occurs on the new system, of course). > > Anyway, I just wonder if generally there's a way to find out where those > 850+ MBs are allocated. Since there are no big user processes, I'm > assuming it's a memory leak in kernel space. I'm curious, this is the > first time I see something like this. Any suggestion what to look at > besides 'ps' and 'free'? > > The server has been mainly running PostgreSQL at a fairly high load for > the last 35 days, BTW. > > TIA, > .TM. Thanks to everybody who replied to me. Here's more data: sh-2.05b# sort -rn +1 /proc/slabinfo | head -5 biovec-1 7502216 7502296 16 226 1 : tunables 120 60 0 : slabdata 33196 33196 0 bio 7502216 7502262 96 41 1 : tunables 120 60 0 : slabdata 182982 182982 0 size-64 4948 5307 64 61 1 : tunables 120 60 0 : slabdata 87 87 0 buffer_head 3691 3750 52 75 1 : tunables 120 60 0 : slabdata 50 50 0 dentry_cache 2712 2712 164 24 1 : tunables 120 60 0 : slabdata 113 113 0 I've found no way do free that memory, so decided to reboot it. In the following days I had been monitoring the system after upgrading to kernel-2.6.10-1.770_FC2. Here are the results I got, day by day: bio 115333 115333 96 41 1 : tunables 120 60 0 : slabdata 2813 2813 0 biovec-1 115322 115486 16 226 1 : tunables 120 60 0 : slabdata 511 511 0 biovec-1 325006 325440 16 226 1 : tunables 120 60 0 : slabdata 1440 1440 0 bio 324987 325212 96 41 1 : tunables 120 60 0 : slabdata 7930 7932 0 bio 538535 538535 96 41 1 : tunables 120 60 0 : slabdata 13135 13135 0 biovec-1 538528 538784 16 226 1 : tunables 120 60 0 : slabdata 2384 2384 0 bio 749870 750218 96 41 1 : tunables 120 60 0 : slabdata 18296 18298 0 biovec-1 749886 750772 16 226 1 : tunables 120 60 0 : slabdata 3322 3322 0 bio 960630 960630 96 41 1 : tunables 120 60 0 : slabdata 23430 23430 0 biovec-1 960642 960726 16 226 1 : tunables 120 60 0 : slabdata 4251 4251 0 bio 1170079 1170345 96 41 1 : tunables 120 60 0 : slabdata 28543 28545 0 biovec-1 1170066 1170906 16 226 1 : tunables 120 60 0 : slabdata 5181 5181 0 bio 1379857 1380019 96 41 1 : tunables 120 60 0 : slabdata 33658 33659 0 biovec-1 1379854 1380408 16 226 1 : tunables 120 60 0 : slabdata 6108 6108 0 Clearly, something was going on. So I decided to run a vanilla kernel instead. I'm running 2.6.12.1 right now, and after about one day of uptime: bio 345 369 96 41 1 : tunables 120 60 0 : slabdata 9 9 0 biovec-1 376 678 16 226 1 : tunables 120 60 0 : slabdata 3 3 0 which seem to me sane values (and they stay like that as far as I can see). No more daily increase of more than 200,000. I'll keep an eye on it in the next days, but I think 2.6.12.1 is not affected. .TM. -- ____/ ____/ / / / / Marco Colombo ___/ ___ / / Technical Manager / / / ESI s.r.l. _____/ _____/ _/ Colombo@ESI.it