From: Eric Dumazet <dada1@cosmosbay.com>
To: Christoph Lameter <clameter@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Matt Mackall <mpm@selenic.com>, "Rafael J. Wysocki" <rjw@sisk.pl>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]
Date: Fri, 14 Dec 2007 08:16:34 +0100
Message-ID: <47622DD2.10405@cosmosbay.com>
In-Reply-To: <Pine.LNX.4.64.0712131954440.30721@schroedinger.engr.sgi.com>

Christoph Lameter wrote:
> On Sat, 8 Dec 2007, Ingo Molnar wrote:
> 
>>> Good. Although we should perhaps look at that reported performance 
>>> problem with SLUB. It looks like SLUB will do a memclear() for the 
>>> area twice (first for the whole page, then for the thing it allocated) 
>>> for the slow case. Maybe that exacerbates the problem.
>> i dont think the SLUB problem could be explained purely via a double 
>> memset(). [which ought to be extremely fast anyway] We are talking about 
>> a 10 times slowdown on a 64-way box of a workload that is fairly 
>> common-sense. (tasks sending messages to each other via bog standard 
>> means)
>>
>> while i dont want to jump to conclusions without looking at some 
>> profiles, i think the SLUB performance regression is indicative of the 
>> following fallacy: "SLAB can be done significantly simpler while keeping 
>> the same performance".
> 
> Well this is double crap. First of all SLUB does not do memclear twice. 
> There is no reason to assume that SLUB has the problem just because SLOB 
> had that. A "fix" for that nonexistent problem went into Linus' tree. WTH 
> is going on?
> 
> SLUB was done because of a series of problems with the basic concepts of 
> SLAB that threaten its usability in the future.
>  
>> I couldnt point to any particular aspect of SLAB that i could 
>> characterise as "needless bloat".
> 
> I agree, SLAB's architecture is pretty tight and I was one of those who 
> helped it along to be that way.
> 
> However, SLAB is just fundamentally wrong for today's machines. The key 
> problem today is cacheline fetch latency and that problem will increase 
> significantly in the future. Sure under some circumstances that exploit 
> the fact that SLAB sometimes gets its guesses on the cpu cache right SLAB 
> can still win but the more processors and nodes we get the more it will 
> become difficult to keep SLAB around and the more it will become 
> difficult to establish what cachelines are in the cpu cache.
> 
>> I think we should we make SLAB the default for v2.6.24 ...
> 
> If you guarantee that all the regressions of SLAB vs. SLUB are addressed 
> then that's fine but AFAICT that is not possible.
> 
> Here is a list of some of the benefits of SLUB just in case we forgot:
> 
> 
> - SLUB is performance wise much faster than SLAB. This can be more than a
>   factor of 10 (case of concurrent allocations / frees on multiple
>   processors). See http://lkml.org/lkml/2007/10/27/245
> 
> - Single threaded allocation speed is up to double that of SLAB
> 
> - Remote freeing of objects in a NUMA system is typically 30% faster.
> 
> - Debugging on SLAB is difficult. Requires recompile of the kernel
>   and the resulting output is difficult to interpret. SLUB can apply
>   debugging options to a subset of the slabcaches in order to allow
>   the system to work with maximum speed. This is necessary to detect
>   difficult to reproduce race conditions.
> 
> - SLAB can capture huge amounts of memory in its queues. The problem
>   gets worse the more processors and NUMA nodes are in the system. The 
>   amount of memory limits the number of per cpu objects one can configure.
> 
> - SLAB requires a pass through all slab caches every 2 seconds to
>   expire objects. This is a problem both for realtime and MPI jobs
>   that cannot take such a processor outage.
> 
> - SLAB does not have a sophisticated slabinfo tool to report the
>   state of slab objects on the system. SLUB's tool can provide
>   details of object use.
> 
> - SLAB requires the update of two words for freeing
>   and allocation. SLUB can do that by updating a single
>   word which allows to avoid enabling and disabling interrupts if
>   the processor supports an atomic instruction for that purpose.
>   This is important for realtime kernels where special measures
>   may have to be implemented if one wants to disable interrupts.
> 
> - SLAB requires memory to be set aside for queues (processors
>   times number of slabs times queue size). SLUB requires none of that.
> 
> - SLUB merges slab caches with similar characteristics to
>   reduce the memory footprint even further.
> 
> - SLAB performs object level NUMA management which creates
>   considerable allocator complexity. SLUB manages NUMA on the level of
>   slab pages, reducing object management overhead.
> 
> - SLUB allows remote node defragmentation to avoid the buildup
>   of large partial lists on a single node.
> 
> - SLUB can actively reduce the fragmentation of slabs through
>   slab cache specific callbacks (not merged yet)
> 
> - SLUB has resiliency features that allow it to isolate a problem
>   object and continue after diagnostics have been performed.
> 
> - SLUB creates rarely used DMA caches on demand instead of creating
>   them all on bootup (SLAB).
> 

Yes, SLUB should be the way to go, but some issues are not yet solved.

I had to switch back to SLAB on a production NUMA server with 2 nodes and 8 GB 
of RAM. It uses a lot of sockets, so a large part of memory was used by the kernel.

The SLUB kernel was hitting OOM after 2 or 3 days of uptime.
The SLAB kernel never hit this.

Unfortunately I don't have a test machine to reproduce the setup.

Maybe the problem is not related to SLUB at all, but an underlying VM/NUMA bug.

/proc/buddyinfo showed that:

Node 0 contained two zones (DMA and DMA32), 4 GB in total
Node 1 contained one zone (Normal), 4 GB in total

So Node 0 contained no Normal zone.
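
For anyone who wants to check the same thing on their own box, the zone
layout per node can be pulled out of /proc/buddyinfo with a few lines of
Python. This is only a sketch: the SAMPLE text is made up to mirror the
layout described above (the free-block counts are invented, it is not the
output of the server), and zones_by_node is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical sample in /proc/buddyinfo format. The counts are invented;
# only the node/zone layout mirrors the machine described above.
SAMPLE = """\
Node 0, zone      DMA      2      1      1      0      2      1      1      0      1      1      3
Node 0, zone    DMA32    120     80     40     20     10      5      2      1      1      0      0
Node 1, zone   Normal    300    150     60     30     12      6      3      1      0      0      0
"""

def zones_by_node(text):
    """Map node id -> list of zone names, in file order."""
    zones = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: "Node", "<id>,", "zone", "<name>", counts...
        if len(fields) < 4 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node = int(fields[1].rstrip(","))
        zones[node].append(fields[3])
    return dict(zones)

layout = zones_by_node(SAMPLE)
for node in sorted(layout):
    print(f"Node {node}: {', '.join(layout[node])}")

# Nodes that have no Normal zone at all.
nodes_without_normal = [n for n in sorted(layout) if "Normal" not in layout[n]]
print("Nodes without a Normal zone:", nodes_without_normal)
```

On a live system, replace SAMPLE with the contents of /proc/buddyinfo.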

Part of /proc/meminfo:

Slab:          3338512 kB
SReclaimable:   789716 kB
SUnreclaim:    2548796 kB
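
To put these three numbers in proportion, here is a small Python sketch
that parses them and relates them to each other and to total RAM. The
parse_meminfo helper is a hypothetical name, and TOTAL_RAM_KB assumes that
"8GB" means 8 GiB, which the numbers above do not confirm.

```python
# The three slab lines quoted above, as they appear in /proc/meminfo.
MEMINFO = """\
Slab:          3338512 kB
SReclaimable:   789716 kB
SUnreclaim:    2548796 kB
"""

def parse_meminfo(text):
    """Return {field: value_in_kB} for lines shaped like 'Name:  123 kB'."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values

info = parse_meminfo(MEMINFO)
slab = info["Slab"]
unreclaim = info["SUnreclaim"]

# Slab is the sum of the reclaimable and unreclaimable parts.
assert slab == info["SReclaimable"] + unreclaim

TOTAL_RAM_KB = 8 * 1024 * 1024  # assumption: "8GB" means 8 GiB

print(f"unreclaimable share of slab: {unreclaim / slab:.1%}")
print(f"slab share of RAM:           {slab / TOTAL_RAM_KB:.1%}")
```

So about three quarters of the slab memory is unreclaimable, and slab as a
whole accounts for roughly 40% of RAM under that assumption.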

I remember network interrupts were taken by CPU 1, so most allocations were 
done by CPU 1 (node 1), and many frees were done on CPU 0.

Hope this helps

