From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from holomorphy.com ([207.189.100.168]:48519 "EHLO holomorphy.com")
	by vger.kernel.org with ESMTP id S266199AbUHLArF (ORCPT );
	Wed, 11 Aug 2004 20:47:05 -0400
Date: Wed, 11 Aug 2004 17:46:54 -0700
From: William Lee Irwin III
Subject: Re: clear_user_highpage()
Message-ID: <20040812004654.GX11200@holomorphy.com>
References: <20040811161537.5e24c2b6.davem@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 
To: Linus Torvalds
Cc: "David S. Miller" , linux-arch@vger.kernel.org
List-ID: 

On Wed, 11 Aug 2004, David S. Miller wrote:
>> The PPC people used to zero out pages in the cpu idle loop
>> and I'd definitely like to do something along those lines
>> on sparc64 as well, I feel it would be extremely effective.

On Wed, Aug 11, 2004 at 04:46:10PM -0700, Linus Torvalds wrote:
> No. It sucks. It sucks so bad it's not funny.
> It sucks because it eats CPU and memory bandwidth when it shouldn't be
> eaten. It's a total disaster on SMP, but it's bad on UP too.

Results from prototype prezeroing patches (ca. 2001) showed that
dedicating a cpu on a 16x machine to prezeroing userspace pages (doing
no other work on that cpu) improved kernel compile (insert sound of
projectile vomiting here) "benchmarks". This suggests cache pollution
and scheduling latency can be circumvented under some circumstances.

On Wed, Aug 11, 2004 at 04:46:10PM -0700, Linus Torvalds wrote:
> It sucks because it does bad things to cache behaviour. Sure, you'll move
> the cost away from "clear_user_highpage", but the thing is, you will _not_
> move it into the idle time. What you will do is to move it into some
> random time _after_ the idle time, when the idle thing has crapped all
> over your caches.
> The thing is, you make your cache footprint per CPU _much_ bigger, and you
> spread it out a lot over time too, so you make it even worse.
Uncached zeroing, dedicated cpus, or appropriate cache semantics (e.g.
not allocating a cacheline, either via some special instruction or by
the cache in general not allocating lines on some writes and/or on
zeroing writes that miss) negate this.

On Wed, Aug 11, 2004 at 04:46:10PM -0700, Linus Torvalds wrote:
> The clearing will then be totally hidden in the profiles, because you will
> have turned a nice and well-behaved "this is where the time goes" profile
> into a mush of "we're taking cache misses at random times, and we don't
> know why".
> That, btw, is a _classic_ mistake in profiling. Move the work around so
> that it's not as visible any more.
> In other words, don't do it. It's a mistake. It is optimizing the profile
> without actually optimizing what you want _done_.
> Btw, this is exactly what the totally brain-damaged slab stuff does. It
> takes away the peaks, but does so by having worse cache access patterns
> all around.

I beg to differ; where slab preconstruction has not been effective, the
cause has been the heaviness of the slab allocator itself, and where the
slab allocator is circumvented, preconstruction is effective even in
cases where the allocator is otherwise too heavyweight. Zeroing
pagetables is in fact the poster child for this: almost all
architectures have cached prezeroed pagetables forever. Reinstating
caching of i386 pagetables improved SDET performance by a consistent
(and hence statistically significant) margin of 1%-1.5%.

One of the key aspects of an access pattern that makes preconstruction
useful is that very little of the allocated memory is actually touched
during typical accesses. Hence, the construction of the object pollutes
the cache with numerous cachelines that are rarely touched. Objects as
large as pages, e.g. pagetable pages, show this very well. Typical usage
of the upper levels is sparse, and for smaller processes the lower
levels are also sparsely-used. Userspace likewise can't be assumed to
reference an entire zeroed page allocated to it.
Userspace access patterns can't be predicted, but there, too, it is
typical for only small portions of large data structures to be
referenced; e.g. a large, say, PAGE_SIZE buffer is allocated for read()
traffic, but all typical read()'s are only a few bytes in length.

And in general the "precharging" stalls, taking unnecessary misses on
the cachelines of the object that are rarely accessed; pollutes the
cache with those same rarely-accessed cachelines; and burns a few extra
cycles (dwarfed by the misses on the unnecessarily-touched cachelines)
doing an unnecessary pass over the object.

On Wed, Aug 11, 2004 at 04:46:10PM -0700, Linus Torvalds wrote:
> Look at it this way:
>  - it might be worth doing in big batches under some kind of user control,
>    when you really can _control_ that it happens at a good time.
>    I _might_ buy into this argument. Make it a batch thing that really
>    screws the caches, but only does so very seldom, when the user asked
>    for it.
>  - but we aren't supposed to have that much memory free _anyway_, and
>    trying to keep it around on a separate list is horrible for
>    fragmentation. So batching huge things up is likely not a good idea
>    either.
>  - with caches growing larger, it's actually BETTER to clear the page at
>    usage time, because then the CPU that actually touches the page won't
>    have to bring the page in from memory. We'll blow one page of cache
>    by clearing it, but we will blow it in a "good" way - hopefully with
>    almost no memory traffic at all (ie the clear can be done as pure
>    invalidate cycles, no read-back into the CPU).
> And the thing is, the background clearing will just get worse and worse.
> In summary: it's a _good_ thing when you see a sharp peak in your
> profiles, and you can say "I know exactly what that peak is for, and it's
> doing exactly the work it should be doing and nothing else".
The real flaws I see in background zeroing are fragmentation and
scheduling latency (or the outright loss of cpus dedicated to the
purpose). Preventing cache pollution is already a prerequisite for any
remotely non-naive implementation.

The scheduling-latency aspect arises because many cpus can only avoid
cache pollution via extremely slow uncached accesses, and page zeroing
is a slow enough operation to noticeably stall rescheduling of
userspace. It's possible that this could be mitigated by zeroing pages
incrementally and polling TIF_NEED_RESCHED between blocks of a page,
but the background zeroing efforts went in a rather different, useless
direction (dedicating cpus).

The fragmentation bits are just as you say: an artifact of segregating
a pool of pages from the general pool of free pages that can be
coalesced. I haven't come up with any methods to address this.

In general, I despise background processing and would rather see
event-driven methods of accomplishing preconstruction, though I've no
idea whatsoever how those would be carried out for userspace memory.

-- wli