Message-ID: <48C57898.1080304@gmail.com>
Date: Mon, 08 Sep 2008 22:10:16 +0300
From: Török Edwin
To: Andi Kleen
CC: Theodore Tso, Peter Zijlstra, Ingo Molnar, rml@tech9.net,
    Linux Kernel, "Thomas Gleixner mingo@redhat.com", "H. Peter Anvin"
Subject: Re: Quad core CPUs loaded at only 50% when running a CPU and
    mmap intensive multi-threaded task
In-Reply-To: <87y72k9otw.fsf@basil.nowhere.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2008-08-26 11:12, Andi Kleen wrote:
> Theodore Tso writes:
>
> As a general comment it still sounds there is a regression here?
> If the workload was faster in an earlier kernel and is now slow
> clearly something got slower? And that might be fixable.
> Perhaps something for Rafael's list?

After some more careful testing with the real program (clamd) I can say
that there is no regression. If I scan the exact same files as on the
box running 2.6.18, I get similar results; the difference is within
10% [1].

There is, however, a problem with mmap [mmap with N threads is as slow
as mmap with 1 thread, i.e. it is sequential :(], pagefaults and disk
I/O. I think I am hitting the problem described in this thread (2 years
ago!):
http://lwn.net/Articles/200215/
http://lkml.org/lkml/2006/9/19/260

It looks like such a patch is still not part of 2.6.27. What happened
to it? I will see if that patch applies to 2.6.27, and will rerun my
test with that patch applied too.

While running clamd I noticed in latencytop that, besides mmap/munmap
latencies (around 20 ms), I also get page fault latencies (again around
20 ms). So I wrote another test program [2] that walks a directory tree
and reads every file in it, once using read() and once using mmap().
It clears the cache (using echo 3 >/proc/sys/vm/drop_caches) before
each test (and I run the read test first, so if the cache weren't
cleared, mmap would be faster, not slower).

The results show that reading files using mmap() takes about the same
time regardless of how many threads I use (1, 2, 4, 8, 16), but read()
gets markedly faster as threads are added.

First let's see some numbers [3]: time to run the program on /usr/bin,
in seconds [4].

Number of CPUs: 4

Threads ->            1             2             4             8            16
Kernel version   read   mmap   read   mmap   read   mmap   read   mmap   read   mmap
2.6.27-rc5      16.70  17.01  12.86  16.26   7.31  15.16   4.01  14.93   3.79  15.40
2.6.26-1-amd64  17.90  16.95  13.30  16.18   7.31  15.34   3.87  14.96   3.86  15.89
2.6.22-3-amd64  15.12  15.41  11.98  15.17   6.36  14.29   3.15  14.61   3.08  15.44

The kernels are standard Debian kernels, except for 2.6.27-rc5, which
I've built myself (posted .config earlier in this thread).

mmap and read are about the same speed with nthreads=1, so let's see
the speedups relative to nthreads=1:

Threads ->            1             2             4             8            16
Kernel version   read   mmap   read   mmap   read   mmap   read   mmap   read   mmap
2.6.27-rc5       1.00   1.00   1.30   1.05   2.28   1.12   4.16   1.14   4.41   1.10
2.6.26-1-amd64   1.00   1.00   1.35   1.05   2.45   1.10   4.63   1.13   4.64   1.07
2.6.22-3-amd64   1.00   1.00   1.26   1.02   2.38   1.08   4.80   1.05   4.91   1.00

I was running this on a usr/bin/ directory of 372M, with an average
file size of 160K.

So mmap performance stays about the same (14% change at most)
regardless of the number of threads, while read performance *improves*
with the number of threads: it is up to 4.9 times *faster* than the
single-threaded case (2.6.22 with 16 threads).

I think what happens is the following:
- thread A opens a file with mmap and starts reading it; this generates
  page faults (which is normal for reading from an mmapped region)
- thread B opens another file with mmap, and starts reading it.
  It happened to find mmap_sem untaken, so it locks it for writing,
  makes the change, and unlocks
- thread A reads from a page that is not present, triggering a page
  fault; mmap_sem is taken for reading, and thread A waits for the page
  to be read from the disk
- thread B does the same, and takes mmap_sem for reading
- thread C creates a new mapping, and tries to take mmap_sem for
  writing; it cannot because there are readers, so it blocks waiting
- thread A finishes the pagefault, releases the mmap_sem
- thread B hasn't finished the pagefault, C is still blocked
- A encounters another pagefault and takes the mmap_sem for reading
- B finishes, and releases; C is still blocked because mmap_sem is
  taken for reading
....
C eventually takes the mmap_sem for writing, blocking A and B, who want
to read from a file
....

Even if C gets the semaphore as soon as one pagefault is done, it still
has to wait for the disk I/O for that pagefault to be completed.

Why do you need to hold the process-wide mmap_sem while waiting for the
page to be read from disk? As I understand it (I could be wrong, please
correct me!) we need to make sure that the page we are reading into
exists and doesn't change mapping details during the disk I/O read,
meaning it must not be unmapped, have its flags changed, etc.

Can't we have a per-vma lock that would ensure this?
- If a process wants to munmap something, it takes the mmap_sem, then
  the per-vma lock, removes the mapping, and releases the locks.
- If you want to mmap something, you take mmap_sem, create the mapping
  with the per-vma lock held, and release the locks.
- When a page fault needs to read from disk, it takes the mmap_sem for
  reading, finds the vma, locks the per-vma lock, and releases mmap_sem;
  does the disk I/O; makes any changes needed to the vma; then acquires
  the mmap_sem, releases the per-vma lock, and releases mmap_sem.
- Anybody else wishing to modify a vma needs to take mmap_sem, take the
  per-vma lock, release mmap_sem, make the modification, and release
  the per-vma lock.
- Anybody wishing to add/remove/reorder vmas needs to take the mmap_sem
  and the locks of all affected vmas, make the modifications, and
  release all locks.

Is there something wrong with the above approach?
From a lock contention perspective it can't be worse than holding a
single mmap_sem during the entire disk I/O.

mmap_sem would now mean: take me if you add/remove vmas, and take me
before taking a vma lock
vma lock: take me if you modify this vma, or if you need to be sure
that the vma doesn't go away

Thoughts?

[1] 2.6.18 won't boot on my box, because it doesn't recognize the ICH10
SATA controller
[2] the test program is available here:
http://edwintorok.googlepages.com/scalability.tar.gz
You just build it using 'make' (has to be GNU make), and then run
$ sh ./runtest.sh /usr/bin/ | tee log
$ sh ./postproc.sh log
[3] the raw logs are available here:
http://edwintorok.googlepages.com/logs.tar.gz
[4] The test program was run on this system:
    Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
    4 GB DDR3 RAM
    Motherboard GA-EP45T-DS3, chipset ICH10, SATA controller in AHCI mode
    HDD 6x750GB WD Caviar Black, RAID 1 for /, and RAID10 for the rest
    / has ext3
I've run the test on /mnt/bak/usr/bin, which is on /
You need to run the test program on a system with fast disks (e.g. it
doesn't really make a difference on my laptop with 5400 rpm disks).
This system has these timings:
/dev/md3:
 Timing cached reads:   11112 MB in  2.00 seconds = 5561.88 MB/sec
 Timing buffered disk reads:  296 MB in  3.00 seconds =  98.57 MB/sec

If you need more info, please ask.

Best regards,
--Edwin
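
P.S. The speedup figures can be re-derived from the raw timings in the
table above. A minimal Python sketch follows; it is not part of the
test program or postproc.sh, and the names TIMINGS and speedups are
made up here purely for illustration.

```python
# Wall-clock times in seconds, copied from the timings table in this mail.
# Layout (an assumption of this sketch): kernel -> nthreads -> (read, mmap).
TIMINGS = {
    "2.6.27-rc5": {1: (16.70, 17.01), 2: (12.86, 16.26), 4: (7.31, 15.16),
                   8: (4.01, 14.93), 16: (3.79, 15.40)},
    "2.6.26-1-amd64": {1: (17.90, 16.95), 2: (13.30, 16.18), 4: (7.31, 15.34),
                       8: (3.87, 14.96), 16: (3.86, 15.89)},
    "2.6.22-3-amd64": {1: (15.12, 15.41), 2: (11.98, 15.17), 4: (6.36, 14.29),
                       8: (3.15, 14.61), 16: (3.08, 15.44)},
}


def speedups(per_thread):
    """Return {nthreads: (read_speedup, mmap_speedup)} relative to 1 thread."""
    base_read, base_mmap = per_thread[1]
    return {
        n: (round(base_read / r, 2), round(base_mmap / m, 2))
        for n, (r, m) in sorted(per_thread.items())
    }


if __name__ == "__main__":
    for kernel, per_thread in TIMINGS.items():
        cells = ", ".join(
            f"{n}: read x{r:.2f} mmap x{m:.2f}"
            for n, (r, m) in speedups(per_thread).items()
        )
        print(f"{kernel}: {cells}")
```

On the numbers above, the largest read speedup comes out at 4.91
(2.6.22-3-amd64, 16 threads) and the largest mmap speedup at 1.14
(2.6.27-rc5, 8 threads).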