From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from mail-bn1bon0133.outbound.protection.outlook.com
	([157.56.111.133]:17680 "EHLO na01-bn1-obe.outbound.protection.outlook.com"
	rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1760610AbcBYOoA
	(ORCPT ); Thu, 25 Feb 2016 09:44:00 -0500
Message-ID: <56CF1322.2040609@hpe.com>
Date: Thu, 25 Feb 2016 09:43:46 -0500
From: Waiman Long
MIME-Version: 1.0
To: Ingo Molnar
CC: Jan Kara, Alexander Viro, Jan Kara, Jeff Layton, "J. Bruce Fields",
	Tejun Heo, Christoph Lameter, Ingo Molnar, Peter Zijlstra, Andi Kleen,
	Dave Chinner, Scott J Norton, Douglas Hatch
Subject: Re: [PATCH v3 3/3] vfs: Use per-cpu list for superblock's inode list
References: <1456254272-42313-1-git-send-email-Waiman.Long@hpe.com>
	<1456254272-42313-4-git-send-email-Waiman.Long@hpe.com>
	<20160224082840.GB10096@quack.suse.cz>
	<20160224083630.GA22868@gmail.com>
	<20160224085858.GE10096@quack.suse.cz>
	<20160225080635.GB10611@gmail.com>
In-Reply-To: <20160225080635.GB10611@gmail.com>
Content-Type: multipart/mixed; boundary="------------040703040006030204020606"
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: 

--------------040703040006030204020606
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit

On 02/25/2016 03:06 AM, Ingo Molnar wrote:
> * Jan Kara wrote:
>
>>>>> With an exit microbenchmark that creates a large number of threads,
>>>>> attaches many inodes to them, and then exits, the runtimes of that
>>>>> microbenchmark with 1000 threads before and after the patch on a
>>>>> 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
>>>>> follows:
>>>>>
>>>>>   Kernel            Elapsed Time    System Time
>>>>>   ------            ------------    -----------
>>>>>   Vanilla 4.5-rc4      65.29s         82m14s
>>>>>   Patched 4.5-rc4      22.81s         23m03s
>>>>>
>>>>> Before the patch, spinlock contention in the inode_sb_list_add()
>>>>> function during the startup phase and in the inode_sb_list_del()
>>>>> function during the exit phase accounted for about 79% and 93% of
>>>>> total CPU time respectively (as measured by perf). After the patch,
>>>>> the percpu_list_add() function consumed only about 0.04% of CPU time
>>>>> during the startup phase, and the percpu_list_del() function consumed
>>>>> about 0.4% of CPU time during the exit phase. There was still some
>>>>> spinlock contention, but it happened elsewhere.
>>>>
>>>> While looking through this patch, I have noticed that the
>>>> list_for_each_entry_safe() iterations in evict_inodes() and
>>>> invalidate_inodes() are actually unnecessary. So if you first apply the
>>>> attached patch, you don't have to implement safe iteration variants at
>>>> all.
>>>>
>>>> As a second comment, I'd note that this patch grows struct inode by 1
>>>> pointer. It is probably acceptable for large machines given the
>>>> speedup, but it should be noted in the changelog. Furthermore, for UP
>>>> or even small SMP systems this is IMHO undesired bloat since the
>>>> speedup won't be noticeable.
>>>>
>>>> So for these small systems it would be good if the per-cpu list magic
>>>> would just fall back to a singly linked list with a spinlock. Do you
>>>> think that is reasonably doable?
>>>
>>> Even many 'small' systems tend to be SMP these days.
>>
>> Yes, I know. But my tablet with 4 ARM cores is unlikely to benefit from
>> this change either. [...]
>
> I'm not sure about that at all; the above numbers show a 3x-4x speedup in
> system time, which ought to be noticeable on smaller SMP systems as well.
>
> Waiman, could you please post the microbenchmark?
>
> Thanks,
>
> 	Ingo

The microbenchmark that I used is attached. I do agree that the performance
benefit will decrease as the number of CPUs gets smaller. The system that I
used for testing has 4 sockets with 40 cores (80 threads). Dave Chinner ran
his fstests on a 16-core system (probably 2-socket), which showed a modest
improvement in performance (~4m40s vs. 4m30s in runtime).

This patch enables parallel insertion into and deletion from the inode list,
which used to be a serialized operation. So if that list operation is a
bottleneck, you will see a significant improvement. If it is not, you may not
notice much of a difference. For a single-socket, 4-core system, I agree that
the performance benefit, if any, will be limited.

Cheers,
Longman

--------------040703040006030204020606
Content-Type: text/plain;
 name="exit_test.c"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="exit_test.c"

/*
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * Authors: Waiman Long
 */
/*
 * This is an exit test
 */
#include <ctype.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

#define do_exit()	syscall(SYS_exit)
#define gettid()	syscall(SYS_gettid)
#define MAX_THREADS	2048

static inline void cpu_relax(void)
{
	__asm__ __volatile__("rep;nop": : :"memory");
}

static inline void atomic_inc(volatile int *v)
{
	__asm__ __volatile__("lock incl %0": "+m" (*v));
}

static volatile int exit_now  = 0;
static volatile int threadcnt = 0;

/*
 * Walk the /proc/<tid> filesystem to make the threads fill the dentry cache
 */
static void walk_procfs(void)
{
	char cmdbuf[256];
	pid_t tid = gettid();

	snprintf(cmdbuf, sizeof(cmdbuf), "find /proc/%d > /dev/null 2>&1", tid);
	if (system(cmdbuf) < 0)
		perror("system() failed!");
}

static void *exit_thread(void *dummy)
{
	long tid = (long)dummy;

	walk_procfs();
	atomic_inc(&threadcnt);
	/*
	 * Busy wait until the do_exit flag is set and then call exit
	 */
	while (!exit_now)
		sleep(1);
	do_exit();
	return NULL;	/* Not reached */
}

static void exit_test(int threads)
{
	pthread_t thread[threads];
	long i = 0, finish;
	time_t start = time(NULL);

	while (i++ < threads) {
		if (pthread_create(thread + i - 1, NULL, exit_thread,
				   (void *)i)) {
			perror("pthread_create");
			exit(1);
		}
#if 0
		/*
		 * Pipelining to reduce contention & improve speed
		 */
		if ((i & 0xf) == 0)
			while (i - threadcnt > 12)
				usleep(1);
#endif
	}
	while (threadcnt != threads)
		usleep(1);
	walk_procfs();
	printf("Setup time = %lus\n", time(NULL) - start);
	printf("Process ready to exit!\n");
	kill(0, SIGKILL);
	exit(0);
}

int main(int argc, char *argv[])
{
	int   tcnt;	/* Thread count */
	char *cmd = argv[0];

	if ((argc != 2) || !isdigit(argv[1][0])) {
		fprintf(stderr, "Usage: %s <thread count>\n", cmd);
		exit(1);
	}
	tcnt = strtoul(argv[1], NULL, 10);
	if (tcnt > MAX_THREADS) {
		fprintf(stderr, "Error: thread count should be <= %d\n",
			MAX_THREADS);
		exit(1);
	}
	exit_test(tcnt);
	return 0;	/* Not reachable */
}

--------------040703040006030204020606--
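
For readers following the thread, below is a rough userspace sketch of the
per-cpu list idea being discussed (the percpu_list_add()/percpu_list_del()
calls mentioned above): each CPU gets its own list head protected by its own
lock, so insertions and deletions issued from different CPUs mostly take
different locks instead of contending on one global sb->s_inode_list_lock.
This is only an illustration under assumed names (pcpu_list, pcpu_list_add,
pcpu_list_del, NR_CPUS fixed at 4) with pthread mutexes standing in for
kernel spinlocks; it is not the actual kernel implementation from the patch.

#include <pthread.h>

#define NR_CPUS 4	/* assumed CPU count, just for the sketch */

struct pcpu_node {
	struct pcpu_node *next, *prev;
	int cpu;			/* which per-cpu list this node is on */
};

struct pcpu_list {
	struct {
		struct pcpu_node head;	/* circular doubly linked list head */
		pthread_mutex_t  lock;	/* stand-in for the per-cpu spinlock */
	} percpu[NR_CPUS];
};

static void pcpu_list_init(struct pcpu_list *l)
{
	for (int c = 0; c < NR_CPUS; c++) {
		l->percpu[c].head.next = &l->percpu[c].head;
		l->percpu[c].head.prev = &l->percpu[c].head;
		pthread_mutex_init(&l->percpu[c].lock, NULL);
	}
}

/* Add to "this CPU's" list; only threads on the same CPU contend. */
static void pcpu_list_add(struct pcpu_list *l, struct pcpu_node *n, int cpu)
{
	struct pcpu_node *head = &l->percpu[cpu].head;

	pthread_mutex_lock(&l->percpu[cpu].lock);
	n->cpu = cpu;
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
	pthread_mutex_unlock(&l->percpu[cpu].lock);
}

/* Delete from whichever per-cpu list the node was added to. */
static void pcpu_list_del(struct pcpu_list *l, struct pcpu_node *n)
{
	pthread_mutex_lock(&l->percpu[n->cpu].lock);
	n->prev->next = n->next;
	n->next->prev = n->prev;
	pthread_mutex_unlock(&l->percpu[n->cpu].lock);
}

int main(void)
{
	static struct pcpu_list list;
	struct pcpu_node a, b;

	pcpu_list_init(&list);
	pcpu_list_add(&list, &a, 0);	/* as if added from CPU 0 */
	pcpu_list_add(&list, &b, 1);	/* as if added from CPU 1 */
	pcpu_list_del(&list, &a);
	pcpu_list_del(&list, &b);
	return 0;
}

Note that deletion has to know which per-cpu list a node lives on, which is
why the sketch records a cpu index in each node; the extra pointer that Jan
pointed out in struct inode presumably serves a similar per-node bookkeeping
purpose in the real patch.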