Message-ID: <56C4946B.10102@hpe.com>
Date: Wed, 17 Feb 2016 10:40:27 -0500
From: Waiman Long
MIME-Version: 1.0
To: Ingo Molnar
CC: Alexander Viro, Jan Kara, Jeff Layton, "J. Bruce Fields", Tejun Heo,
    Christoph Lameter, Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
    Scott J Norton, Douglas Hatch, Linus Torvalds, Andrew Morton,
    Peter Zijlstra, Thomas Gleixner
Subject: Re: [RFC PATCH 2/2] vfs: Use per-cpu list for superblock's inode list
References: <1455672680-7153-1-git-send-email-Waiman.Long@hpe.com>
 <1455672680-7153-3-git-send-email-Waiman.Long@hpe.com>
 <20160217071632.GA18403@gmail.com>
In-Reply-To: <20160217071632.GA18403@gmail.com>
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org

On 02/17/2016 02:16 AM, Ingo Molnar wrote:
> * Waiman Long wrote:
>
>> When many threads are trying to add inodes to or delete inodes from a
>> superblock's s_inodes list, spinlock contention on the list can become
>> a performance bottleneck.
>>
>> This patch changes the s_inodes field to become a per-cpu list with
>> per-cpu spinlocks.
>>
>> The test case is an exit microbenchmark that creates a large number of
>> threads, attaches many inodes to them and then exits. The runtimes of
>> that microbenchmark with 1000 threads before and after the patch on a
>> 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
>> follows:
>>
>>    Kernel             Elapsed Time    System Time
>>    ------             ------------    -----------
>>    Vanilla 4.5-rc4       65.29s          82m14s
>>    Patched 4.5-rc4       22.81s          23m03s
>>
>> Before the patch, spinlock contention in the inode_sb_list_add()
>> function during the startup phase and in the inode_sb_list_del()
>> function during the exit phase accounted for about 79% and 93% of
>> total CPU time respectively (as measured by perf). After the patch,
>> the percpu_list_add() function consumed only about 0.04% of CPU time
>> during the startup phase, and the percpu_list_del() function consumed
>> about 0.4% of CPU time during the exit phase. There was still some
>> spinlock contention, but it happened elsewhere.
>
> Pretty impressive IMHO!
>
> Just for the record, here's your former 'batched list' number inserted
> into the above table:
>
>    Kernel                    Elapsed Time    System Time
>    ------                    ------------    -----------
>    Vanilla      [v4.5-rc4]      65.29s          82m14s
>    batched list [v4.4]          45.69s          49m44s
>    percpu list  [v4.5-rc4]      22.81s          23m03s
>
> i.e. the proper per-CPU data structure and the resulting improvement in
> cache locality gave another doubling in performance.
>
> Just out of curiosity, could you post the profile of the latest patches
> - is there any (bigger) SMP overhead left, or is the profile pretty
> flat now?
>
> Thanks,
>
>     Ingo

Yes, there was still spinlock contention elsewhere in the exit path. The
bulk of the CPU time was spent in:

-   79.23%    79.23%  a.out  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
   - native_queued_spin_lock_slowpath
      - 99.99% queued_spin_lock_slowpath
         - 100.00% _raw_spin_lock
            - 99.98% list_lru_del
               - d_lru_del
                  - 100.00% select_collect
                       detach_and_collect
                       d_walk
                       d_invalidate
                       proc_flush_task
                       release_task
                       do_exit
                       do_group_exit
                       get_signal
                       do_signal
                       exit_to_usermode_loop
                       syscall_return_slowpath
                       int_ret_from_sys_call

The lock being contended here is nlru->lock. On the 4-node system that I
used, there are four of them (one per NUMA node).

Cheers,
Longman
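
P.S. To make the data-structure change easier to picture, here is a minimal
userspace sketch of the per-cpu list idea: one list head plus one spinlock
per CPU, with each node remembering which list it was added to so that
deletion only has to take that single lock. The pcpu_list_* names, the
pthread spinlocks and the sched_getcpu() bucketing are illustrative
stand-ins, not the actual kernel patch code.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>

struct pcpu_list_node {
        struct pcpu_list_node *next, *prev;
        int cpu;                        /* index of the list this node is on */
};

struct pcpu_list_head {
        struct pcpu_list_node list;     /* anchor of a circular doubly-linked list */
        pthread_spinlock_t lock;        /* protects only this CPU's list */
};

static struct pcpu_list_head *pcpu_lists;
static long nr_cpus;

/* Allocate one list head and one lock per online CPU. */
static int pcpu_list_init(void)
{
        nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
        pcpu_lists = calloc(nr_cpus, sizeof(*pcpu_lists));
        if (!pcpu_lists)
                return -1;
        for (long i = 0; i < nr_cpus; i++) {
                pcpu_lists[i].list.next = &pcpu_lists[i].list;
                pcpu_lists[i].list.prev = &pcpu_lists[i].list;
                pthread_spin_init(&pcpu_lists[i].lock, PTHREAD_PROCESS_PRIVATE);
        }
        return 0;
}

/* Add the node to the list of the CPU the caller is running on. */
static void pcpu_list_add(struct pcpu_list_node *node)
{
        int cpu = sched_getcpu();
        int idx = (cpu < 0) ? 0 : (int)(cpu % nr_cpus);
        struct pcpu_list_head *h = &pcpu_lists[idx];

        node->cpu = idx;
        pthread_spin_lock(&h->lock);
        node->next = h->list.next;
        node->prev = &h->list;
        h->list.next->prev = node;
        h->list.next = node;
        pthread_spin_unlock(&h->lock);
}

/* Deletion only needs the lock of the list the node was added to. */
static void pcpu_list_del(struct pcpu_list_node *node)
{
        struct pcpu_list_head *h = &pcpu_lists[node->cpu];

        pthread_spin_lock(&h->lock);
        node->prev->next = node->next;
        node->next->prev = node->prev;
        pthread_spin_unlock(&h->lock);
}

The trade-off is on the iteration side: anything that needs to walk the
whole set has to visit every per-cpu list and take each lock in turn, which
should be acceptable as long as full walks are much rarer than additions and
deletions.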