From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1760815AbcBYOoD (ORCPT <rfc822;w@1wt.eu>);
	Thu, 25 Feb 2016 09:44:03 -0500
Received: from mail-bn1bon0133.outbound.protection.outlook.com ([157.56.111.133]:17680
	"EHLO na01-bn1-obe.outbound.protection.outlook.com"
	rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
	id S1760610AbcBYOoA (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Thu, 25 Feb 2016 09:44:00 -0500
Authentication-Results: kernel.org; dkim=none (message not signed)
 header.d=none;kernel.org; dmarc=none action=none header.from=hpe.com;
Message-ID: <56CF1322.2040609@hpe.com>
Date: Thu, 25 Feb 2016 09:43:46 -0500
From: Waiman Long <waiman.long@hpe.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20130109 Thunderbird/10.0.12
MIME-Version: 1.0
To: Ingo Molnar <mingo@kernel.org>
CC: Jan Kara <jack@suse.cz>, Alexander Viro <viro@zeniv.linux.org.uk>,
        Jan Kara <jack@suse.com>, Jeff Layton <jlayton@poochiereds.net>,
        "J. Bruce Fields" <bfields@fieldses.org>, Tejun Heo <tj@kernel.org>,
        Christoph Lameter <cl@linux-foundation.org>,
        <linux-fsdevel@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
        Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>,
        Andi Kleen <andi@firstfloor.org>, Dave Chinner <dchinner@redhat.com>,
        Scott J Norton <scott.norton@hp.com>,
        Douglas Hatch <doug.hatch@hp.com>
Subject: Re: [PATCH v3 3/3] vfs: Use per-cpu list for superblock's inode list
References: <1456254272-42313-1-git-send-email-Waiman.Long@hpe.com> <1456254272-42313-4-git-send-email-Waiman.Long@hpe.com> <20160224082840.GB10096@quack.suse.cz> <20160224083630.GA22868@gmail.com> <20160224085858.GE10096@quack.suse.cz> <20160225080635.GB10611@gmail.com>
In-Reply-To: <20160225080635.GB10611@gmail.com>
Content-Type: multipart/mixed;
	boundary="------------040703040006030204020606"
X-Originating-IP: [72.71.243.170]
X-ClientProxiedBy: SN1PR12CA0014.namprd12.prod.outlook.com (25.162.96.152) To
 TU4PR84MB0319.NAMPRD84.PROD.OUTLOOK.COM (25.162.186.29)
X-OriginatorOrg: hpe.com
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 25 Feb 2016 14:43:53.4360 (UTC)
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-Transport-CrossTenantHeadersStamped: TU4PR84MB0319
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

--------------040703040006030204020606
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit

On 02/25/2016 03:06 AM, Ingo Molnar wrote:
> * Jan Kara<jack@suse.cz>  wrote:
>
>>>>> With an exit microbenchmark that creates a large number of threads,
>>>>> attachs many inodes to them and then exits. The runtimes of that
>>>>> microbenchmark with 1000 threads before and after the patch on a 4-socket
>>>>> Intel E7-4820 v3 system (40 cores, 80 threads) were as follows:
>>>>>
>>>>>    Kernel            Elapsed Time    System Time
>>>>>    ------            ------------    -----------
>>>>>    Vanilla 4.5-rc4      65.29s         82m14s
>>>>>    Patched 4.5-rc4      22.81s         23m03s
>>>>>
>>>>> Before the patch, spinlock contention at the inode_sb_list_add() function
>>>>> at the startup phase and the inode_sb_list_del() function at the exit
>>>>> phase were about 79% and 93% of total CPU time respectively (as measured
>>>>> by perf). After the patch, the percpu_list_add() function consumed only
>>>>> about 0.04% of CPU time at startup phase. The percpu_list_del() function
>>>>> consumed about 0.4% of CPU time at exit phase. There were still some
>>>>> spinlock contention, but they happened elsewhere.
>>>> While looking through this patch, I have noticed that the
>>>> list_for_each_entry_safe() iterations in evict_inodes() and
>>>> invalidate_inodes() are actually unnecessary. So if you first apply the
>>>> attached patch, you don't have to implement safe iteration variants at all.
>>>>
>>>> As a second comment, I'd note that this patch grows struct inode by 1
>>>> pointer. It is probably acceptable for large machines given the speedup but
>>>> it should be noted in the changelog. Furthermore for UP or even small SMP
>>>> systems this is IMHO undesired bloat since the speedup won't be noticeable.
>>>>
>>>> So for these small systems it would be good if per-cpu list magic would just
>>>> fall back to single linked list with a spinlock. Do you think that is
>>>> reasonably doable?
>>> Even many 'small' systems tend to be SMP these days.
>> Yes, I know. But my tablet with 4 ARM cores is unlikely to benefit from this
>> change either. [...]
> I'm not sure about that at all, the above numbers are showing a 3x-4x speedup in
> system time, which ought to be noticeable on smaller SMP systems as well.
>
> Waiman, could you please post the microbenchmark?
>
> Thanks,
>
> 	Ingo

The microbenchmark that I used is attached.

I do agree that the performance benefit will decrease as the number of 
CPUs gets smaller. The system that I used for testing has 4 sockets with 
40 cores (80 threads). Dave Chinner ran his fstests on a 16-core system 
(probably 2-socket), which showed a modest improvement in performance 
(~4m40s vs 4m30s in runtime).

This patch enables parallel insertion into and deletion from the inode 
list, which used to be a serialized operation. So if that list operation 
is a bottleneck, you will see a significant improvement. If it is not, 
you may not notice much of a difference. For a single-socket 4-core 
system, I agree that the performance benefit, if any, will be limited.

Cheers,
Longman


--------------040703040006030204020606
Content-Type: text/plain; name="exit_test.c"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="exit_test.c"

/*
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * Authors: Waiman Long <waiman.long@hp.com>
 */
/*
 * This is an exit test
 */
#include <ctype.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/syscall.h>


#define do_exit()	syscall(SYS_exit, 0)	/* exit the calling thread only */
#define	gettid()	syscall(SYS_gettid)
#define	MAX_THREADS	2048

static inline void cpu_relax(void)
{
        __asm__ __volatile__("rep;nop": : :"memory");
}

static inline void atomic_inc(volatile int *v)
{
	__asm__ __volatile__("lock incl %0": "+m" (*v));
}

static volatile int exit_now  = 0;
static volatile int threadcnt = 0;

/*
 * Walk the /proc/<tid> tree of the calling thread so that its entries
 * fill the dentry cache
 */
static void walk_procfs(void)
{
	char cmdbuf[256];
	pid_t tid = gettid();

	snprintf(cmdbuf, sizeof(cmdbuf), "find /proc/%d > /dev/null 2>&1", tid);
	if (system(cmdbuf) < 0)
		perror("system() failed!");
}

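static void *exit_thread(void *dummy)
{
	(void)dummy;	/* thread index, unused */

	walk_procfs();
	atomic_inc(&threadcnt);
	/*
	 * Poll (sleeping between checks) until the exit_now flag is set,
	 * then exit this thread. exit_test() never sets the flag; it ends
	 * the run with SIGKILL instead.
	 */
	while (!exit_now)
		sleep(1);
	do_exit();
	return NULL;	/* Not reached */
}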

static void exit_test(int threads)
{
	pthread_t thread[threads];
	long i = 0, finish;
	time_t start = time(NULL);

	while (i++ < threads) {
		if (pthread_create(thread + i - 1, NULL, exit_thread,
				  (void *)i)) {
			perror("pthread_create");
			exit(1);
		}
#if 0
		/*
		 * Pipelining to reduce contention & improve speed
		 */
		if ((i & 0xf) == 0)
			 while (i - threadcnt > 12)
				usleep(1);
#endif
	}
	while (threadcnt != threads)
		usleep(1);
	walk_procfs();
	printf("Setup time = %lds\n", (long)(time(NULL) - start));
	printf("Process ready to exit!\n");
	kill(0, SIGKILL);
	exit(0);
}

int main(int argc, char *argv[])
{
	int   tcnt;	/* Thread counts */
	char *cmd = argv[0];

	if ((argc != 2) || !isdigit(argv[1][0])) {
		fprintf(stderr, "Usage: %s <thread count>\n", cmd);
		exit(1);
	}
	tcnt = strtoul(argv[1], NULL, 10);
	if (tcnt < 1 || tcnt > MAX_THREADS) {
		fprintf(stderr, "Error: thread count should be between 1 and %d\n",
			MAX_THREADS);
		exit(1);
	}
	exit_test(tcnt);
	return 0;	/* Not reachable */
}

--------------040703040006030204020606--