From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S964788AbbGPRgt (ORCPT <rfc822;w@1wt.eu>);
	Thu, 16 Jul 2015 13:36:49 -0400
Received: from mx1.redhat.com ([209.132.183.28]:52498 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1756345AbbGPRel (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Thu, 16 Jul 2015 13:34:41 -0400
Date: Thu, 16 Jul 2015 19:32:56 +0200
From: Oleg Nesterov <oleg@redhat.com>
To: Jan Kara <jack@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>,
        Al Viro <viro@zeniv.linux.org.uk>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Paul McKenney <paulmck@linux.vnet.ibm.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Daniel Wagner <daniel.wagner@bmw-carit.de>,
        Davidlohr Bueso <dave@stgolabs.net>, Ingo Molnar <mingo@redhat.com>,
        Tejun Heo <tj@kernel.org>, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC 0/4] change sb_writers to use percpu_rw_semaphore
Message-ID: <20150716173256.GA17753@redhat.com>
References: <20150713212536.GA13855@redhat.com> <20150714104810.GB24369@quack.suse.cz> <20150714133731.GA24837@redhat.com> <55A57C50.1080406@linux.intel.com> <20150714212249.GA18441@redhat.com> <55A581F9.3090507@linux.intel.com> <20150715064705.GA22609@quack.suse.cz> <20150715181920.GA1797@redhat.com> <20150716072654.GE22847@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150716072654.GE22847@quack.suse.cz>
User-Agent: Mutt/1.5.18 (2008-05-17)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 07/16, Jan Kara wrote:
>
> On Wed 15-07-15 20:19:20, Oleg Nesterov wrote:
> >
> > Perhaps it makes to merge other 2 patches from Dave first? (those which
> > change __sb_start/end_write to rely on RCU). Afaics these changes are
> > straightforward and correct. Although I'd suggest to use preempt_disable()
> > and synchronize_sched() instead. I will be happy to (try to) make this
> > conversion on top of his changes.
> >
> > Because I do not want to delay the performance improvements and I do not
> > know when exactly I'll send the next version: I need to finish the previous
> > discussion about rcu_sync first. And the necessary changes in fs/super.c
> > depend on whether percpu_rw_semaphore will have rcu_sync or not (not too
> > much, only destroy_super() depends, but still).
> >
> > And of course, I am worried that I missed something and percpu_rw_semaphore
> > can't work for some reason. The code in fs/super.c looks simple, but it
> > seems that filesystems do the "strange" things with lockdep at least.
>
> So Dave's patches would go in only in the next merge window anyway so we
> still have like two-three weeks to decide which patchset to take.

OK, good.

> If you
> think it will take you longer,

Hopefully not.

> then merging Dave's patches makes some sense
> although I personally don't think the issue is so important that we have to
> fix it ASAP and eventual delay of one more release would be OK for me.

OK. I will try to do this in any case, I just wanted to say that I can
equally do this on top of Dave's patches.

To remind, I need to finish the discussion about percpu_rw_semaphore
and rcu_sync, then I'll try to make v2.

And. The biggest problem is lockdep. Everything else looks really simple
although of course I could miss something. And not only because the
filesystems abuse lockdep and thus we need some cleanups first. Unless
I am totally confused fs/super.c needs some cleanups (and fixes) too,
with or without this conversion. Say, acquire_freeze_lock() logic does
do not look right:

	- wait_event(.frozen < level) without rwsem_acquire_read() is
	  just wrong from lockdep perspective. If we are going to deadlock
	  because the caller is buggy, lockdep can't warn us.

	- __sb_start_write() can race with thaw_super() + freeze_super(),
	  and after "goto retry" the 2nd  acquire_freeze_lock() is wrong.

	- The "tell lockdep we are doing trylock" hack doesn't look nice.

	  I think this is correct, but this logic should be more explicit.
	  Yes, the recursive read_lock() is fine if we hold the lock on
	  higher level. But we do not need to fool lockdep. If we can not
	  deadlock in this case (and I agree we can't) we should simply use
	  wait == F consistently.

Something like this (not even compiled tested):

	static int
	do_sb_start_write(struct super_block *sb, int level, bool wait, unsigned long ip)
	{

		if (wait)
			rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 0, ip);
	retry:
		if (unlikely(sb->s_writers.frozen >= level)) {
			if (!wait)
				return 0;
			wait_event(sb->s_writers.wait_unfrozen,
				   sb->s_writers.frozen < level);
		}

		percpu_counter_inc(&sb->s_writers.counter[level-1]);
		/*
		 * Make sure counter is updated before we check for frozen.
		 * freeze_super() first sets frozen and then checks the counter.
		 */
		smp_mb();
		if (unlikely(sb->s_writers.frozen >= level)) {
			__sb_end_write(sb, level);
			goto retry;
		}

		if (!wait)
			rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 1, ip);
		return 1;
	}

	/*
	 * This is an internal function, please use sb_start_{write,pagefault,intwrite}
	 * instead.
	 */
	int __sb_start_write(struct super_block *sb, int level, bool wait)
	{
		bool cantbelocked = false;
		int ret;

	#ifdef CONFIG_LOCKDEP
		/*
		 * We want lockdep to tell us about possible deadlocks with freezing but
		 * it's it bit tricky to properly instrument it. Getting a freeze protection
		 * works as getting a read lock but there are subtle problems. XFS for example
		 * gets freeze protection on internal level twice in some cases, which is OK
		 * only because we already hold a freeze protection also on higher level. Due
		 * to these cases we have to use wait == F (trylock mode) which must not fail.
		 */
		if (wait) {
			int i;

			for (i = 0; i < level - 1; i++)
				if (lock_is_held(&sb->s_writers.lock_map[i])) {
					cantbelocked = true;
					break;
				}
		}
	#endif
		ret = do_sb_start_write(sb, level, wait && !cantbelocked, _RET_IP_);
		WARN_ON(cantbelocked & !ret);
		return ret;
	}

This should not generate the additional code if CONFIG_LOCKDEP=n and
After this patch it will be trivial to convert __sb_start_write(), but
we need some more cleanups. And perhaps I'll send some changes (like
above) separately, because again, I think they make sense in any case.

In short: I'll try to make v2 asap, but this is all I can promise ;)

Oleg.