From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964788AbbGPRgt (ORCPT ); Thu, 16 Jul 2015 13:36:49 -0400 Received: from mx1.redhat.com ([209.132.183.28]:52498 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756345AbbGPRel (ORCPT ); Thu, 16 Jul 2015 13:34:41 -0400 Date: Thu, 16 Jul 2015 19:32:56 +0200 From: Oleg Nesterov To: Jan Kara Cc: Dave Hansen , Al Viro , Linus Torvalds , Paul McKenney , Peter Zijlstra , Daniel Wagner , Davidlohr Bueso , Ingo Molnar , Tejun Heo , linux-kernel@vger.kernel.org Subject: Re: [PATCH RFC 0/4] change sb_writers to use percpu_rw_semaphore Message-ID: <20150716173256.GA17753@redhat.com> References: <20150713212536.GA13855@redhat.com> <20150714104810.GB24369@quack.suse.cz> <20150714133731.GA24837@redhat.com> <55A57C50.1080406@linux.intel.com> <20150714212249.GA18441@redhat.com> <55A581F9.3090507@linux.intel.com> <20150715064705.GA22609@quack.suse.cz> <20150715181920.GA1797@redhat.com> <20150716072654.GE22847@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150716072654.GE22847@quack.suse.cz> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/16, Jan Kara wrote: > > On Wed 15-07-15 20:19:20, Oleg Nesterov wrote: > > > > Perhaps it makes to merge other 2 patches from Dave first? (those which > > change __sb_start/end_write to rely on RCU). Afaics these changes are > > straightforward and correct. Although I'd suggest to use preempt_disable() > > and synchronize_sched() instead. I will be happy to (try to) make this > > conversion on top of his changes. > > > > Because I do not want to delay the performance improvements and I do not > > know when exactly I'll send the next version: I need to finish the previous > > discussion about rcu_sync first. And the necessary changes in fs/super.c > > depend on whether percpu_rw_semaphore will have rcu_sync or not (not too > > much, only destroy_super() depends, but still). > > > > And of course, I am worried that I missed something and percpu_rw_semaphore > > can't work for some reason. The code in fs/super.c looks simple, but it > > seems that filesystems do the "strange" things with lockdep at least. > > So Dave's patches would go in only in the next merge window anyway so we > still have like two-three weeks to decide which patchset to take. OK, good. > If you > think it will take you longer, Hopefully not. > then merging Dave's patches makes some sense > although I personally don't think the issue is so important that we have to > fix it ASAP and eventual delay of one more release would be OK for me. OK. I will try to do this in any case, I just wanted to say that I can equally do this on top of Dave's patches. To remind, I need to finish the discussion about percpu_rw_semaphore and rcu_sync, then I'll try to make v2. And. The biggest problem is lockdep. Everything else looks really simple although of course I could miss something. And not only because the filesystems abuse lockdep and thus we need some cleanups first. Unless I am totally confused fs/super.c needs some cleanups (and fixes) too, with or without this conversion. Say, acquire_freeze_lock() logic does do not look right: - wait_event(.frozen < level) without rwsem_acquire_read() is just wrong from lockdep perspective. If we are going to deadlock because the caller is buggy, lockdep can't warn us. - __sb_start_write() can race with thaw_super() + freeze_super(), and after "goto retry" the 2nd acquire_freeze_lock() is wrong. - The "tell lockdep we are doing trylock" hack doesn't look nice. I think this is correct, but this logic should be more explicit. Yes, the recursive read_lock() is fine if we hold the lock on higher level. But we do not need to fool lockdep. If we can not deadlock in this case (and I agree we can't) we should simply use wait == F consistently. Something like this (not even compiled tested): static int do_sb_start_write(struct super_block *sb, int level, bool wait, unsigned long ip) { if (wait) rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 0, ip); retry: if (unlikely(sb->s_writers.frozen >= level)) { if (!wait) return 0; wait_event(sb->s_writers.wait_unfrozen, sb->s_writers.frozen < level); } percpu_counter_inc(&sb->s_writers.counter[level-1]); /* * Make sure counter is updated before we check for frozen. * freeze_super() first sets frozen and then checks the counter. */ smp_mb(); if (unlikely(sb->s_writers.frozen >= level)) { __sb_end_write(sb, level); goto retry; } if (!wait) rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 1, ip); return 1; } /* * This is an internal function, please use sb_start_{write,pagefault,intwrite} * instead. */ int __sb_start_write(struct super_block *sb, int level, bool wait) { bool cantbelocked = false; int ret; #ifdef CONFIG_LOCKDEP /* * We want lockdep to tell us about possible deadlocks with freezing but * it's it bit tricky to properly instrument it. Getting a freeze protection * works as getting a read lock but there are subtle problems. XFS for example * gets freeze protection on internal level twice in some cases, which is OK * only because we already hold a freeze protection also on higher level. Due * to these cases we have to use wait == F (trylock mode) which must not fail. */ if (wait) { int i; for (i = 0; i < level - 1; i++) if (lock_is_held(&sb->s_writers.lock_map[i])) { cantbelocked = true; break; } } #endif ret = do_sb_start_write(sb, level, wait && !cantbelocked, _RET_IP_); WARN_ON(cantbelocked & !ret); return ret; } This should not generate the additional code if CONFIG_LOCKDEP=n and After this patch it will be trivial to convert __sb_start_write(), but we need some more cleanups. And perhaps I'll send some changes (like above) separately, because again, I think they make sense in any case. In short: I'll try to make v2 asap, but this is all I can promise ;) Oleg.