From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751890AbeBWRcJ (ORCPT ); Fri, 23 Feb 2018 12:32:09 -0500 Received: from mail-wm0-f49.google.com ([74.125.82.49]:55024 "EHLO mail-wm0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751805AbeBWRcG (ORCPT ); Fri, 23 Feb 2018 12:32:06 -0500 X-Google-Smtp-Source: AH8x224NP8PmUEwVaFSM9EIwZM9UPKwpbNiep1tbq+GqQIkJVaRe+xpxt4WtdRNwVuLzbiU5YKOwXw== Date: Fri, 23 Feb 2018 18:31:58 +0100 From: Andrea Parri To: Nikolay Borisov Cc: LKML , "Paul E. McKenney" , mathieu.desnoyers@efficios.com, Peter Zijlstra Subject: Re: Reasoning about memory ordering Message-ID: <20180223173158.GA3723@andrea> References: <0db16ef6-c805-b1f6-527f-8fec149e3df5@suse.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <0db16ef6-c805-b1f6-527f-8fec149e3df5@suse.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Feb 23, 2018 at 02:30:22PM +0200, Nikolay Borisov wrote: > Hello, > > I'm cc'ing a bunch of people I know are well-versed in > the black arts of memory ordering! > > Currently in btrfs we have roughly the following sequence: > > T1: T2: > i_size_write(inode, newsize); > set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags); atomic_inc(&inode->i_dio_count); > smp_mb(); if (iov_iter_rw(iter) == READ) { > if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK, &BTRFS_I(inode)->runtime_flags)) { > if (atomic_read(&inode->i_dio_count)) { if (atomic_dec_and_test(&inode->i_dio_count)) > wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP); wake_up_bit(&inode->i_state, __I_DIO_WAKEUP); > DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP); } > if (offset >= i_size_read(inode)) > do { return; > prepare_to_wait(wq, &q.wq_entry, TASK_UNINTERRUPTIBLE); } > if (atomic_read(&inode->i_dio_count)) > schedule(); > } while (atomic_read(&inode->i_dio_count)); > finish_wait(wq, &q.wq_entry); > } > > smp_mb__before_atomic(); > clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags); > > The semantics I'm after are: > > 1. If T1 goes to sleep, then T2 would see the > BTRFS_INODE_READDIO_NEED_LOCK and hence will execute the > atomic_dec_and_test and possibly wake up T1. This flag serves as a way > to indicate to possibly multiple T2 (dio readers) that T1 is blocked > and they should unblock it and resort to acquiring some locks (this is not > visible in this excerpt of code for brevity). It's sort of a back-off > mechanism. I don't see how this could be guaranteed, even in a sequentially consistent world (disclaimer: I'm certainly not familiar with btrfs): what is wrong in T1 T2 atomic_inc(i_dio_count) test_bit(NEED_LOCK, flags) // unset set_bit(NEED_LOCK, flags) atomic_read(i_dio_count) // >1 --> go to sleep Thanks, Andrea > > 2. BTRFS_INODE_READDIO_NEED_LOCK bit must be set _before_ going to sleep > > 3. BTRFS_INODE_READDIO_NEED_LOCK must be cleared _after_ the thread has > been woken up. > > 4. After T1 is woken up, it's possible that a new T2 comes and doesn't see > the BTRFS_INODE_READDIO_NEED_LOCK flag set but this is fine, since the check > for i_size should cause T2 to just return (it will also execute atomic_dec_and_test) > > Given this is the current state of the code (it's part of btrfs) I believe > the following could/should be done: > > 1. The smp_mb after the set_bit in T1 could be removed, since there is > already an implied full mm in prepare_to_wait. That is if we go to sleep, > then T2 is guaranteed to see the flag/i_size_write happening by merit of > the implied memory barrier in prepare_to_wait/schedule. But what if it doesn't > go to sleep? I still would like the i_size_write to be visible to T2 > > 2. The bit clearing code in T1 should be possible to be replaced by > clear_bit_unlock (this was suggested by PeterZ on IRC). > > 3. I suspect there is a memory barrier in T2 that is missing. Perhaps > there should be an smp_mb__before_atomic right before the test_bit so that > it's ordered with the implied smp_mb in T1's prepare_to_wait.