From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4C0CBC433DB for ; Mon, 11 Jan 2021 19:01:18 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id EF51A22795 for ; Mon, 11 Jan 2021 19:01:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2390625AbhAKTAy (ORCPT ); Mon, 11 Jan 2021 14:00:54 -0500 Received: from us-smtp-delivery-124.mimecast.com ([216.205.24.124]:21788 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2390531AbhAKTAx (ORCPT ); Mon, 11 Jan 2021 14:00:53 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1610391566; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=g2GdxugnWr3ZKzDNa7wKoHZrbCSv3DhlAbpXvzRS6pY=; b=FH3Rj7pUKT3hB9e2j+yqE4bxoQUj1tOiJfGdiu8gIlwXz69jdR4X5bSXi4y/he9nKoq/YV 6XxNnCrqHuvGlNlSYIc7XosWEfZxJc8hRvweeiqFevcESkRzUxRvvyTVlO9ZcEkS85yo/R ugYuSclXR3KUD8QamKS8S1C+yYwZabA= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-181-sXuHQNqlOdeD05WT_9Okdg-1; Mon, 11 Jan 2021 13:59:24 -0500 X-MC-Unique: sXuHQNqlOdeD05WT_9Okdg-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 26365107ACFE; Mon, 11 Jan 2021 18:59:23 +0000 (UTC) Received: from bfoster (ovpn-114-23.rdu2.redhat.com [10.10.114.23]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 82E4F60BE5; Mon, 11 Jan 2021 18:59:22 +0000 (UTC) Date: Mon, 11 Jan 2021 13:59:20 -0500 From: Brian Foster To: Christoph Hellwig Cc: linux-xfs@vger.kernel.org, Avi Kivity Subject: Re: [PATCH 3/3] xfs: try to avoid the iolock exclusive for non-aligned direct writes Message-ID: <20210111185920.GF1091932@bfoster> References: <20210111161212.1414034-1-hch@lst.de> <20210111161212.1414034-4-hch@lst.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20210111161212.1414034-4-hch@lst.de> X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Mon, Jan 11, 2021 at 05:12:12PM +0100, Christoph Hellwig wrote: > We only need the exclusive iolock for direct writes to protect sub-block > zeroing after an allocation or conversion of unwritten extents, and the > synchronous execution of these writes is also only needed because the > iolock is dropped early for the dodgy i_dio_count synchronisation. > > Always start out with the shared iolock in xfs_file_dio_aio_write for > non-appending writes and only upgrade it to exclusive if the start and > end of the write range are not already allocated and in written > state. This means one or two extra lookups in the in-core extent tree, > but with our btree data structure those lookups are very cheap and do > not show up in profiles on NVMe hardware for me. On the other hand > avoiding the lock allows for a high concurrency using aio or io_uring. > > Signed-off-by: Christoph Hellwig > Reported-by: Avi Kivity > Suggested-by: Brian Foster > --- > fs/xfs/xfs_file.c | 127 +++++++++++++++++++++++++++++++++++----------- > 1 file changed, 96 insertions(+), 31 deletions(-) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 1470fc4f2e0255..59d4c6e90f06c1 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -521,6 +521,57 @@ static const struct iomap_dio_ops xfs_dio_write_ops = { > .end_io = xfs_dio_write_end_io, > }; > > +static int > +xfs_dio_write_exclusive( > + struct kiocb *iocb, > + size_t count, > + bool *exclusive_io) > +{ > + struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp)); > + struct xfs_mount *mp = ip->i_mount; > + struct xfs_ifork *ifp = &ip->i_df; > + loff_t offset = iocb->ki_pos; > + loff_t end = offset + count; > + xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset); > + xfs_fileoff_t end_fsb = XFS_B_TO_FSB(mp, end); > + struct xfs_bmbt_irec got = { }; > + struct xfs_iext_cursor icur; > + int ret; > + > + *exclusive_io = true; > + > + /* > + * Bmap information not read in yet or no blocks allocated at all? > + */ > + if (!(ifp->if_flags & XFS_IFEXTENTS) || !ip->i_d.di_nblocks) > + return 0; > + > + ret = xfs_ilock_iocb(iocb, XFS_ILOCK_SHARED); > + if (ret) > + return ret; It looks like this helper is only called with ILOCK_SHARED already held. > + > + if (offset & mp->m_blockmask) { > + if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, &icur, &got) || > + got.br_startoff > offset_fsb || > + got.br_state == XFS_EXT_UNWRITTEN) > + goto out_unlock; > + } > + > + if ((end & mp->m_blockmask) && > + got.br_startoff + got.br_blockcount <= end_fsb) { > + if (!xfs_iext_lookup_extent(ip, ifp, end_fsb, &icur, &got) || > + got.br_startoff > end_fsb || > + got.br_state == XFS_EXT_UNWRITTEN) > + goto out_unlock; This line and the same goto in the previous block are both whitespace damaged (tabs before spaces). > + } > + > + *exclusive_io = false; > + > +out_unlock: > + xfs_iunlock(ip, XFS_ILOCK_SHARED); > + return ret; > +} > + > /* > * xfs_file_dio_aio_write - handle direct IO writes > * > @@ -557,8 +608,9 @@ xfs_file_dio_aio_write( > struct xfs_inode *ip = XFS_I(inode); > struct xfs_mount *mp = ip->i_mount; > ssize_t ret = 0; > - int unaligned_io = 0; > - int iolock; > + int iolock = XFS_IOLOCK_SHARED; > + bool subblock_io = false; > + bool exclusive_io = false; > size_t count = iov_iter_count(from); > struct xfs_buftarg *target = xfs_inode_buftarg(ip); > > @@ -566,45 +618,58 @@ xfs_file_dio_aio_write( > if ((iocb->ki_pos | count) & target->bt_logical_sectormask) > return -EINVAL; > > - /* > - * Don't take the exclusive iolock here unless the I/O is unaligned to > - * the file system block size. We don't need to consider the EOF > - * extension case here because xfs_file_aio_write_checks() will relock > - * the inode as necessary for EOF zeroing cases and fill out the new > - * inode size as appropriate. > - */ > + /* I/O that is not aligned to the fsblock size will need special care */ > if ((iocb->ki_pos & mp->m_blockmask) || > - ((iocb->ki_pos + count) & mp->m_blockmask)) { > - unaligned_io = 1; > + ((iocb->ki_pos + count) & mp->m_blockmask)) > + subblock_io = true; > > - /* > - * We can't properly handle unaligned direct I/O to reflink > - * files yet, as we can't unshare a partial block. > - */ > - if (xfs_is_cow_inode(ip)) { > - trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos, count); > - return -ENOTBLK; > - } > - iolock = XFS_IOLOCK_EXCL; > - } else { > - iolock = XFS_IOLOCK_SHARED; > + /* > + * We can't properly handle unaligned direct I/O to reflink files yet, > + * as we can't unshare a partial block. > + */ > + if (subblock_io && xfs_is_cow_inode(ip)) { > + trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos, count); > + return -ENOTBLK; > } > > - if (iocb->ki_flags & IOCB_NOWAIT) { > - /* unaligned dio always waits, bail */ > - if (unaligned_io) > - return -EAGAIN; > - if (!xfs_ilock_nowait(ip, iolock)) > + /* > + * Racy shortcut for obvious appends to avoid too much relocking: s/:/./ > + */ > + if (iocb->ki_pos > i_size_read(inode)) { > + if (iocb->ki_flags & IOCB_NOWAIT) > return -EAGAIN; Not sure why we need this check here if we'll eventually fall into the serialized check. It seems safer to me to just do 'iolock = XFS_IOLOCK_EXCL;' here and carry on. > - } else { > - xfs_ilock(ip, iolock); > + iolock = XFS_IOLOCK_EXCL; > } > > +relock: > + ret = xfs_ilock_iocb(iocb, iolock); > + if (ret) > + return ret; > ret = xfs_file_aio_write_checks(iocb, from, &iolock); > if (ret) > goto out; > count = iov_iter_count(from); > > + /* > + * Upgrade to an exclusive lock and force synchronous completion if the > + * I/O will require partial block zeroing. > + * We don't need to consider the EOF extension case here because > + * xfs_file_aio_write_checks() will relock the inode as necessary for > + * EOF zeroing cases and fill out the new inode size as appropriate. > + */ > + if (iolock != XFS_IOLOCK_EXCL && subblock_io) { > + ret = xfs_dio_write_exclusive(iocb, count, &exclusive_io); > + if (ret) > + goto out; > + if (exclusive_io) { > + xfs_iunlock(ip, iolock); > + if (iocb->ki_flags & IOCB_NOWAIT) > + return -EAGAIN; > + iolock = XFS_IOLOCK_EXCL; > + goto relock; > + } > + } > + > /* > * If we are doing unaligned IO, we can't allow any other overlapping IO > * in-flight at the same time or we risk data corruption. Wait for all > @@ -612,7 +677,7 @@ xfs_file_dio_aio_write( > * iolock if we had to take the exclusive lock in > * xfs_file_aio_write_checks() for other reasons. > */ > - if (unaligned_io) { > + if (exclusive_io) { Hmm.. so if we hold or upgrade to ILOCK_EXCL from the start for whatever reason, we'd never actually check whether the I/O is "exclusive" or not. Then we fall into here, demote the lock and the iomap layer may very well end up doing subblock zeroing. I suspect if we wanted to maintain this logic, the exclusive I/O check should occur for any subblock_io regardless of how the lock is held. Also, the comment above this check could use an update since it still refers to "unaligned IO." Brian > inode_dio_wait(inode); > } else if (iolock == XFS_IOLOCK_EXCL) { > xfs_ilock_demote(ip, XFS_IOLOCK_EXCL); > @@ -626,7 +691,7 @@ xfs_file_dio_aio_write( > */ > ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops, > &xfs_dio_write_ops, > - is_sync_kiocb(iocb) || unaligned_io); > + is_sync_kiocb(iocb) || exclusive_io); > out: > if (iolock) > xfs_iunlock(ip, iolock); > -- > 2.29.2 >