From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.2 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id F1920C432C1 for ; Tue, 24 Sep 2019 10:29:36 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id BB1EE2146E for ; Tue, 24 Sep 2019 10:29:36 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=mbobrowski-org.20150623.gappssmtp.com header.i=@mbobrowski-org.20150623.gappssmtp.com header.b="NXTlTyI2" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2409454AbfIXK3g (ORCPT ); Tue, 24 Sep 2019 06:29:36 -0400 Received: from mail-pg1-f194.google.com ([209.85.215.194]:37413 "EHLO mail-pg1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2405224AbfIXK3g (ORCPT ); Tue, 24 Sep 2019 06:29:36 -0400 Received: by mail-pg1-f194.google.com with SMTP id c17so1082314pgg.4 for ; Tue, 24 Sep 2019 03:29:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mbobrowski-org.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=cWHGhu/xEva0NCUnsouhp2Rqjl7y8RfVbbdYtfLeN1A=; b=NXTlTyI2vIC+atFvgwWyJ3X4f1wCJA+4VK9JPTs1mEoi1HY2sx4ltdBARifdKoKcS9 2yx9TcD/NNgkKGikxuWo09JtjlGYqkNrS4fGR5K76CoDkswsT9/pudKgmJYuMUwxkQim AVFMwL6sCWE2Jv2X/udeSLAJL3CdAT2QewEdvE4vqp+tP5my453goJ5AA05WSH2//mXI nipsqAuVLye7htJ4SftWF8jyqGju1+piB2jpn3C0SpvaebgJAlbfN7tJKZeSvd91DLMW JW5XV4ES9R6hTdYZrVjx188QWdHIiODMNHHxuV8LOEYIXwe0n6MtfBBRwhWlC2GPjCuv Fnxg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=cWHGhu/xEva0NCUnsouhp2Rqjl7y8RfVbbdYtfLeN1A=; b=mym6emA8EEfnOXT7f0HsA/U7mLfa+C6BL2rzh8qQtudiFMypuEXh6LHBaz9clpiP35 9FlRZt2hr4VG1zY2G8FbcaIMxr5qjkjBD0STf4r3RVDl5/MyS4WiPsAZxbkKt8uGXfae YXxu2ZbHqYEXcLX4w7//ORfRJF5W3aHWq36117RRiei87DkXlbMV+OB3waB6sEoGl70N ITJLzQOn25lFYEnTMVcYixdrnz82P4iueKfHtIHDRfaavilC6poCzuV8Ql6SfF7W73lr VdfJoPKxHTOzIleEY6cwBXBCsQ8vfkak96h2wqOTmGxPwzL/zxBpAbocg7WY+kZ89vgI fyog== X-Gm-Message-State: APjAAAW0cABvPH4WP8Wzci/sSi+MhwtmEE9HQJyFj305aM8GCR1jGn30 xtCmumFmrZpI9UqVfoSOBkzU X-Google-Smtp-Source: APXvYqy5qJMRHiBOqREZujbqzO1hT2uDA4E6FhAximKLjJJKWV2Mj0Z9FbBtcBXhVGD2Kb56U2B9IA== X-Received: by 2002:a17:90a:2301:: with SMTP id f1mr2163987pje.121.1569320973166; Tue, 24 Sep 2019 03:29:33 -0700 (PDT) Received: from bobrowski ([110.232.114.101]) by smtp.gmail.com with ESMTPSA id s1sm5023004pjs.31.2019.09.24.03.29.29 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 24 Sep 2019 03:29:32 -0700 (PDT) Date: Tue, 24 Sep 2019 20:29:26 +1000 From: Matthew Bobrowski To: Jan Kara Cc: tytso@mit.edu, adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, david@fromorbit.com, hch@infradead.org, darrick.wong@oracle.com Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure Message-ID: <20190924102926.GC17526@bobrowski> References: <20190923211011.GH20367@quack2.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190923211011.GH20367@quack2.suse.cz> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On Mon, Sep 23, 2019 at 11:10:11PM +0200, Jan Kara wrote: > On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote: > > +/* > > + * For a write that extends the inode size, ext4_dio_write_iter() will > > + * wait for the write to complete. Consequently, operations performed > > + * within this function are still covered by the inode_lock(). On > > + * success, this function returns 0. > > + */ > > +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error, > > + unsigned int flags) > > +{ > > + int ret; > > + loff_t offset = iocb->ki_pos; > > + struct inode *inode = file_inode(iocb->ki_filp); > > + > > + if (error) { > > + ret = ext4_handle_failed_inode_extension(inode, offset + size); > > + return ret ? ret : error; > > + } > > + > > + if (flags & IOMAP_DIO_UNWRITTEN) { > > + ret = ext4_convert_unwritten_extents(NULL, inode, > > + offset, size); > > + if (ret) > > + return ret; > > + } > > + > > + if (offset + size > i_size_read(inode)) { > > + ret = ext4_handle_inode_extension(inode, offset, size, 0); > > + if (ret) > > + return ret; > > + } > > With the suggestions I made to your patch 3/6 this could be simplified to: > > if (!error && flags & IOMAP_DIO_UNWRITTEN) { > error = ext4_convert_unwritten_extents(NULL, inode, offset, > size); > } > return ext4_handle_inode_extension(inode, offset, error ? : size, size); > > > Note the change that when ext4_convert_unwritten_extents() fails (although > this should not really happen unless there's some corruption going on), we > do properly truncate possible extents beyond i_size. This sounds good to me. I like this idea. > > +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) > > +{ > > + ssize_t ret; > > + size_t count; > > + loff_t offset = iocb->ki_pos; > > + struct inode *inode = file_inode(iocb->ki_filp); > > + bool extend = false, overwrite = false, unaligned_aio = false; > > + > > + if (!inode_trylock(inode)) { > > + if (iocb->ki_flags & IOCB_NOWAIT) > > + return -EAGAIN; > > + inode_lock(inode); > > + } > > + > > + if (!ext4_dio_checks(inode)) { > > + inode_unlock(inode); > > + /* > > + * Fallback to buffered IO if the operation on the > > + * inode is not supported by direct IO. > > + */ > > + return ext4_buffered_write_iter(iocb, from); > > + } > > + > > + ret = ext4_write_checks(iocb, from); > > + if (ret <= 0) { > > + inode_unlock(inode); > > + return ret; > > + } > > + > > + /* > > + * Unaligned direct AIO must be serialized among each other as > > + * the zeroing of partial blocks of two competing unaligned > > + * AIOs can result in data corruption. > > + */ > > + if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) && > > + !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) { > > + unaligned_aio = true; > > + inode_dio_wait(inode); > > + } > > + > > + /* > > + * Determine whether the IO operation will overwrite allocated > > + * and initialized blocks. If so, check to see whether it is > > + * possible to take the dioread_nolock path. > > + */ > > + count = iov_iter_count(from); > > + if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) && > > + ext4_should_dioread_nolock(inode)) { > > + overwrite = true; > > + downgrade_write(&inode->i_rwsem); > > + } > > + > > + if (offset + count > i_size_read(inode) || > > + offset + count > EXT4_I(inode)->i_disksize) { > > + ext4_update_i_disksize(inode, inode->i_size); > > + extend = true; > > + } > > This call to ext4_update_i_disksize() is definitely wrong. If nothing else, > you need to also have transaction started and call ext4_mark_inode_dirty() > to actually journal the change of i_disksize (ext4_update_i_disksize() > updates only the in-memory copy of the entry). Also the direct IO code > needs to add the inode to the orphan list so that in case of crash, blocks > allocated beyond EOF get truncated on next mount. That is the whole point > of this excercise with i_disksize after all. > > But I'm wondering if i_disksize update is needed. Truncate cannot be in > progress (we hold i_rwsem) and dirty pages will be flushed by > iomap_dio_rw() before we start to allocate any blocks. So it should be > enough to have here: Well, I initially thought the same, however doing some research shows that we have the following edge case: - 45d8ec4d9fd54 and - 73fdad00b208b In fact you can reproduce the exact same i_size corruption issue by running the generic/475 xfstests mutitple times, as articulated within 45d8ec4d9fd54. So with that, I'm kind of confused and thinking that there may be a problem that resides elsewhere that may need addressing? > if (offset + count > i_size_read(inode)) { > /* > * Add inode to orphan list so that blocks allocated beyond > * EOF get properly truncated in case of crash. > */ > start transaction handle > add inode to orphan list > stop transaction handle > } > > And just leave i_disksize at whatever it currently is. I originally had the code which added the inode to the orphan list here, but then I thought to myself that it'd make more sense to actually do this step closer to the point where we've managed to successfully allocate the required blocks for the write. This prevents the need to spray orphan list clean up code all over the place just to cover the case that a write which had intended to extend the inode beyond i_size had failed prematurely (i.e. before block allocation). So, hence the reason why I thought having it in ext4_iomap_begin() would make more sense, because at that point in the write path, there is enough/or more assurance to make the call around whether we will in fact be able to perform the write which will be extending beyond i_size, or not and consequently whether the inode should be placed onto the orphan list? Ideally I'd like to turn this statement into: if (offset + count > i_size_read(inode)) extend = true; Maybe I'm missing something here and there's actually a really good reason for doing this nice and early? What are your thoughts about what I've mentioned above? > > - * If we added blocks beyond i_size, we need to make sure they > > - * will get truncated if we crash before updating i_size in > > - * ext4_iomap_end(). For faults we don't need to do that (and > > - * even cannot because for orphan list operations inode_lock is > > - * required) - if we happen to instantiate block beyond i_size, > > - * it is because we race with truncate which has already added > > - * the inode to the orphan list. > > + * If we added blocks beyond i_size, we need to make > > + * sure they will get truncated if we crash before > > + * updating the i_size. For faults we don't need to do > > + * that (and even cannot because for orphan list > > + * operations inode_lock is required) - if we happen > > + * to instantiate block beyond i_size, it is because > > + * we race with truncate which has already added the > > + * inode to the orphan list. > > */ > > Just a nit but it would be nice to use full width of 80 columns when > formatting comments so that they don't get unnecessarily long. Sure. I'm blaming emacs for this. :P ----