From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from userp2120.oracle.com ([156.151.31.85]:47598 "EHLO userp2120.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753946AbeGFHL0 (ORCPT ); Fri, 6 Jul 2018 03:11:26 -0400 Subject: Re: [PATCH RFC 0/8] xfs: introduce inode data inline feature From: Shan Hai References: <1530846750-6686-1-git-send-email-shan.hai@oracle.com> <20180706054250.GU2234@dastard> <85b8d2ed-5749-c357-96c3-0f6d0d424d6e@oracle.com> Message-ID: <4e1632e5-d470-bf90-88f5-b82e4e18e7f9@oracle.com> Date: Fri, 6 Jul 2018 15:11:19 +0800 MIME-Version: 1.0 In-Reply-To: <85b8d2ed-5749-c357-96c3-0f6d0d424d6e@oracle.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Content-Language: en-US Sender: linux-xfs-owner@vger.kernel.org List-ID: List-Id: xfs To: Dave Chinner Cc: linux-xfs@vger.kernel.org On 2018年07月06日 14:39, Shan Hai wrote: > > > On 2018年07月06日 13:42, Dave Chinner wrote: >> On Fri, Jul 06, 2018 at 11:12:21AM +0800, Shan Hai wrote: >>> This series implements xfs inode data inlining feature. >> Cool. I was actually thinking about this earlier this week. :) >> > > Yes, this is a nice feature, and thanks for your previous suggestions > in the below link, > which saved my time a lot :) > >>> Refered below link during development: >>> https://marc.info/?l=linux-xfs&m=120493585731509&w=2 >>> >>> >>> How it works: >>> - the data inlining happens at: >>>    write_iter: for DIO/DAX write >>>    writeback: for buffered write >>> - extents to local format conversion is done in writeback but not in >>> write_iter >>> - local to extents format conversion is done in the write_iter and >>> writeback >> The problem with the way local to extents conversion is imlpemented >> in this patchset is that we do not synchronise data writeback with >> journal commits and log recovery.  The local to extents conversion >> appears to be done like so: >> >> ---- >> writeback arrives in local format, data in extent cached page >> convert inode to extent format >> allocate new block >> commit allocation and inode format conversion >> submit data IO to newly allocated block. >> ---- >> >> This works when nothing goes wrong, but it's not failsafe. The >> problem is that we've logged the inode data fork conversion before >> we've completed writing the data to the new block. IOWs, we can end >> up in the state where the active journal recoery window does not contain >> the data in the inode and the data has not yet reached the new >> allocated block on disk, either. >> >> If we crash in this window, we lose the data that was in the inode - >> log recovery finishes with the inode in extent form pointing to >> uninitialised data blocks. i.e. not only is it a data loss event, >> it's also a security issue because it exposes stale data. >> >> This has been the unsolved problem from all previous attempts to >> implement inline data in XFS. I actually outlined this problem and >> ways to solve it in the mailing list post linked above, but it >> doesn't appear that either mechanism was implemented in this >> patchset. >> >> However, unlike that post from 2008, we now have infrastructure that >> can be used to solve to this problem: the data path copy-on-write >> mechanism.  The COW mechanism is essentially a generic >> implementation of the "Method 1: use an intent/done transaction >> pair" solution I describe in the above post. >> >> This mechanism solves the above problem by storing the newly >> allocated block in the in-memory COW fork and doesn't modify the >> data fork until after the data write IO completes. IOWs, we do >> allocation before the write IO, and do the data fork and BMBT >> manipulation after the IO completes. i.e. we do the local->extent >> data fork modification at IO completion using the extent that was >> stored in the in memory COW fork. > > Yes, this can fix the race totally, I will try to implement it in the > subsequent series. > > Or how about below? It's similar to what we did in the > xfs_setattr_size, but the sync > write in this situation is ugly I know :) > > ---- > write arrives at local inode > allocate a page and insert it into page cache, copy the data from data > fork to the page > filemap_write_and_wait_range Oh no, please ignore this, it would never can work, I will try the data path COW idea. Thanks Shan Hai > > xfs_trans_roll > convert inode to extent format > allocate a new block > commit allocation and inode format conversion > ---- > > Thanks > Shan Hai >> Hence I think the inline data write path needs to piggy back on the >> iomap COW path that we use for writing to shared extents if the >> write would cause a data fork format change. >> >> Cheers, >> >> Dave. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at  http://vger.kernel.org/majordomo-info.html