From mboxrd@z Thu Jan 1 00:00:00 1970
From: Junxiao Bi
Date: Mon, 3 May 2021 10:25:31 -0700
Subject: [Cluster-devel] [PATCH 1/3] fs/buffer.c: add new api to allow eof writeback
In-Reply-To: <20210503102904.GC2994@quack2.suse.cz>
References: <20210426220552.45413-1-junxiao.bi@oracle.com>
 <3f06d108-1b58-6473-35fa-0d6978e219b8@oracle.com>
 <20210430124756.GA5315@quack2.suse.cz>
 <20210503102904.GC2994@quack2.suse.cz>
Message-ID: <72cde802-bd8a-9ce5-84d7-57b34a6a8b03@oracle.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On 5/3/21 3:29 AM, Jan Kara wrote:
> On Fri 30-04-21 14:18:15, Junxiao Bi wrote:
>> On 4/30/21 5:47 AM, Jan Kara wrote:
>>
>>> On Thu 29-04-21 11:07:15, Junxiao Bi wrote:
>>>> On 4/29/21 10:14 AM, Andreas Gruenbacher wrote:
>>>>> On Tue, Apr 27, 2021 at 4:44 AM Junxiao Bi wrote:
>>>>>> When doing truncate/fallocate for some filesystems like ocfs2, it
>>>>>> will zero some pages that are out of inode size and then later
>>>>>> update the inode size, so it needs this api to write back eof
>>>>>> pages.
>>>>> Is this in reaction to Jan's "[PATCH 0/12 v4] fs: Hole punch vs page
>>>>> cache filling races" patch set [*]? It doesn't look like the kind of
>>>>> patch Christoph would be happy with.
>>>> Thank you for pointing out that patch set. I think it is fixing a
>>>> different issue.
>>>>
>>>> The issue here is that when extending the file size with
>>>> fallocate/truncate, if the original inode size is in the middle of
>>>> the last cluster block (1M), the eof part will be zeroed with a
>>>> buffer write first, and then the new inode size is updated, so there
>>>> is a window where dirty pages are beyond the inode size. If
>>>> writeback kicks in during that window, block_write_full_page will
>>>> drop all those eof pages.
>>> I agree that the buffers describing the part of the cluster beyond
>>> i_size won't be written. But the page cache will remain zeroed out,
>>> so that is fine. So you only need to zero out the on-disk contents.
>>> Since this is actually a physically contiguous range of blocks, why
>>> don't you just use sb_issue_zeroout() to zero out the tail of the
>>> cluster? It will be more efficient than going through the page cache
>>> and you also won't have to tweak block_write_full_page()...
>> Thanks for the review.
>>
>> The physical blocks to be zeroed are contiguous only when sparse mode
>> is enabled. If sparse mode is disabled, unwritten extents are not
>> supported by ocfs2, so all the blocks up to the new size will be
>> zeroed by the buffer write. Since sb_issue_zeroout() has to wait for
>> the io to complete, there will be a lot of delay when extending the
>> file size. Using writeback to flush asynchronously seemed more
>> efficient?
> It depends. Higher end storage (e.g. NVMe or NAS, maybe some better
> SATA flash disks as well) does support the WRITE_ZERO command so you
> don't actually have to write all those zeros. The storage will just
> internally mark all those blocks as having zeros. This is rather fast
> so I'd expect the overall result to be faster than zeroing the page
> cache and then writing all those pages of zeroes on transaction
> commit. But I agree that for lower end storage this may be slower
> because of the synchronous writing of zeroes. That being said, your
> transaction commit has to write those zeroes anyway, so the cost is
> mostly just shifted, but it could still make a difference for some
> workloads. Not sure if that matters; that is your call, I'd say.
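
If I follow, the suggestion is roughly the following (a minimal sketch
only; the helper name and its parameters are illustrative, not actual
ocfs2 code):

	#include <linux/blkdev.h>

	/*
	 * Zero the tail of the last cluster directly on disk instead
	 * of dirtying page cache beyond i_size. sb_issue_zeroout()
	 * takes block numbers in filesystem block units, submits the
	 * zeroing bios and waits for them; on devices with WRITE
	 * ZEROES support no payload of zeroes is actually transferred.
	 */
	static int zero_cluster_tail(struct super_block *sb,
				     sector_t start_block,
				     sector_t nr_blocks)
	{
		return sb_issue_zeroout(sb, start_block, nr_blocks,
					GFP_NOFS);
	}
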
Ocfs2 is mostly used with SAN storage, and I don't think it's common
for SAN arrays to support the WRITE_ZERO command. Is there anything
wrong with adding a new api to support eof writeback?

Thanks,

Junxiao.

> Also note that you could submit those zeroing bios asynchronously,
> but that would be more coding and you'd need to make sure they are
> completed on transaction commit, so it probably isn't worth the
> complexity.
>
> 								Honza
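
P.S. For completeness, whether a given device actually advertises
WRITE ZEROES can be checked from the queue limits; a minimal sketch
(the bdev lookup and any locking are elided, the helper name is
illustrative):

	#include <linux/blkdev.h>

	/*
	 * bdev_write_zeroes_sectors() returns 0 when the queue does
	 * not advertise WRITE ZEROES; in that case
	 * blkdev_issue_zeroout() falls back to synchronously writing
	 * real pages of zeroes.
	 */
	static bool bdev_supports_write_zeroes(struct block_device *bdev)
	{
		return bdev_write_zeroes_sectors(bdev) != 0;
	}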