From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.5 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,MENTIONS_GIT_HOSTING,SPF_PASS,USER_AGENT_MUTT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E143BC282C3 for ; Thu, 24 Jan 2019 14:46:00 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id B7DC3218AE for ; Thu, 24 Jan 2019 14:46:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728288AbfAXOqA (ORCPT ); Thu, 24 Jan 2019 09:46:00 -0500 Received: from nautica.notk.org ([91.121.71.147]:47712 "EHLO nautica.notk.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728243AbfAXOp7 (ORCPT ); Thu, 24 Jan 2019 09:45:59 -0500 Received: by nautica.notk.org (Postfix, from userid 1001) id D5127C009; Thu, 24 Jan 2019 15:45:54 +0100 (CET) Date: Thu, 24 Jan 2019 15:45:39 +0100 From: Dominique Martinet To: Hou Tao Cc: v9fs-developer@lists.sourceforge.net, lucho@ionkov.net, ericvh@gmail.com, linux-fsdevel@vger.kernel.org, xingaopeng@huawei.com Subject: Re: [PATCH v3 2/2] 9p: use i_lock to protect update of inode->i_blocks under 32-bit Message-ID: <20190124144539.GA1457@nautica> References: <20190124063514.8571-1-houtao1@huawei.com> <20190124063514.8571-3-houtao1@huawei.com> <20190124072532.GA16424@nautica> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Hou Tao wrote on Thu, Jan 24, 2019: > > hmm, part of me wants to abuse v9fs_i_size_write to also update i_blocks > > as I think we'd usually want to keep both in sync.We can do that for > > v9fs_stat2inode(), but for v9fs_stat2inode_dotl() > > the updates of i_block & i_size are controlled by different flags, so > separated updates are needed (maybe also add an extra flags ?). Is there any case where i_block and i_size are not updated together? I would be happy with an extra flag if required, I just do not see where right now (admitedly didn't look for long, will have a second look tomorrow) > > In v9fs_file_write_iter we do not update i_blocks directly but > > inode_add_bytes basically does the same thing... Speaking of which it's > > assuming inode->i_bytes is set correctly initially; I _think_ we're > > supposed to call inode_set_bytes() instead of writing i_blocks directly > > in stat2inode? > > Sorry, I'm not too familiar with this code and there doesn't seem to be > > many users of inode_set_bytes so it's a bit puzzling to me. > > In my understanding, now only disk quota uses i_bytes, so for 9p there is not so > much different between updating i_blocks directly and using inode_set_bytes(), > but maybe using inode_set_bytes() will be appropriate. I also think only quota uses i_bytes directly, but 9p also uses it indirectly through inode_add_bytes() - i_blocks is incremented by 1 every 512 bytes and i_bytes is reduced to its last few bits (basically %512) > > I'm not using cache mode much but I have some very weird behaviour > > currently if I mount something with cache=fscache : > > - if I create a new file from 9p and stat on the same client, a size of > > 0-511 has block = 0 then it's just always off by 1 > Is this not the expected result ? v9fs_write_end() will call > inode_add_bytes(inode,511) if we write 511 bytes to the new file and > i_blocks will be 0. That is the obvious result from the current code, yes - but it definitely is wrong. i_blocks should be 1 from byte 1 onwards until 512, and bumped to 2 at 513. From a filesystem point of view, a block is already "consumed" from the first byte onward. (This isn't what inode_set_bytes does..) In practice, this actually is fairly important - tools like tar are optimized such that if a file has no blocks it is considered sparse and isn't read at all, so this could lead to data loss. Interestingly I cannot reproduce with a recent gnu tar but I am sure I have seen the behaviour (some lustre bug[1] talks about this as well and the code is still present[2]); we should not report a st_blocks of 0 is there is data -- v9fs_stat2inode (not v9fs_stat2inode_dotl) also rounds up i_blocks so this is the right behaviour, we're just not doing it when creating new files apparently. [1] https://jira.whamcloud.com/browse/LU-3864 [2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c#n273 If this optimization didn't exist and given i_bytes existence it does seem more correct to do like inode_set_bytes (e.g. blocks = bytes / 512 and bytes = bytes % 512) but this really strikes me as odd; this might be why hardly anyone uses this... > > - if I create a file outside of 9p and stat it, for example starting > > with 400 bytes, it starts with 8 blocks then the number of blocks > > increase by 1 everytime 512 bytes are written regardless of initial > > size. > > The initial i_blocks comes from the underlying filesystem, so I think 8 blocks > just mean the block size of the underlying filesystem is 4KB. If another 512 bytes > are written into the file, inode_add_bytes() will increase i_blocks by one instead of > fetching i_blocks from server, though i_blocks in server side will > still be 8. Ah, I had only looked at v9fs_stat2inode() which creates the value from i_size, not v9fs_stat2inode_dotl(). v9fs_stat2inode_dotl() does take the value from the RPC, and then 8 is indeed correct -- it will just be very painful to keep synchronized with the underlying fs; we have no ways of knowing when the remote filesystem decides to allocate new blocks. If for example the write iter only writes 0 and the underlying fs is optimized st_blocks might not increase at all. > Maybe we should call v9fs_invalidate_inode_attr() in v9fs_write_end() ? Hmm, That will probably add quite a lot of RPCs and make the cache option rather useless... Honestly as far as I'm concerned the value of i_blocks is mostly informative (it's used by tools like du to give a size estimate, and that's about it?) - with the exception of the 0 value I was talking earlier. I'd advocate to never display the remote st_blocks but do like v9fs_stat2inode() and always just set blocks appropriately from the size (i.e. i_blocks = (i_size + 512 - 1) >> 9 and i_bytes = 0), but I'd like some opinions first. I'll send a mail to fsdevel asking vfs folks for advices tomorrow; let's put this patch on hold for a few days. Thanks, -- Dominique