linux-btrfs.vger.kernel.org archive mirror
From: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
To: Chris Mason <chris.mason@oracle.com>
Cc: Mike Fedyk <mfedyk@mikefedyk.com>,
	Linux Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: The value displayed by 'ls -s' command is strange.
Date: Wed, 08 Dec 2010 09:15:34 +0900
Message-ID: <4CFECE26.1050403@jp.fujitsu.com>
In-Reply-To: <1291752549-sup-5932@think>


(2010/12/08 5:15), Chris Mason wrote:
> Excerpts from Mike Fedyk's message of 2010-12-07 15:07:08 -0500:
>> On Tue, Dec 7, 2010 at 11:29 AM, Chris Mason <chris.mason@oracle.com> wrote:
>>> Excerpts from Mike Fedyk's message of 2010-12-07 14:16:55 -0500:
>>>> On Tue, Dec 7, 2010 at 10:44 AM, Chris Mason <chris.mason@oracle.com> wrote:
>>>>> Excerpts from Tsutomu Itoh's message of 2010-12-07 02:59:52 -0500:
>>>>>> Hi,
>>>>>>
>>>>>> I expected the disk allocation size of each file to increase
>>>>>> monotonically while the file is being written.
>>>>>> But it sometimes returns to 0.  Is that correct?
>>>>>
>>>>> Well, there's a window during the processing of delayed allocation where
>>>>> we don't have the bytes recorded as delalloc and we don't have the bytes
>>>>> recorded in the inode yet.  That's why they are showing up as zero.
>>>>>
>>>>> We don't call inode_add_bytes() until after we insert the extent, but we
>>>>> drop the delalloc byte count on the file before the IO is done.
>>>>>
>>>>> Fixing it will be a little tricky because all the extent accounting
>>>>> assumes the inode_add_bytes happens at extent insertion time.
>>>>>
>>>>
>>>> How does opening the inode with O_APPEND during this window know where
>>>> to write the bytes?  If it's a pointer/cursor to the EOF then that
>>>> size could be used during the window.  Is that right?
>>>
>>> This counter records the number of blocks allocated to the file, and
>>> reading it with ls -l or stat is somewhat racy by nature.  Most of the
>>> time it's fine, btrfs just has a really big window where the results from
>>> ls -l seem wrong.
>>>
>>
>> I see.  Is it using per-cpu vars or something similar?
> 
> Our stat function returns the block count in the inode plus the number
> of bytes we have accounted as delayed allocation.
> 
> As we do writes to the file, the delayed allocation count goes up and
> then eventually we decide we need to do some IO.
> 
> Before we do the IO, we have to decide where on the disk to write the
> extents.  Once that is decided, we decrement the count of delayed
> allocation bytes.
> 
> This is when stat starts returning the wrong answer.
> 
> Then we do the IO, and when the IO is done we actually insert the file
> extents into the file metadata.  This is when stat starts returning the
> right answer again.

Understood.
However, I worry that users will be confused, because the window during
which the wrong value is reported is so long.
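
The sequence quoted above can be sketched as a toy model. This is plain
Python, not btrfs code: the class and method names are hypothetical, and
the only things taken from the discussion are the two counters (bytes
recorded in the inode, bytes accounted as delayed allocation) and the
order in which they change.

```python
from dataclasses import dataclass


@dataclass
class InodeModel:
    """Toy model of the two counters that stat adds together (hypothetical names)."""
    inode_bytes: int = 0     # updated only when the extent is inserted
    delalloc_bytes: int = 0  # bytes accounted as delayed allocation

    def stat_bytes(self) -> int:
        # What stat reports in this model: inode count plus delalloc count.
        return self.inode_bytes + self.delalloc_bytes

    def buffered_write(self, n: int) -> None:
        # A write to the file raises the delayed allocation count.
        self.delalloc_bytes += n

    def start_io(self, n: int) -> None:
        # Disk location decided: the delalloc count is dropped before the IO.
        self.delalloc_bytes -= n

    def finish_io(self, n: int) -> None:
        # IO done: the extent is inserted and the inode count is updated.
        self.inode_bytes += n


def trace(n: int = 4096) -> list:
    """Return what stat would report after each of the three steps."""
    ino = InodeModel()
    seen = []
    ino.buffered_write(n)
    seen.append(ino.stat_bytes())
    ino.start_io(n)
    seen.append(ino.stat_bytes())
    ino.finish_io(n)
    seen.append(ino.stat_bytes())
    return seen
```

In this model the middle sample is 0, which corresponds to the window in
which 'ls -s' shows the value going back to zero.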

> 
> The whole setup sounds strange, but this is how btrfs implements the
> semantics from data=ordered.  We don't update the file to point to
> the new blocks until after the IO is done, so we never have to wait on
> the data IO before we can do a transaction commit.  It avoids all kinds
> of latencies with fsync and other problems.
> 
> One easy solution is to just add another counter in the in-memory inode
> for the number of bytes in flight that aren't accounted for in other
> places.  But I'd rather not make the inode any bigger, so I'll have to
> think if we can solve this another way.
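
A minimal sketch of the "another counter" idea quoted above, again as a
toy Python model and not as the fix that was actually chosen. The
inflight_bytes field and all method names are assumptions made for
illustration; the cost mentioned in the quote (a larger in-memory inode)
shows up here as the third field.

```python
from dataclasses import dataclass


@dataclass
class InodeWithInflight:
    """Toy model with a third counter for bytes in flight (hypothetical names)."""
    inode_bytes: int = 0     # updated only when the extent is inserted
    delalloc_bytes: int = 0  # bytes accounted as delayed allocation
    inflight_bytes: int = 0  # assumed extra counter: allocated, IO not finished

    def stat_bytes(self) -> int:
        # stat in this model sums all three counters.
        return self.inode_bytes + self.delalloc_bytes + self.inflight_bytes

    def buffered_write(self, n: int) -> None:
        self.delalloc_bytes += n

    def start_io(self, n: int) -> None:
        # Move the bytes from delalloc to in-flight instead of dropping them.
        self.delalloc_bytes -= n
        self.inflight_bytes += n

    def finish_io(self, n: int) -> None:
        # Extent inserted: move the bytes from in-flight to the inode count.
        self.inflight_bytes -= n
        self.inode_bytes += n


def trace_with_inflight(n: int = 4096) -> list:
    """Return what stat would report after each of the three steps."""
    ino = InodeWithInflight()
    seen = []
    ino.buffered_write(n)
    seen.append(ino.stat_bytes())
    ino.start_io(n)
    seen.append(ino.stat_bytes())
    ino.finish_io(n)
    seen.append(ino.stat_bytes())
    return seen
```

With the extra counter the reported value no longer drops during the IO
in this model; every sample equals the number of bytes written.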
> 
>>
>>> But, the counter really means nothing to the btrfs internals.  When we
>>> do file operations we go based on the extent pointers we find in the
>>> tree and i_size (i_size is strictly maintained).
>>>
>>
>> Would it be too heavy of an operation to have stat walk the btrfs tree
>> to get its data?
>>
> 
> I'm afraid so, stat is fairly performance critical.
> 
> -chris


