From: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
To: Chris Mason <chris.mason@oracle.com>
Cc: Mike Fedyk <mfedyk@mikefedyk.com>,
	Linux Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: The value displayed by 'ls -s' command is strange.
Date: Wed, 08 Dec 2010 09:15:34 +0900
Message-ID: <4CFECE26.1050403@jp.fujitsu.com>
In-Reply-To: <1291752549-sup-5932@think>


(2010/12/08 5:15), Chris Mason wrote:
> Excerpts from Mike Fedyk's message of 2010-12-07 15:07:08 -0500:
>> On Tue, Dec 7, 2010 at 11:29 AM, Chris Mason <chris.mason@oracle.com> wrote:
>>> Excerpts from Mike Fedyk's message of 2010-12-07 14:16:55 -0500:
>>>> On Tue, Dec 7, 2010 at 10:44 AM, Chris Mason <chris.mason@oracle.com> wrote:
>>>>> Excerpts from Tsutomu Itoh's message of 2010-12-07 02:59:52 -0500:
>>>>>> Hi,
>>>>>>
>>>>>> I would expect the disk allocation size of each file to increase
>>>>>> monotonically while the file is being written.
>>>>>> But it sometimes returns to 0.  Is that correct?
>>>>>
>>>>> Well, there's a window during the processing of delayed allocation where
>>>>> we don't have the bytes recorded as delalloc and we don't have the bytes
>>>>> recorded in the inode yet.  That's why they are showing up as zero.
>>>>>
>>>>> We don't call inode_add_bytes() until after we insert the extent, but we
>>>>> drop the delalloc byte count on the file before the IO is done.
>>>>>
>>>>> Fixing it will be a little tricky because all the extent accounting
>>>>> assumes the inode_add_bytes happens at extent insertion time.
>>>>>
>>>>
>>>> How does opening the inode with O_APPEND during this window know where
>>>> to write the bytes?  If it's a pointer/cursor to the EOF then that
>>>> size could be used during the window.  Is that right?
>>>
>>> This counter records the number of blocks allocated to the file, and
>>> reading it with ls -l or stat is somewhat racy by nature.  Most of the
>>> time it's fine, btrfs just has a really big window where the results from
>>> ls -l seem wrong.
>>>
>>
>> I see.  Is it using per-cpu vars or something similar?
> 
> Our stat function returns the block count in the inode plus the number
> of bytes we have accounted as delayed allocation.
> 
> As we do writes to the file, the delayed allocation count goes up and
> then eventually we decide we need to do some IO.
> 
> Before we do the IO, we have to decide where on the disk to write the
> extents.  Once that is decided, we decrement the count of delayed
> allocation bytes.
> 
> This is when stat starts returning the wrong answer.
> 
> Then we do the IO, and when the IO is done we actually insert the file
> extents into the file metadata.  This is when stat starts returning the
> right answer again.

Understood.
However, I worry that users will be confused, because the window during
which the wrong value is displayed is too long.
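
To make the window concrete, here is a minimal sketch of the accounting as
described above. It is a toy model, not btrfs code: the class ToyInode and
its method names are hypothetical, and it only assumes what was stated in
this thread, namely that stat reports the inode byte count plus the delalloc
byte count, that the delalloc count is dropped once the on-disk location is
chosen, and that the inode count is raised only after the IO completes.

```python
from dataclasses import dataclass


@dataclass
class ToyInode:
    """Hypothetical model of the two counters discussed in this thread."""
    inode_bytes: int = 0     # bytes recorded in the inode (extents inserted)
    delalloc_bytes: int = 0  # bytes accounted as delayed allocation

    def buffered_write(self, nbytes: int) -> None:
        # A write lands in memory and is counted as delayed allocation.
        self.delalloc_bytes += nbytes

    def allocate_for_io(self, nbytes: int) -> None:
        # The on-disk location is chosen; the delalloc count is dropped
        # before the IO has been done.
        self.delalloc_bytes -= nbytes

    def io_complete(self, nbytes: int) -> None:
        # The extent is inserted into the file metadata after the IO.
        self.inode_bytes += nbytes

    def stat_bytes(self) -> int:
        # What stat would report in this toy model.
        return self.inode_bytes + self.delalloc_bytes


def trace_write(nbytes: int) -> list[int]:
    """Return what stat reports after each of the three phases."""
    inode = ToyInode()
    seen = []
    inode.buffered_write(nbytes)
    seen.append(inode.stat_bytes())
    inode.allocate_for_io(nbytes)
    seen.append(inode.stat_bytes())
    inode.io_complete(nbytes)
    seen.append(inode.stat_bytes())
    return seen
```

In this model, trace_write(4096) yields 4096, then 0, then 4096 again. The
middle value corresponds to the period in which 'ls -s' shows 0.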

> 
> The whole setup sounds strange, but this is how btrfs implements the
> semantics from data=ordered.  We don't update the file to point to
> the new blocks until after the IO is done, so we never have to wait on
> the data IO before we can do a transaction commit.  It avoids all kinds
> of latencies with fsync and other problems.
> 
> One easy solution is to just add another counter in the in-memory inode
> for the number of bytes in flight that aren't accounted for in other
> places.  But I'd rather not make the inode any bigger, so I'll have to
> think if we can solve this another way.
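
As an illustration only, the "another counter" idea could look like the toy
model below. The name inflight_bytes and the class ToyInodeWithInflight are
hypothetical; the sketch assumes that bytes move from the delalloc count to
an in-flight count when the on-disk location is chosen, and from there to
the inode count when the IO completes. It says nothing about how this would
fit the real extent accounting.

```python
from dataclasses import dataclass


@dataclass
class ToyInodeWithInflight:
    """Hypothetical model with one extra in-memory counter."""
    inode_bytes: int = 0     # bytes recorded in the inode
    delalloc_bytes: int = 0  # bytes accounted as delayed allocation
    inflight_bytes: int = 0  # assumed extra counter for bytes under IO

    def buffered_write(self, nbytes: int) -> None:
        self.delalloc_bytes += nbytes

    def allocate_for_io(self, nbytes: int) -> None:
        # Move the bytes to the in-flight counter instead of dropping them.
        self.delalloc_bytes -= nbytes
        self.inflight_bytes += nbytes

    def io_complete(self, nbytes: int) -> None:
        # The extent is inserted; the bytes leave the in-flight counter.
        self.inflight_bytes -= nbytes
        self.inode_bytes += nbytes

    def stat_bytes(self) -> int:
        return self.inode_bytes + self.delalloc_bytes + self.inflight_bytes


def trace_write_with_inflight(nbytes: int) -> list[int]:
    """Return what stat reports after each phase with the extra counter."""
    inode = ToyInodeWithInflight()
    seen = []
    for phase in (inode.buffered_write, inode.allocate_for_io,
                  inode.io_complete):
        phase(nbytes)
        seen.append(inode.stat_bytes())
    return seen
```

Here the reported value stays at nbytes through all three phases, at the
cost of one more field per in-memory inode, which is the trade-off raised
above.
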
> 
>>
>>> But, the counter really means nothing to the btrfs internals.  When we
>>> do file operations we go based on the extent pointers we find in the
>>> tree and i_size (i_size is strictly maintained).
>>>
>>
>> Would it be too heavy of an operation to have stat walk the btrfs tree
>> to get its data?
>>
> 
> I'm afraid so, stat is fairly performance critical.
> 
> -chris
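
For readers who want to observe the two values separately from user space,
the sketch below reads both through stat. It assumes a platform where
Python's os.stat exposes st_blocks (counted in 512-byte units); the helper
name size_and_blocks is hypothetical. st_size corresponds to i_size, while
st_blocks is the block count that 'ls -s' reports and that may be
transiently low during the window discussed in this thread.

```python
import os
import tempfile


def size_and_blocks(payload: bytes):
    """Write payload to a temp file; return (st_size, allocated_bytes).

    allocated_bytes is None where st_blocks is not available.
    """
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(payload)
        f.flush()
        path = f.name
    try:
        st = os.stat(path)
        # st_blocks counts 512-byte blocks; it is missing on some platforms.
        blocks = getattr(st, "st_blocks", None)
        allocated = None if blocks is None else blocks * 512
        return st.st_size, allocated
    finally:
        os.unlink(path)
```

The size returned always matches the number of bytes written. The allocated
value depends on the filesystem and on when stat is called, so no particular
number should be expected from it.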

