From: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
To: Chris Mason <chris.mason@oracle.com>
Cc: Mike Fedyk <mfedyk@mikefedyk.com>,
Linux Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: The value displayed by 'ls -s' command is strange.
Date: Wed, 08 Dec 2010 09:15:34 +0900
Message-ID: <4CFECE26.1050403@jp.fujitsu.com>
In-Reply-To: <1291752549-sup-5932@think>
(2010/12/08 5:15), Chris Mason wrote:
> Excerpts from Mike Fedyk's message of 2010-12-07 15:07:08 -0500:
>> On Tue, Dec 7, 2010 at 11:29 AM, Chris Mason <chris.mason@oracle.com> wrote:
>>> Excerpts from Mike Fedyk's message of 2010-12-07 14:16:55 -0500:
>>>> On Tue, Dec 7, 2010 at 10:44 AM, Chris Mason <chris.mason@oracle.com> wrote:
>>>>> Excerpts from Tsutomu Itoh's message of 2010-12-07 02:59:52 -0500:
>>>>>> Hi,
>>>>>>
>>>>>> I expected the disk allocation size of a file to increase monotonically
>>>>>> while the file is being written.
>>>>>> But it sometimes returns to 0. Is that correct?
>>>>>
>>>>> Well, there's a window during the processing of delayed allocation where
>>>>> we don't have the bytes recorded as delalloc and we don't have the bytes
>>>>> recorded in the inode yet. That's why they are showing up as zero.
>>>>>
>>>>> We don't call inode_add_bytes() until after we insert the extent, but we
>>>>> drop the delalloc byte count on the file before the IO is done.
>>>>>
>>>>> Fixing it will be a little tricky because all the extent accounting
>>>>> assumes the inode_add_bytes happens at extent insertion time.
>>>>>
>>>>
>>>> How does opening the inode with O_APPEND during this window know where
>>>> to write the bytes? If it's a pointer/cursor to the EOF then that
>>>> size could be used during the window. Is that right?
>>>
>>> This counter records the number of blocks allocated to the file, and
>>> reading it with ls -l or stat is somewhat racy by nature. Most of the
>>> time it's fine; btrfs just has a really big window where the results from
>>> ls -l seem wrong.
>>>
>>
>> I see. Is it using per-cpu vars or something similar?
>
> Our stat function returns the block count in the inode plus the number
> of bytes we have accounted as delayed allocation.
>
> As we do writes to the file, the delayed allocation count goes up and
> then eventually we decide we need to do some IO.
>
> Before we do the IO, we have to decide where on the disk to write the
> extents. Once that is decided, we decrement the count of delayed
> allocation bytes.
>
> This is when stat starts returning the wrong answer.
>
> Then we do the IO, and when the IO is done we actually insert the file
> extents into the file metadata. This is when stat starts returning the
> right answer again.

I understand.
However, I am worried that users will be confused, because the window
during which the wrong value is displayed is too long.
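
The sequence described above can be sketched as a toy model. This is not btrfs code: the class name ToyInode, its fields, and its method names are all hypothetical, and the only thing taken from the explanation is the rule that stat reports the bytes recorded in the inode plus the bytes accounted as delayed allocation.

```python
class ToyInode:
    """Toy model of the accounting described in the thread; not btrfs code."""

    def __init__(self):
        self.inode_bytes = 0     # bytes recorded in the inode (extents inserted)
        self.delalloc_bytes = 0  # bytes accounted as delayed allocation

    def buffered_write(self, n):
        # A write lands in memory and is counted as delayed allocation.
        self.delalloc_bytes += n

    def allocate_extent(self, n):
        # The on-disk location is chosen and the delalloc count is dropped,
        # but the IO has not completed yet.
        self.delalloc_bytes -= n

    def finish_io(self, n):
        # The IO is done and the extent is inserted into the file metadata;
        # this stands in for the inode_add_bytes() step.
        self.inode_bytes += n

    def stat_bytes(self):
        # What stat would report in this model.
        return self.inode_bytes + self.delalloc_bytes


def trace_write(n=4096):
    """Return the value stat reports after each of the three steps."""
    inode = ToyInode()
    seen = []
    inode.buffered_write(n)
    seen.append(inode.stat_bytes())
    inode.allocate_extent(n)
    seen.append(inode.stat_bytes())  # inside the window
    inode.finish_io(n)
    seen.append(inode.stat_bytes())
    return seen
```

In this model the middle reading is the window in question: the bytes have left the delayed allocation count but have not yet been added to the inode, so the reported size drops to zero until the IO completes.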
>
> The whole setup sounds strange, but this is how btrfs implements the
> semantics from data=ordered. We don't update the file to point to
> the new blocks until after the IO is done, so we never have to wait on
> the data IO before we can do a transaction commit. It avoids all kinds
> of latencies with fsync and other problems.
>
> One easy solution is to just add another counter in the in-memory inode
> for the number of bytes in flight that aren't accounted for in other
> places. But I'd rather not make the inode any bigger, so I'll have to
> think if we can solve this another way.
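
The "easy solution" mentioned above, an extra in-memory counter for bytes in flight, might look like the following in a toy model. This is only an illustration of that idea under assumed names (ToyInodeWithInflight, inflight_bytes); it is not what btrfs does, and it deliberately ignores the cost of growing the in-memory inode, which is the stated objection.

```python
class ToyInodeWithInflight:
    """Toy model with an extra in-flight counter; names are hypothetical."""

    def __init__(self):
        self.inode_bytes = 0     # bytes recorded in the inode
        self.delalloc_bytes = 0  # bytes accounted as delayed allocation
        self.inflight_bytes = 0  # bytes allocated on disk, IO not yet finished

    def buffered_write(self, n):
        self.delalloc_bytes += n

    def allocate_extent(self, n):
        # Move the bytes from delalloc to in-flight instead of dropping them.
        self.delalloc_bytes -= n
        self.inflight_bytes += n

    def finish_io(self, n):
        # Move the bytes from in-flight to the inode once the extent is inserted.
        self.inflight_bytes -= n
        self.inode_bytes += n

    def stat_bytes(self):
        # stat now sums all three counters, so no bytes are ever uncounted.
        return self.inode_bytes + self.delalloc_bytes + self.inflight_bytes


def trace_write_with_inflight(n=4096):
    """Return the value stat reports after each of the three steps."""
    inode = ToyInodeWithInflight()
    seen = []
    for step in (inode.buffered_write, inode.allocate_extent, inode.finish_io):
        step(n)
        seen.append(inode.stat_bytes())
    return seen
```

With the third counter, every byte is always held by exactly one of the three fields, so the reported size no longer dips while the IO is in progress.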
>
>>
>>> But, the counter really means nothing to the btrfs internals. When we
>>> do file operations we go based on the extent pointers we find in the
>>> tree and i_size (i_size is strictly maintained).
>>>
>>
>> Would it be too heavy of an operation to have stat walk the btrfs tree
>> to get its data?
>>
>
> I'm afraid so, stat is fairly performance critical.
>
> -chris