From: Mike Fedyk <mfedyk@mikefedyk.com>
To: Chris Mason <chris.mason@oracle.com>
Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>,
Linux Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: The value displayed by 'ls -s' command is strange.
Date: Tue, 7 Dec 2010 14:06:51 -0800
Message-ID: <AANLkTinwzdjw4oCa9ZjGg6mRSputs9s1vCSx5jDq3U8q@mail.gmail.com>
In-Reply-To: <1291752549-sup-5932@think>
On Tue, Dec 7, 2010 at 12:15 PM, Chris Mason <chris.mason@oracle.com> wrote:
> Excerpts from Mike Fedyk's message of 2010-12-07 15:07:08 -0500:
>> On Tue, Dec 7, 2010 at 11:29 AM, Chris Mason <chris.mason@oracle.com> wrote:
>> > Excerpts from Mike Fedyk's message of 2010-12-07 14:16:55 -0500:
>> >> On Tue, Dec 7, 2010 at 10:44 AM, Chris Mason <chris.mason@oracle.com> wrote:
>> >> > Excerpts from Tsutomu Itoh's message of 2010-12-07 02:59:52 -0500:
>> >> >> Hi,
>> >> >>
>> >> >> I think that the disk allocation size of each file becomes a monotone increase
>> >> >> when the file is made.
>> >> >> But, it sometimes return to 0. Is it correct?
>> >> >
>> >> > Well, there's a window during the processing of delayed allocation where
>> >> > we don't have the bytes recorded as delalloc and we don't have the bytes
>> >> > recorded in the inode yet. That's why they are showing up as zero.
>> >> >
>> >> > We don't call inode_add_bytes() until after we insert the extent, but we
>> >> > drop the delalloc byte count on the file before the IO is done.
>> >> >
>> >> > Fixing it will be a little tricky because all the extent accounting
>> >> > assumes the inode_add_bytes happens at extent insertion time.
>> >> >
>> >>
>> >> How does opening the inode with O_APPEND during this window know where
>> >> to write the bytes? If it's a pointer/cursor to the EOF then that
>> >> size could be used during the window. Is that right?
>> >
>> > This counter records the number of blocks allocated to the file, and
>> > reading it with ls -l or stat is somewhat racey by nature. Most of the
>> > time its fine, btrfs just has a really big window where the results from
>> > ls -l seem wrong.
>> >
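For reference, the counter under discussion is what userspace sees as st_blocks. A minimal sketch of reading it next to the logical file size, assuming Linux, where st_blocks is counted in 512-byte units; the helper names (allocated_bytes, logical_size, demo) are made up for illustration:

```python
import os
import tempfile


def allocated_bytes(path):
    """Disk space the filesystem currently reports as allocated to path."""
    # Assumption: st_blocks is in 512-byte units, as it is on Linux.
    return os.stat(path).st_blocks * 512


def logical_size(path):
    """Logical file length, the size column that ls -l prints."""
    return os.stat(path).st_size


def demo(n=8192):
    """Write n bytes to a temporary file and return (size, allocated)."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"x" * n)
        f.flush()
        os.fsync(f.fileno())
        return logical_size(f.name), allocated_bytes(f.name)
```

The logical size follows the write immediately; the allocated figure is the one that can lag or dip while delayed allocation is being processed.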
>>
>> I see. Is it using per-cpu vars or something similar?
>
Ok, so to make sure I fully understand, I'm going to write some pseudocode
based on your description.
> Our stat function returns the block count in the inode plus the number
> of bytes we have accounted as delayed allocation.
>
stat = inode_a1.bytes + inode_a1_delayed_allocation_bytes
> As we do writes to the file, the delayed allocation count goes up and
> then eventually we decide we need to do some IO.
>
> Before we do the IO, we have to decide where on the disk to write the
> extents.
inode_a2 = inode_a1
inode_a1 and inode_a2 are the same inode, but inode_a2 has a different
list of extents and is not written yet (in the case of appending, most
of the extents will be the same in the two extent lists, but inode_a2
will have more extents for the newly appended data)
> Once that is decided, we decrement the count of delayed
> allocation bytes.
>
> This is when stat starts returning the wrong answer.
>
inode_a2.bytes += inode_a1_delayed_allocation_bytes
inode_a1_delayed_allocation_bytes -= inode_a1_delayed_allocation_bytes
stat = inode_a1.bytes + inode_a1_delayed_allocation_bytes
Is it possible to have stat read from inode_a2 during this window?
So it would be instead:
stat = inode_a2.bytes
> Then we do the IO, and when the IO is done we actually insert the file
> extents into the file metadata. This is when stat starts returning the
> right answer again.
>
/* implicit when write completes */
inode_a1 = inode_a2
kfree(inode_a2)
stat = inode_a1.bytes + inode_a1_delayed_allocation_bytes
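
To check the three phases end to end, here is a small runnable toy model of the accounting as described above. It is only a sketch of the description in this thread, not btrfs code; InodeModel, its methods, and trace_stat are hypothetical names.

```python
class InodeModel:
    """Toy model of the stat accounting described in the thread."""

    def __init__(self):
        self.bytes = 0           # bytes recorded in the inode (extents inserted)
        self.delalloc_bytes = 0  # bytes accounted as delayed allocation

    def stat_bytes(self):
        # stat reports the inode count plus the delayed allocation count.
        return self.bytes + self.delalloc_bytes

    def buffered_write(self, n):
        # A write only raises the delayed allocation count.
        self.delalloc_bytes += n

    def start_io(self):
        # Extent placement is decided: the delalloc count is dropped
        # before the IO is done, and nothing is added to the inode yet.
        pending = self.delalloc_bytes
        self.delalloc_bytes = 0
        return pending

    def finish_io(self, pending):
        # IO completed: extents are inserted and the inode count goes up.
        self.bytes += pending


def trace_stat(n=4096):
    """Return what stat would report after the write, during IO, after IO."""
    inode = InodeModel()
    seen = []
    inode.buffered_write(n)
    seen.append(inode.stat_bytes())
    pending = inode.start_io()
    seen.append(inode.stat_bytes())
    inode.finish_io(pending)
    seen.append(inode.stat_bytes())
    return seen
```

With n=4096, trace_stat returns [4096, 0, 4096]: the middle value is the window where stat reports zero.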
> The whole setup sounds strange, but this is how btrfs implements the
> semantics from data=ordered. We don't update the file to point to
> the new blocks until after the IO is done, so we never have to wait on
> the data IO before we can do a transaction commit. It avoids all kinds
> of latencies with fsync and other problems.
>
> One easy solution is to just add another counter in the in-memory inode
> for the number of bytes in flight that aren't accounted for in other
> places. But I'd rather not make the inode any bigger, so I'll have to
> think if we can solve this another way.
>
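
As I understand the extra-counter idea, it would look roughly like the toy model below. This is a sketch under my own assumptions, not a patch: in_flight_bytes and the other names are hypothetical, and the cost you mention (a larger in-memory inode) is exactly the added field.

```python
class InodeWithInFlight:
    """Toy model with a third counter for bytes whose IO is in flight."""

    def __init__(self):
        self.bytes = 0            # bytes recorded in the inode
        self.delalloc_bytes = 0   # bytes accounted as delayed allocation
        self.in_flight_bytes = 0  # hypothetical counter: IO started, extent not inserted

    def stat_bytes(self):
        # Every byte is counted in exactly one of the three counters.
        return self.bytes + self.delalloc_bytes + self.in_flight_bytes

    def buffered_write(self, n):
        self.delalloc_bytes += n

    def start_io(self):
        # Move the bytes from delalloc to in-flight instead of dropping them.
        self.in_flight_bytes += self.delalloc_bytes
        self.delalloc_bytes = 0

    def finish_io(self):
        # Extent insertion time: move the in-flight bytes into the inode count.
        self.bytes += self.in_flight_bytes
        self.in_flight_bytes = 0


def trace_stat_in_flight(n=4096):
    """Return what stat would report after the write, during IO, after IO."""
    inode = InodeWithInFlight()
    seen = []
    inode.buffered_write(n)
    seen.append(inode.stat_bytes())
    inode.start_io()
    seen.append(inode.stat_bytes())
    inode.finish_io()
    seen.append(inode.stat_bytes())
    return seen
```

In this model stat never dips during the IO window, because the bytes only move between counters and are never dropped.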