From: Dave Chinner <david@fromorbit.com>
To: Yongqiang Yang <xiaoqiangnk@gmail.com>
Cc: "Andreas Dilger" <adilger@dilger.ca>,
"Theodore Tso" <tytso@mit.edu>,
"Eric Sandeen" <sandeen@sandeen.net>, xfs-oss <xfs@oss.sgi.com>,
"coreutils@gnu.org" <coreutils@gnu.org>,
"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
"Pádraig Brady" <P@draigbrady.com>,
"Markus Trippelsdorf" <markus@trippelsdorf.de>
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)
Date: Tue, 19 Apr 2011 17:45:38 +1000
Message-ID: <20110419074538.GG23985@dastard>
In-Reply-To: <BANLkTinjh968ECqAobQ677hnV5yzke1ncw@mail.gmail.com>
On Tue, Apr 19, 2011 at 02:53:20PM +0800, Yongqiang Yang wrote:
> On Tue, Apr 19, 2011 at 11:44 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Apr 19, 2011 at 09:58:15AM +0800, Yongqiang Yang wrote:
> >> On Mon, Apr 18, 2011 at 10:45 AM, Andreas Dilger <adilger@dilger.ca> wrote:
> >> > On 2011-04-17, at 6:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> >
> >> > On Sat, Apr 16, 2011 at 08:21:28AM -0400, Theodore Tso wrote:
> >> >
> >> > On Apr 16, 2011, at 1:11 AM, Andreas Dilger wrote:
> >> >
> >> > In that case, it means cp should just always use FIEMAP_FLAG_SYNC, which is
> >> > fine.
> >> >
> >> > Except that if someone is copying a large delay allocated file, it will
> >> > cause the file to be immediately snapped to disk, which might not be the
> >> > greatest thing in the world.
> >> >
> >> > Obvious workaround - if the initial fiemap call shows unwritten
> >> > extents, redo it with the sync flag set. Though that assumes that
> >> > you can trust things like delalloc extents to only cover the range
> >> > that valid data exists in. Which, of course, you can't assume,
> >> > either. :/
> >> >
> >> > Always passing FIEMAP_FLAG_SYNC is fine in this case. It should only do
> >> > anything if there is unwritten data, which is the only case we are concerned
> >> > with at this point. In any case, this is a simple solution for coreutils
> >> > until such a time that a more complex solution is added in the kernel (if
> >> > ever).
> >> >
> >> > Christoph is right, SEEK_HOLE and SEEK_DATA are a much better API for
> >> > what cp would like to do. Unfortunately it hasn't been implemented yet
> >> > in the VFS...
> >> >
> >> > Agreed, SEEK_HOLE/SEEK_DATA is the right way to solve this problem.
> >> >
> >> > I don't see how this will change the problem in any meaningful way. There
> >> > will still need to be code that is traversing the on-disk mapping, and also
> >> > keeping it coherent with unwritten data in the page cache.
> >>
> >> It seems that we are being messed up by page cache and disk.
> >> Unwritten flag returned from FIEMAP indicates blocks on disk are not
> >> written, but it does not say if there is data in page cache. So
> >> FIEMAP itself just tells user the map on disk. However there is an
> >> exception for delayed allocation, FIEMAP tells users the data is in
> >> page cache.
> >
> > No, FIEMAP does not tell the user there is data in the page cache.
> > > It tells the user there is a delayed allocation extent. For XFS, a
> > > delayed allocation extent can cover a range _greater_ than there is
> > > data in the page cache - we do allocation alignment, speculative
> > > allocation and other tricks to avoid fragmentation via
> > > delayed allocation. When XFS says there is a delalloc extent, it is
> > > simply showing the in-memory representation of the extent. If you
> > want to know where the data in the page cache actually is, you need
> > to sync the file to disk to get those ranges converted to real
> > extents. This is how xfs_bmap has worked for 15 years....
> >
> >> Maybe FIEMAP should return all known messages for unwritten extent, if
> >> unwritten data exists in page cache, FIEMAP should let users know that
> >> data is in page cache and space on disk has been preallocated, but
> >> data has not been flushed into disk. Actually, delayed allocation has
> >> done like this. Then user-space applications can determine how to do.
> >> Taking cp as an example, it will copy from page cache rather than
> >> ignore it.
> >
> > Once again, FIEMAP is for showing the filesystem's current extent
> > state, not the page cache state. Ext4 may implement FIEMAP by doing
> > page cache walks, but that is a filesystem specific implementation
> > detail.
> >
> >> We need a definite definition for FIEMAP, in other words, it tells
> >> users map on disk or both disk and page cache.
> >
> > We already have a definition - and it has nothing to do with the
> > page cache state.
> >
> >> If the former one is taken, then FIEMAP should not consider
> >> delayed allocation.
> >
> > > Not at all. The delayed allocation extent is a first class extent
> > type in XFS and it is reported directly from the extent list. Your
> > viewpoint is very ext4-specific and ignores the fact that other
> > filesystems were doing this sort of mapping long before even ext3
> > (let alone ext4) was a glint in the designer's eye....
> >
> >> otherwise, FIEMAP should return all known messages for unwritten case
> >> like delayed allocation.
> >
> > See my previous comments about extents being unwritten until data is
> > physically written to them.
> Understood, thank you for your explanation.
>
> Ok. Let's look at it from a higher view. What you described about
> extent state is specific to xfs.
>
> I think there are 2 ways to provide a definite definition for FIEMAP
> for all filesystems:
>
> 1. FIEMAP returns extent state on disk.
> 2. FIEMAP returns extent both in memory and on disk.
You are *not listening*. There is no #2. FIEMAP returns the extent
state _on disk_ at the time of the call. If you want it to reflect
the in-memory state at the time of the call (for data or metadata),
you *must* use the SYNC flag to convert that in-memory state to
on-disk state, which FIEMAP then reports just fine.
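
For illustration, here is a minimal sketch of what passing the SYNC flag
looks like from user space, assuming the Linux FS_IOC_FIEMAP ioctl from
<linux/fs.h> and the structures in <linux/fiemap.h>. The helper names
extent_state() and fiemap_synced() are hypothetical and are not taken from
coreutils or any filesystem; error handling is deliberately simple.

```c
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: name the state a FIEMAP extent reports. */
static const char *extent_state(unsigned int fe_flags)
{
    if (fe_flags & FIEMAP_EXTENT_DELALLOC)
        return "delalloc";   /* in-memory reservation, no blocks on disk yet */
    if (fe_flags & FIEMAP_EXTENT_UNWRITTEN)
        return "unwritten";  /* blocks allocated, but they read back as zeros */
    return "data";
}

/*
 * Hypothetical helper: ask the kernel to write back dirty data for fd
 * (FIEMAP_FLAG_SYNC), then fetch up to max extents of the whole file.
 * Returns the number of extents copied into out, or -1 on error
 * (for example, a bad descriptor or a filesystem without FIEMAP support).
 */
static int fiemap_synced(int fd, struct fiemap_extent *out, unsigned int max)
{
    struct fiemap *fm;
    int n;

    if (!out || max == 0)
        return -1;

    fm = calloc(1, sizeof(*fm) + max * sizeof(struct fiemap_extent));
    if (!fm)
        return -1;

    fm->fm_start = 0;
    fm->fm_length = FIEMAP_MAX_OFFSET;  /* map the whole file */
    fm->fm_flags = FIEMAP_FLAG_SYNC;    /* sync the file before mapping */
    fm->fm_extent_count = max;          /* room for this many extents */

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
        free(fm);
        return -1;
    }

    /* The kernel never reports more extents than fm_extent_count here. */
    n = (int)fm->fm_mapped_extents;
    memcpy(out, fm->fm_extents, (size_t)n * sizeof(*out));
    free(fm);
    return n;
}
```

A caller would open the source file, call fiemap_synced(fd, extents, N),
and pass each extent's fe_flags to extent_state(). A file larger than N
extents needs repeated calls with fm_start advanced past the last extent
returned, which this sketch leaves out.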
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com