Efficient handling of sparse files

linux-fsdevel.vger.kernel.org archive mirror

* Efficient handling of sparse files
@ 2005-02-28 17:41 Matthew Wilcox
  2005-02-28 17:44 ` Jeremy Allison
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Matthew Wilcox @ 2005-02-28 17:41 UTC
  To: linux-fsdevel

This problem came up with the systemimager program which uses rsync to
install files from a master server to many clients.  Red Hat has a system
user with uid 2^32-1 which causes lastlog to grow to 1.2GB in size.
rsync does understand the concept of sparse files (with the -S flag), but
it has to read every block to discover that it is indeed empty.  This sucks.

I was wondering if we could introduce a new system call (or ioctl?) that,
given an fd would find the next block with data in it.  We could use the
->bmap method ... except that has dire warnings about adding new callers
and viro may soon be in testicle-gouging range.

One system interface hack would be to introduce lseek(fd, 0, SEEK_DATA)
... but without permission to reuse ->bmap for this purpose, it's
pointless to discuss user interfaces.
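
As an illustration only, here is a minimal user-space sketch of how a copy
tool could use such an lseek() interface.  It assumes a libc that defines
SEEK_DATA together with a companion SEEK_HOLE (glibc on Linux exposes both
under _GNU_SOURCE), where SEEK_DATA moves to the next offset at or after
the given one that contains data and fails with ENXIO when there is none.
print_data_regions() is a hypothetical helper name, not part of any
existing tool.

```c
#define _GNU_SOURCE		/* exposes SEEK_DATA / SEEK_HOLE in glibc */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Print every data region of fd and return the total number of bytes
 * that lie in data regions, or -1 on error.  Holes are skipped without
 * reading them.  Note that lseek() moves the file offset of fd.
 */
off_t print_data_regions(int fd)
{
	off_t off = 0, total = 0;

	assert(fd >= 0);
	for (;;) {
		/* Find the start of the next data region at or after off. */
		off_t data = lseek(fd, off, SEEK_DATA);
		if (data < 0) {
			if (errno == ENXIO)	/* no more data before EOF */
				break;
			return -1;
		}
		/* Find where that data region ends (EOF counts as a hole). */
		off_t hole = lseek(fd, data, SEEK_HOLE);
		if (hole <= data)	/* error, or file changed under us */
			return -1;
		printf("data: [%lld, %lld)\n", (long long)data, (long long)hole);
		total += hole - data;
		off = hole;
	}
	return total;
}
```

A caller such as rsync could then read only the ranges reported as data and
recreate everything else as holes on the destination.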

Suggestions?

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain


* Re: Efficient handling of sparse files
  2005-02-28 17:41 Efficient handling of sparse files Matthew Wilcox
@ 2005-02-28 17:44 ` Jeremy Allison
  2005-02-28 20:13   ` Bryan Henderson
  2005-02-28 18:57 ` Szakacsits Szabolcs
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Jeremy Allison @ 2005-02-28 17:44 UTC
  To: Matthew Wilcox; +Cc: linux-fsdevel

On Mon, Feb 28, 2005 at 05:41:49PM +0000, Matthew Wilcox wrote:
> 
> This problem came up with the systemimager program which uses rsync to
> install files from a master server to many clients.  Red Hat has a system
> user with uid 2^32-1 which causes lastlog to grow to 1.2GB in size.
> rsync does understand the concept of sparse files (with the -S flag), but
> it has to read every block to discover that it is indeed empty.  This sucks.
> 
> I was wondering if we could introduce a new system call (or ioctl?) that,
> given an fd would find the next block with data in it.  We could use the
> ->bmap method ... except that has dire warnings about adding new callers
> and viro may soon be in testicle-gouging range.
> 
> One system interface hack would be to introduce lseek(fd, 0, SEEK_DATA)
> ... but without permission to reuse ->bmap for this purpose, it's
> pointless to discuss user interfaces.
> 
> Suggestions?

This is very similar to the Windows ability to do a query to
get the block map of a sparse file. Might be worth looking at
that interface to see what we can learn.

Jeremy.


* Re: Efficient handling of sparse files
  2005-02-28 17:41 Efficient handling of sparse files Matthew Wilcox
  2005-02-28 17:44 ` Jeremy Allison
@ 2005-02-28 18:57 ` Szakacsits Szabolcs
  2005-02-28 19:55 ` Zach Brown
  2005-02-28 20:40 ` Anton Altaparmakov
  3 siblings, 0 replies; 10+ messages in thread
From: Szakacsits Szabolcs @ 2005-02-28 18:57 UTC
  To: Matthew Wilcox; +Cc: linux-fsdevel


On Mon, 28 Feb 2005, Matthew Wilcox wrote:

> This problem came up with the systemimager program which uses rsync to
> install files from a master server to many clients.  Red Hat has a system
> user with uid 2^32-1 which causes lastlog to grow to 1.2GB in size.
> rsync does understand the concept of sparse files (with the -S flag), but
> it has to read every block to discover that it is indeed empty.  This sucks.

XFS supports what you want. I made a related benchmark some years ago, if
you are interested,

	http://marc.theaimsgroup.com/?l=reiserfs&m=105827549109079
 
> I was wondering if we could introduce a new system call (or ioctl?) that,
> given an fd would find the next block with data in it.  

I think this is a bad idea.  The XFS or the NTFS interface could be a
better starting point.

	Szaka



* Re: Efficient handling of sparse files
  2005-02-28 17:41 Efficient handling of sparse files Matthew Wilcox
  2005-02-28 17:44 ` Jeremy Allison
  2005-02-28 18:57 ` Szakacsits Szabolcs
@ 2005-02-28 19:55 ` Zach Brown
  2005-02-28 20:40 ` Anton Altaparmakov
  3 siblings, 0 replies; 10+ messages in thread
From: Zach Brown @ 2005-02-28 19:55 UTC
  To: Matthew Wilcox; +Cc: linux-fsdevel

> I was wondering if we could introduce a new system call (or ioctl?) that,
> given an fd would find the next block with data in it.  We could use the
> ->bmap method ... except that has dire warnings about adding new callers
> and viro may soon be in testicle-gouging range.

Hmm.  What you're talking about reminds me of some ioctl()s Alex has for
ext3+extents and it feels like the pagevec apis that want to find
populated pages across the page cache index space.

Sooo.

struct fs_extent {
	u64 file_start;
	u64 block_start;
	u64 contig;
};

(I don't really care if those are in bytes or blocks or whatever.
someone with strong opinions can pick a unit :))

long sys_find_extents_please(int fd, off_t file_start,
	struct fs_extent *extents, long nr_extents);

so it'll fill in as many extent structs in the caller as it finds
contiguous regions on disk starting with the given file position,
returning the number populated.

I'd, somewhat obviously, want to push this into an fs method perhaps
with a generic_ that just spins on bmap().
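
A rough user-space sketch of what such a generic fallback might look like,
under stated assumptions: block_map[i] stands in for whatever ->bmap would
return for logical block i, all units are blocks, and 0 means a hole (the
->bmap convention).  generic_find_extents() and its parameters are
hypothetical names chosen for illustration, not an existing kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct fs_extent {
	uint64_t file_start;	/* first logical block of the extent */
	uint64_t block_start;	/* first on-disk block of the extent */
	uint64_t contig;	/* number of contiguous blocks */
};

/*
 * Walk the per-block mapping starting at logical block file_start and
 * merge physically contiguous runs into extents.  Holes are skipped.
 * Returns the number of fs_extent entries filled in (at most nr_extents).
 */
long generic_find_extents(const uint64_t *block_map, uint64_t nr_blocks,
			  uint64_t file_start,
			  struct fs_extent *extents, long nr_extents)
{
	long n = 0;
	uint64_t lblk = file_start;

	assert(block_map != NULL && extents != NULL);
	while (lblk < nr_blocks && n < nr_extents) {
		uint64_t pblk = block_map[lblk];

		if (pblk == 0) {	/* hole: nothing mapped here */
			lblk++;
			continue;
		}
		struct fs_extent *e = &extents[n++];
		e->file_start = lblk;
		e->block_start = pblk;
		e->contig = 1;
		/* Extend while the next logical block is the next disk block. */
		while (++lblk < nr_blocks &&
		       block_map[lblk] == pblk + e->contig)
			e->contig++;
	}
	return n;
}
```

A caller that gets back nr_extents entries would call again, starting just
past the last extent returned, until fewer than nr_extents come back.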

I think this would let Alex kill his ioctl() and ocfs2 could certainly
fill this with reasonable results.

To move into lala land, I wonder if we would want to consider the
difference between mapped blocks with data and mapped blocks that
haven't been touched and which are going to return zeros. One could
argue that it's marginally ridiculous that there isn't a shared
interface to reserve blocks without having to manually zero them.  If we
did have such an interface, something like rsync doesn't actually care
that they're mapped if the fs knows that they're still just zeros.  I
acknowledge that this is a bit out there :)

In any case, if you want help rolling proofs-of-concept I could lend a
few hours.

- z


* Re: Efficient handling of sparse files
  2005-02-28 17:44 ` Jeremy Allison
@ 2005-02-28 20:13   ` Bryan Henderson
  2005-02-28 21:49     ` Jamie Lokier
  0 siblings, 1 reply; 10+ messages in thread
From: Bryan Henderson @ 2005-02-28 20:13 UTC
  To: Jeremy Allison; +Cc: linux-fsdevel, Matthew Wilcox

>This is very similar to the Windows ability to do a query to
>get the block map of a sparse file. Might be worth looking at
>that interface to see what we can learn.

XDSM (better but incorrectly known by the generic term DMAPI) also has one 
of those, for use in migrating or backing up sparse files and restoring 
them to their original sparseness.

I'd resist any interface that exposes implementation details like that. 
The user program shouldn't know anything about block allocations.

On the other hand, I can see the value in exposing the concept of a clear 
section of file (a hole), as distinct from one filled with zeroes.

I once had to deal with this in a system that would have to transfer mass 
quantities of zero bytes over a network for sparse files.  I found then 
that the most convenient interface was a new form of the read call.  It 
returned an indicator of whether the offsets being read were clear or 
filled plus, if filled, the values.  If clear, the values are by 
definition zero.  At boundaries between clear and filled sections of the 
file, it would do a short read.  Otherwise, the semantics were pretty much 
the same as classic Unix character stream read.
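
To make those semantics concrete, here is a sketch that emulates such a 
read call in user space.  It is not the interface from that system: it 
assumes a libc with SEEK_DATA and SEEK_HOLE (glibc on Linux, under 
_GNU_SOURCE), and read_sparse() and dump_regions() are hypothetical names.

```c
#define _GNU_SOURCE		/* exposes SEEK_DATA / SEEK_HOLE in glibc */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Limit len to span, which is always positive where this is called. */
static size_t clamp_len(size_t len, off_t span)
{
	return (unsigned long long)span < len ? (size_t)span : len;
}

/*
 * Read up to len bytes at offset off.  On return *clear says whether the
 * bytes are in a clear section (buf untouched, values are zero by
 * definition) or a filled one (values copied into buf).  The read is
 * short at a boundary between clear and filled sections.  Returns the
 * byte count, 0 at end of file, or -1 on error.  Moves the file offset.
 */
ssize_t read_sparse(int fd, off_t off, void *buf, size_t len, bool *clear)
{
	assert(buf != NULL && clear != NULL);

	off_t data = lseek(fd, off, SEEK_DATA);
	if (data < 0) {
		if (errno != ENXIO)
			return -1;
		/* No data at or after off: trailing hole or end of file. */
		off_t end = lseek(fd, 0, SEEK_END);
		if (end < 0)
			return -1;
		if (off >= end)
			return 0;
		*clear = true;
		return (ssize_t)clamp_len(len, end - off);
	}
	if (data > off) {		/* off is inside a hole */
		*clear = true;
		return (ssize_t)clamp_len(len, data - off);
	}
	off_t hole = lseek(fd, off, SEEK_HOLE);	/* end of this filled run */
	if (hole <= off)
		return -1;
	*clear = false;
	return pread(fd, buf, clamp_len(len, hole - off), off);
}

/* Usage example: list the sections of fd, return the filled byte count. */
long long dump_regions(int fd)
{
	char buf[4096];
	bool clear;
	off_t off = 0;
	long long filled = 0;
	ssize_t n;

	while ((n = read_sparse(fd, off, buf, sizeof buf, &clear)) > 0) {
		printf("%s: %lld bytes at %lld\n", clear ? "clear" : "filled",
		       (long long)n, (long long)off);
		if (!clear)
			filled += n;
		off += n;
	}
	return n < 0 ? -1 : filled;
}
```

Because buf is never touched for a clear section, the byte count returned 
for a hole is limited only by len, not by how much memory was read into.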

My interface didn't have the ability to tell you how far the hole extends 
without you having to allocate a buffer that big (because you don't know 
until you do the read if you're reading a hole or not), but that seems 
like a reasonable addition.

If someone's expending development effort on exploiting file sparseness, 
I'd rather see it spent implementing a clear (aka punch) system call 
first.  Or has that been done when I wasn't looking?

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems


* Re: Efficient handling of sparse files
  2005-02-28 17:41 Efficient handling of sparse files Matthew Wilcox
                   ` (2 preceding siblings ...)
  2005-02-28 19:55 ` Zach Brown
@ 2005-02-28 20:40 ` Anton Altaparmakov
  2005-02-28 20:53   ` Zach Brown
  3 siblings, 1 reply; 10+ messages in thread
From: Anton Altaparmakov @ 2005-02-28 20:40 UTC
  To: Matthew Wilcox; +Cc: linux-fsdevel

On Mon, 28 Feb 2005, Matthew Wilcox wrote:
> This problem came up with the systemimager program which uses rsync to
> install files from a master server to many clients.  Red Hat has a system
> user with uid 2^32-1 which causes lastlog to grow to 1.2GB in size.
> rsync does understand the concept of sparse files (with the -S flag), but
> it has to read every block to discover that it is indeed empty.  This sucks.
> 
> I was wondering if we could introduce a new system call (or ioctl?) that,
> given an fd would find the next block with data in it.  We could use the
> ->bmap method ... except that has dire warnings about adding new callers
> and viro may soon be in testicle-gouging range.
> 
> One system interface hack would be to introduce lseek(fd, 0, SEEK_DATA)
> ... but without permission to reuse ->bmap for this purpose, it's
> pointless to discuss user interfaces.
> 
> Suggestions?

Please keep one thing in mind and that is that there are file systems 
where ->bmap actually makes no sense whatsoever.  NTFS is an example.  
There you can have compressed or encrypted files, and in both cases you 
do not have any blocks on disk where you can read/write the actual data.  
In addition to those you also have resident files, where the file content 
itself is stored inside the on-disk inode (at a variable and unaligned 
offset).  Here again there is no block a ->bmap could return that would 
contain only the file data: it would also contain metadata, and the data 
would certainly not start at a block boundary.

This is one of the reasons why no one should be using ->bmap.  It is a 
stupid interface that only fits very particular sets of file systems and 
cannot be applied generically.

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


* Re: Efficient handling of sparse files
  2005-02-28 20:40 ` Anton Altaparmakov
@ 2005-02-28 20:53   ` Zach Brown
  2005-03-01  7:50     ` Anton Altaparmakov
  0 siblings, 1 reply; 10+ messages in thread
From: Zach Brown @ 2005-02-28 20:53 UTC
  To: Anton Altaparmakov; +Cc: Matthew Wilcox, linux-fsdevel

> Please keep one thing in mind and that is that there are file systems 
> where ->bmap actually makes no sense whatsoever

Of course, so return -ESORRY.

> This is one of the reasons why no one should be using ->bmap.  It is a 
> stupid interface that only fits very particular sets of file systems and 
> cannot be applied generically.

No, it's a reason to only ask about the details of block mapping in
cases where it actually makes sense (like, wanting to find out if
concurrent file extension is getting good batched contiguous allocation,
etc).  Just because file systems x, y, and z can't answer the question
meaningfully doesn't mean it isn't a reasonable thing to ask of file
systems m, n, and o.

Now, I'm not at all opposed to an explicit sparse-testing interface that
doesn't confuse that functionality with querying specific block mappings.

- z


* Re: Efficient handling of sparse files
  2005-02-28 20:13   ` Bryan Henderson
@ 2005-02-28 21:49     ` Jamie Lokier
  2005-03-01 18:37       ` Bryan Henderson
  0 siblings, 1 reply; 10+ messages in thread
From: Jamie Lokier @ 2005-02-28 21:49 UTC
  To: Bryan Henderson; +Cc: Jeremy Allison, linux-fsdevel, Matthew Wilcox

Bryan Henderson wrote:
> I'd resist any interface that exposes implementation details like that. 
> The user program shouldn't know anything about block allocations.

A database or file scanner that must read a lot of data can benefit
from having even a rough idea of the layout of the data on disk.

When you have to scan a lot of data, the performance improvement from
sweeping across a disk in block order can be significant - more so than
trying to schedule lots of asynchronous I/O for the elevator to sort.

But despite this use, ->bmap is not ideal for that type of
optimisation because it does not always correspond to position on the
physical device.
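
A small sketch of the block-order idea, purely for illustration: it
assumes the scanner has obtained a rough starting block for each file from
some layout query (the source of that number is left open here), and it
simply sorts the work list by that number before reading.  struct
scan_item and sort_for_scan() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct scan_item {
	const char *name;	/* file to be scanned */
	uint64_t first_block;	/* rough on-disk position, however obtained */
};

/* qsort() comparator: ascending by first_block, without overflow. */
static int by_block(const void *a, const void *b)
{
	const struct scan_item *x = a, *y = b;

	return (x->first_block > y->first_block) -
	       (x->first_block < y->first_block);
}

/* Reorder the work list so that reads sweep across the disk once. */
void sort_for_scan(struct scan_item *items, size_t n)
{
	assert(items != NULL || n == 0);
	qsort(items, n, sizeof *items, by_block);
}
```

The scanner would then open and read the files in the sorted order rather
than in directory order.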

-- Jamie


* Re: Efficient handling of sparse files
  2005-02-28 20:53   ` Zach Brown
@ 2005-03-01  7:50     ` Anton Altaparmakov
  0 siblings, 0 replies; 10+ messages in thread
From: Anton Altaparmakov @ 2005-03-01  7:50 UTC
  To: Zach Brown; +Cc: Matthew Wilcox, linux-fsdevel

On Mon, 28 Feb 2005, Zach Brown wrote:
> > Please keep one thing in mind and that is that there are file systems 
> > where ->bmap actually makes no sense whatsoever
> 
> Of course, so return -ESORRY.

Ah, but it gets even worse: ->bmap uses 0 to mean sparse, whereas in NTFS 
0 is a valid block number and so cannot mean sparse.  Sparse needs its own 
value outside the range 0 to 2^63-1.  Internally in NTFS I use -1 for 
sparse, for example.

> > This is one of the reasons why no one should be using ->bmap.  It is a 
> > stupid interface that only fits very particular sets of file systems and 
> > cannot be applied generically.
> 
> No, it's a reason to only ask about the details of block mapping in
> cases where it actually makes sense (like, wanting to find out if
> concurrent file extension is getting good batched contiguous allocation,
> etc).  Just because file systems x, y, and z can't answer the question
> meaningfully doesn't mean it isn't a reasonable thing to ask of file
> systems m, n, and o.
> 
> Now, I'm not at all opposed to an explicit sparse-testing interface that
> doesn't confuse that functionality with querying specific block mappings.

That's cool.  Just an array of n zero-data [file offset, length] entries 
would be sufficient, no?  Note that I think reporting just sparse regions 
is not as good as reporting zero-data regions, because on NTFS you can 
have on-disk blocks allocated while the file is marked as empty beyond a 
certain offset.  In that case the on-disk blocks contain random garbage 
and the driver just knows to fill in zeros on read (i.e. no disk access 
is needed at all on reads) and to do actual writes to disk on a non-zero 
write.  This is actually used in Windows by applications (Office and 
Outlook, for example) that want to have guaranteed storage allocations 
(e.g. for the mail INBOX) so deliveries cannot fail, but that also want to 
efficiently clear the file contents beyond a certain offset.  I suppose 
the ntfs driver could simply pretend that this allocated but 
non-initialized space is sparse if a sys_get_sparse_regions() rather than 
a sys_get_zero_regions() is implemented, so it wouldn't be such a big 
problem.
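
As a sketch of how a caller might consume such an array, assuming the 
entries come back sorted by offset and non-overlapping: the function below 
turns the zero-data list into the list of ranges that actually need to be 
copied.  struct zero_region and data_ranges() are hypothetical names, and 
the units are bytes here only for the sake of the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct zero_region {
	uint64_t offset;	/* file offset where the range starts */
	uint64_t length;	/* length of the range */
};

/*
 * Given the zero-data regions of a file of the given size, write the
 * complementary ranges (the ones holding real data) to out, reusing the
 * same offset/length layout.  Returns the number of ranges written, or
 * -1 if out has room for fewer than are needed.
 */
long data_ranges(const struct zero_region *zero, size_t nr_zero,
		 uint64_t size, struct zero_region *out, size_t nr_out)
{
	size_t n = 0;
	uint64_t pos = 0;	/* end of the previous zero region */

	assert(out != NULL || nr_out == 0);
	for (size_t i = 0; i <= nr_zero; i++) {
		/* Data runs from pos up to the next zero region or EOF. */
		uint64_t next = (i < nr_zero && zero[i].offset < size)
				? zero[i].offset : size;

		if (next > pos) {
			if (n == nr_out)
				return -1;
			out[n].offset = pos;
			out[n].length = next - pos;
			n++;
		}
		if (i == nr_zero || next == size)
			break;
		pos = zero[i].offset + zero[i].length;
	}
	return (long)n;
}
```

Whether the regions come from a sys_get_sparse_regions() or a 
sys_get_zero_regions() style call makes no difference to this consumer, 
which is the point of reporting zero-data rather than allocation.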

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


* Re: Efficient handling of sparse files
  2005-02-28 21:49     ` Jamie Lokier
@ 2005-03-01 18:37       ` Bryan Henderson
  0 siblings, 0 replies; 10+ messages in thread
From: Bryan Henderson @ 2005-03-01 18:37 UTC
  To: Jamie Lokier; +Cc: Jeremy Allison, linux-fsdevel, Matthew Wilcox

>A database or file scanner that must read a lot of data can benefit
>from having even a rough idea of the layout of the data on disk.

True.  There's always room for interfaces that dive into the lower layers 
for those users who want to be there.  (Of course, you end up crossing a 
line fairly quickly where you shouldn't be pretending to use a filesystem 
at all and should just use a block disk).

But I first want to see an abstract interface where an application can 
recognize cleared regions of file without actually knowing anything about 
how the filesystem represents them or what the filesystem does with them. 
In particular, there's no reason to give up the character stream notion of 
a file and start talking about blocks just to have visible cleared regions 
(holes).

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems


Thread overview: 10+ messages
2005-02-28 17:41 Efficient handling of sparse files Matthew Wilcox
2005-02-28 17:44 ` Jeremy Allison
2005-02-28 20:13   ` Bryan Henderson
2005-02-28 21:49     ` Jamie Lokier
2005-03-01 18:37       ` Bryan Henderson
2005-02-28 18:57 ` Szakacsits Szabolcs
2005-02-28 19:55 ` Zach Brown
2005-02-28 20:40 ` Anton Altaparmakov
2005-02-28 20:53   ` Zach Brown
2005-03-01  7:50     ` Anton Altaparmakov
