* fragmentation && blocks "realloc"
@ 2006-01-20 11:47 Jan Koss
  2006-01-20 13:34 ` Anton Altaparmakov
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Koss @ 2006-01-20 11:47 UTC (permalink / raw)
To: kernelnewbies; +Cc: linux-fsdevel

Hello.

Suppose we have a file that consists of two blocks, and the user resizes
the file so that it now needs 4 blocks. There are no 2 free blocks near
these two blocks, so instead of allocating 2 additional blocks somewhere
else, I want to allocate a chunk of 4 blocks.

The main problem is choosing a way to invalidate the "old" blocks and
copy their data to new buffers. How is this possible on Linux? Something
like:

	struct buffer_head *oldbh, *newbh;
	memcpy(newbh->b_data, oldbh->b_data, blocksize);
	block_invalidatepage(oldbh->b_this_page, ...);

[is block_invalidatepage the right choice?]

Or is it possible to just change b_blocknr?

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: fragmentation && blocks "realloc"
  2006-01-20 11:47 fragmentation && blocks "realloc" Jan Koss
@ 2006-01-20 13:34 ` Anton Altaparmakov
  2006-01-20 15:46   ` Jan Koss
  0 siblings, 1 reply; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-20 13:34 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel

On Fri, 2006-01-20 at 14:47 +0300, Jan Koss wrote:
> Hello.
>
> Let's suppose that we have file which consist of two blocks
> and user resizing file and now we need 4 blocks.
>
> Near this two blocks there are no 2 free blocks,
> and instead of allocating 2 additional blocks somewhere,
> I want allocate chunk of 4 blocks.
>
> The main problem is choose way of invalidate "old" blocks and copy
> data to new buffers,
>
> how it possible on linux?
>
> something like
> struct buffer_head *oldbh, *newbh;
> memcpy(newbh->b_data, oldbh->b_data);
> block_invalidatepage(oldbh->b_this_page,...)

No need to invalidate or copy anything as long as you are working inside
a file system driver and those buffers are attached to the page cache of
a file.

> or it is possible just change b_blocknr?

Yes, just change b_blocknr and mark the buffer dirty so it gets written
out to the new location, or indeed you can do the write (or the
submission thereof) yourself if you want.

Note that since you are effectively "allocating" the buffer(s), after
you have done the block allocation on your file system and updated
bh->b_blocknr, you need to call unmap_underlying_metadata(bh->b_bdev,
bh->b_blocknr); for each block before you write it out, or your write
could get trampled on by a different write.

And of course do not forget to deallocate the two blocks you just freed
in your fs...
(-:

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
  2006-01-20 13:34 ` Anton Altaparmakov
@ 2006-01-20 15:46   ` Jan Koss
  2006-01-20 19:22     ` Jan Koss
  2006-01-20 20:04     ` Anton Altaparmakov
  0 siblings, 2 replies; 13+ messages in thread
From: Jan Koss @ 2006-01-20 15:46 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: kernelnewbies, linux-fsdevel

In fact, I expected "yes" for the first possibility and "no" for the
second :)

Now the code looks like:

	bh = sb_bread(sb, oldblock);
	if (!bh)
		goto err;
	bh->b_blocknr = newblk;
	mark_buffer_dirty(bh);
	unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);

Let's suppose this test case: after the situation I described in the
first email, the user resizes the file again and the new size is 5
blocks, and there are no free blocks except the 2 blocks which we
deallocated in the first email, so we have to allocate them.

When I reproduced this test case, I got messages like this from the
kernel:

	__find_get_block_slow failed block=oldblock...

So as far as I can see, I have missed something in the "art of changing
b_blocknr". The error in __find_get_block_slow can only happen if all
buffers on the page are mapped. Maybe this is because I changed the
buffer_head's b_blocknr but did not change b_this_page?

On 1/20/06, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> On Fri, 2006-01-20 at 14:47 +0300, Jan Koss wrote:
> > Hello.
> >
> > Let's suppose that we have file which consist of two blocks
> > and user resizing file and now we need 4 blocks.
> >
> > Near this two blocks there are no 2 free blocks,
> > and instead of allocating 2 additional blocks somewhere,
> > I want allocate chunk of 4 blocks.
> >
> > The main problem is choose way of invalidate "old" blocks and copy
> > data to new buffers,
> >
> > how it possible on linux?
> >
> > something like
> > struct buffer_head *oldbh, *newbh;
> > memcpy(newbh->b_data, oldbh->b_data);
> > block_invalidatepage(oldbh->b_this_page,...)
>
> No need to invalidate or copy anything as long as you are working inside
> a file system driver and those buffers are attached to page cache of a
> file.
>
> > or it is possible just change b_blocknr?
>
> Yes, just change b_blocknr, and mark the buffer dirty so it gets written
> out to the new location or indeed you can do the write (or submission
> thereof) yourself if you want.
>
> Note since you are effectively "allocating" the buffer(s), after you
> have done the block allocation on your file system and updated
> bh->b_blocknr, you need to call unmap_underlying_metadata(bh->b_bdev,
> bh->b_blocknr); for each block before you write it out or your write
> could get trampled on by a different write.
>
> And of course do not forget to deallocate the two blocks you just freed
> in your fs... (-:
>
> Best regards,
>
> Anton
> --
> Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
> Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
> Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
> WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
  2006-01-20 15:46   ` Jan Koss
@ 2006-01-20 19:22     ` Jan Koss
  2006-01-20 20:11       ` Anton Altaparmakov
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Koss @ 2006-01-20 19:22 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: kernelnewbies, linux-fsdevel

In comparison with this:

> bh = sb_bread(sb, oldblock);
> if (!bh)
> 	goto err;
> bh->b_blocknr = newblk;
> mark_buffer_dirty (bh);
> unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);

code like this did not cause any warnings:

	struct buffer_head *newbh;

	bh = sb_bread(sb, oldblock);
	newbh = sb_bread(sb, newblock);
	if (!bh || !newbh)
		goto err;

	memcpy(newbh->b_data, bh->b_data, sb->s_blocksize);
	mark_buffer_dirty(newbh);
	brelse(bh);
	brelse(newbh);
	invalidate_inode_buffers(inode);

But it is not optimal; what you suggest is much better, but...
* Re: fragmentation && blocks "realloc"
  2006-01-20 19:22     ` Jan Koss
@ 2006-01-20 20:11       ` Anton Altaparmakov
  2006-01-21  9:42         ` Jan Koss
  0 siblings, 1 reply; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-20 20:11 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel

On Fri, 20 Jan 2006, Jan Koss wrote:
> In comparison with this
> > bh = sb_bread(sb, oldblock);
> > if (!bh)
> > goto err;
> > bh->b_blocknr = newblk;
> > mark_buffer_dirty (bh);
> > unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
>
> this code like this didn't cause any "warrnings":
> struct buffer_head *newbh;
>
> bh = sb_bread(sb, oldblock);
> newbh = sb_bread(sb, newblock);
> if (!(bh || newbh))
> goto err;
>
> memcpy(newbh->b_data, bh->b_data, sb->s_blocksize);
> mark_buffer_dirty(newbh);
> brelse(bh);
> brelse(newbh);
> invalidate_inode_buffers(inode);

Yes, that is almost correct. Although it is wrong. (-;

You do not want the invalidate_inode_buffers() call. It makes no sense
for your fs at all, given how you are dealing with the buffers with
sb_bread()/brelse()... Your method never attaches buffers to the inode,
so there is no point in trying to invalidate anything. It will all just
work fine. (Unless you have omitted to say things about your fs that are
important. Why don't you show all your code rather than just those
snippets, and then proper advice can be given...)

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
  2006-01-20 20:11       ` Anton Altaparmakov
@ 2006-01-21  9:42         ` Jan Koss
  2006-01-21 20:28           ` Anton Altaparmakov
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Koss @ 2006-01-21 9:42 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: kernelnewbies, linux-fsdevel

> It will all just work
> fine. (Unless you have omitted to say things about your fs that are
> important. Why don't you show all your code rather than just those
> snippets and then proper advice can be given...)

The fs is just a simple analog of ufs/ext2/minix/sysv. It is block
oriented, and I suppose that working with pages instead of blocks makes
it more complicated than it should be.

> You do not want the invalidate_inode_buffers() call. It makes no sense

Great, we have reached the point. Yes, my file system is based on using
sb_bread/brelse. Like an ordinary file system, my file system implements
readpage and writepage, similar to:

	static int sysv_writepage(struct page *page, struct writeback_control *wbc)
	{
		return block_write_full_page(page, get_block, wbc);
	}

	static int sysv_readpage(struct file *file, struct page *page)
	{
		return block_read_full_page(page, get_block);
	}

and get_block does something like:

	map_bh(...);

So, when we "realloc" blocks, what happens to these "old" mapped (or
otherwise used) blocks? How can I prevent use of the "old" blocks
instead of the "new" blocks? Or if I mark both the old and the new
blocks dirty, will everything be all right?

--
Kernelnewbies: Help each other learn about the Linux kernel.
Archive: http://mail.nl.linux.org/kernelnewbies/
FAQ: http://kernelnewbies.org/faq/
* Re: fragmentation && blocks "realloc"
  2006-01-21  9:42         ` Jan Koss
@ 2006-01-21 20:28           ` Anton Altaparmakov
  2006-01-22 20:58             ` Jan Koss
  0 siblings, 1 reply; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-21 20:28 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel

On Sat, 21 Jan 2006, Jan Koss wrote:
> >It will all just work
> >fine. (Unless you have omitted to say things about your fs that are
> >important. Why don't you show all your code rather than just those
> >snippets and then proper advice can be given...)
>
> fs is just a simple analog of ufs/ext2/minix/sysv. It is block

All of the above are page cache users, not block device oriented at all.

> oriented, and I suppose that working with pages, instead of blocks
> make it more complicated, then it should to be.

It also makes it very slow not to use the page cache...

> >You do not want the invalidate_inode_buffers() call. It makes no sense
>
> Great, we reached the point.
> Yes, my file system based on usage sb_bread/brelse.

Right, so nothing like the other file systems you compare yourself to,
then. None of them are sb_bread/brelse based.

> As ordinary file system my file system implements readpage and writepage,
> it is similar to
> static int sysv_writepage(struct page *page, struct writeback_control *wbc)
> {
> return block_write_full_page(page,get_block,wbc);
> }
> static int sysv_readpage(struct file *file, struct page *page)
> {
> return block_read_full_page(page,get_block);
> }
>
> get_block make such thing
> map_bh(...)

Err, so you are page cache based and not sb_bread/brelse based at all. I
think you are confused... (-;

> So, when we "realloc" blocks, what happen with these "old" mapped (or
> used in some other way) blocks?
>
> How can I prevent usage "old" blocks instead of "new" blocks?
> Or if I mark old and new blocks as dirty all will be right?

You cannot do the reallocation using your method if the above page cache
functions are used like that by your fs.
You need to do it the way I showed you first, i.e. without
sb_bread/brelse, as those make no sense whatsoever for you. (They access
the block device directly, completely bypassing the page cache, so you
are breaking cache coherency and are 100% broken by design.)

You seem to be extremely confused, I am afraid. The only way to help you
is to see your whole file system code, unless you start becoming clearer
about what you are really doing, so that you don't keep making
contradictory statements in two successive sentences...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
  2006-01-21 20:28           ` Anton Altaparmakov
@ 2006-01-22 20:58             ` Jan Koss
  2006-01-22 21:32               ` Anton Altaparmakov
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Koss @ 2006-01-22 20:58 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: kernelnewbies, linux-fsdevel

Hello.

> (They access the block device directly, completely
> bypassing the page cache so you are breaking cache coherency and are 100%
> broken by design.)

Oh... I thought that starting from 2.4.x there is no separate
implementation for working with blocks versus pages: when you read a
block, the kernel reads the whole page. Am I wrong?

> They only way to help you
> is to see your whole file system code

If we need something concrete for the discussion, let's talk about minix
v1 (my file system derives from this code). Let's suppose I want to make
the block allocation algorithm in fs/minix/bitmap.c:minix_new_block()
more intelligent. I should say that the minix code uses sb_bread/brelse
and also works with pages (for example fs/minix/dir.c).

So instead of allocating one additional block, I want to "realloc" the
blocks, so that the whole file occupies several consecutive blocks. And
we stop at code like this:

	bh->b_blocknr = newblk;
	unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
	mark_buffer_dirty(bh);

And the question is: how should I get this _bh_ if I cannot use
sb_bread?
* Re: fragmentation && blocks "realloc"
  2006-01-22 20:58             ` Jan Koss
@ 2006-01-22 21:32               ` Anton Altaparmakov
  2006-01-22 22:05                 ` Jan Koss
  0 siblings, 1 reply; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-22 21:32 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel

Hi,

On Sun, 22 Jan 2006, Jan Koss wrote:
> >(They access the block device directly, completely
> >bypassing the page cache so you are breaking cache coherency and are 100%
> >broken by design.)
>
> Oh... I thought that start from 2.4.x there are no separate implementation
> of working with blocks and pages, when you read block, kernel read whole page,
> am I wrong?

There is a very big difference. If you do sb_bread() you are reading a
block from the block device. And yes, this block is attached to a page,
but it is a page belonging to the block device address space mapping.
You cannot do anything to this block other than read/write it.

If you use the page cache to access the contents of a file, then that
file (or more precisely the inode of that file) will have an address
space mapping of its own, completely independent of the address space
mapping of the block device inode. Those pages will (or will not) have
buffers attached to them (your get_block() callback is there exactly to
allow the buffers to be created and mapped if they are not there). Those
buffers will be part of the file page cache page, thus part of the
inode's address space mapping, and those buffers have no meaning other
than to say "the data in this part of the page belongs to block device
so-and-so and to block number so-and-so on that block device". So you
can change b_blocknr on those buffers to your heart's content (well, you
need to observe the necessary locking so buffers under i/o don't get
screwed) and that is no problem.

Note that the buffers from the block device address space mapping are
COMPLETELY separate from the buffers from a file inode address space
mapping.
So writes from one are NOT seen in the other, and you can NEVER mix the
two forms of i/o and expect to have a working file system. You will get
random results and tons of weird data corruption that way.

> > They only way to help you
> > is to see your whole file system code
>
> If we need some handhold for discussion, lets talk about minix v.1
> (my file system derive from this code).
> Lets suppose I want make algorigth of allocation blocks in
> fs/minix/bitmap.c: minix_new_block more inteligent.
>
> I should say that minix code use sb_bread/brelse and work with pages (for
> example fs/minix/dir.c).

Er, not on current kernels:

	$ grep bread linux-2.6/fs/minix/*
	bitmap.c:	*bh = sb_bread(sb, block);
	bitmap.c:	*bh = sb_bread(sb, block);
	inode.c:	if (!(bh = sb_bread(s, 1)))
	inode.c:	if (!(sbi->s_imap[i]=sb_bread(s, block)))
	inode.c:	if (!(sbi->s_zmap[i]=sb_bread(s, block)))
	itree_common.c:	bh = sb_bread(sb, block_to_cpu(p->key));
	itree_common.c:	bh = sb_bread(inode->i_sb, nr);

Are you working on 2.4 by any chance? If you are writing a new fs I
would strongly recommend working against 2.6 kernels, otherwise you are
writing something that is already out of date...

The only thing minix in the current 2.6 kernel uses bread for is to read
the on-disk inodes themselves. It never uses it to access file data at
all, and I very much doubt that even old 2.4 kernels ever used bread for
anything that is not strictly metadata rather than file data.

> So instead of allocation one additional block,
> I want "realloc" blocks, so all file will occupy several consecutive blocks.
>
> And we stop on such code
> bh->b_blocknr = newblk;
> unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
> mark_buffer_dirty (bh);
>
> And question how should I get this _bh_, if I can not use sb_bread?

That depends entirely on which function / which call path you are in at
present.
Taking minix as an example, tell me the call path where you end up
wanting to do the above and I will tell you where to get the bh
from... (-:

Btw., don't think this is all that easy. If you want to keep whole files
rather than whole pages of buffers in consecutive blocks, you are in for
some very serious fun with multi-page locking and/or complete i/o
serialisation, i.e. when a write is happening, all other writes on the
same file will just block...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
  2006-01-22 21:32               ` Anton Altaparmakov
@ 2006-01-22 22:05                 ` Jan Koss
  2006-01-24 10:37                   ` Anton Altaparmakov
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Koss @ 2006-01-22 22:05 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: kernelnewbies, linux-fsdevel

On 1/23/06, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
...
> Note that the buffers from the block device address space mapping are
> COMPLETELY separate from the buffers from a file inode address space
> mapping. So writes from one are NOT seen in the other and you NEVER can
> mix the two forms of i/o and expect to have a working file system. You
> will get random results and tons of weird data corruption that way.

Thanks a lot, this has made several important things clear to me.

> That depends entirely in which function you are / which call path you are
> in at present. Taking minix as an example, tell me the call path where
> you end up wanting to do the above and I will tell you where to get the bh
> from... (-:

I was talking about 2.6.15. In fs/minix/bitmap.c there is
minix_new_block; we reach it from get_block in fs/minix/itree_common.c.

After analyzing the blocks<->file mapping, I want to move some blocks to
another location and update the page cache correspondingly. What should
I do?
* Re: fragmentation && blocks "realloc"
  2006-01-22 22:05                 ` Jan Koss
@ 2006-01-24 10:37                   ` Anton Altaparmakov
  2006-02-23 21:47                     ` Nate Diller
  0 siblings, 1 reply; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-24 10:37 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel

On Mon, 23 Jan 2006, Jan Koss wrote:
> On 1/23/06, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > That depends entirely in which function you are / which call path you are
> > in at present. Taking minix as an example, tell me the call path where
> > you end up wanting to do the above and I will tell you where to get the bh
> > from... (-:
>
> I told about 2.6.15.
>
> in fs/minix/bitmap.c there is minix_new_block we come in it from get_block in
> fs/minix/itree_common.c.
>
> After analizing blocks<->file I want move some blocks to another location
> and update page cache correspondingly, what should I do?

<Argh, I just spent ages writing an email and it got lost when the
internet connection died... I only have what was visible on the terminal
screen, so starting again on the rest...>

You cannot do what you want from such a low level because the upper
layers hold locks that you need. For example, a
readpage/writepage/prepare_write can be running concurrently with
get_block(), and even other instances of get_block() can be running at
the same time, so it would be unsafe to do any sort of reallocation
there. So you have to scrap that idea.

You could do it at a higher level, i.e. in the file ->write itself, but
again this introduces a lot of complexity into your file system.

Basically, what you are trying to do is much harder than you think and
involves a lot of work...

There is a possible alternative, however.
Your get_block function could take a reference on the inode (i_count),
set a flag in the file system specific inode ("need realloc"), and add
the inode to the queue of a "realloc daemon" for your fs, which is just
a kernel thread that runs periodically, say every five seconds. It takes
inodes one after the other from its queue, then takes all the locks
necessary to do this (e.g. i_mutex on the inode as well as
i_alloc_sem/i_alloc_mutex - whatever it is called now). Note you will
probably need an extra lock to prevent entry into readpage/writepage
whilst this is happening: your readpage/writepage takes that lock for
reading whilst your daemon takes it for writing, so multiple
read/writepages can run simultaneously but your daemon runs exclusive.

Then, if the inode is marked "need realloc", the daemon allocates a
contiguous chunk of space equal to the file size, clears the
"need realloc" bit, and does the reallocation by starting at the first
page (index 0) and working upwards, getting each page (warning: deadlock
is possible with a readpage or writepage holding that page's lock while
blocked on your "realloc lock", so maybe trylock, and if that fails,
abort and requeue the inode at the end of the daemon's queue). Then,
when you have a page, loop over its buffers, and for each buffer move it
from the old allocation to the new one as I described earlier (i.e. just
change b_blocknr, invalidate the underlying metadata, mark the buffer
dirty).

That or something similar should work with minimal impact on your
existing fs code. And it has the huge benefit of performing the reallocs
in the background. Otherwise your original idea would be disastrous for
performance. Imagine an 8G file that you are appending data to.
Every time you append a new block you may end up having to reallocate
the file from inside your get_block (you don't know that more writes are
coming in a second), and each time it will take a few minutes, so each
little write will hang the system for a few minutes - hardly what you
want... And the daemon at least batches things in 5 second intervals, so
multiple "need realloc" settings on an inode will be done in one go
every 5 seconds.

You know, if it were that easy to keep fragmentation close or even equal
to zero at all times without impact on performance, all file systems
would already be doing it. (-;

Hope this gives you a starting point if nothing else.

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
  2006-01-24 10:37                   ` Anton Altaparmakov
@ 2006-02-23 21:47                     ` Nate Diller
  0 siblings, 0 replies; 13+ messages in thread
From: Nate Diller @ 2006-02-23 21:47 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: Jan Koss, kernelnewbies, linux-fsdevel

On 1/24/06, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> On Mon, 23 Jan 2006, Jan Koss wrote:
> > On 1/23/06, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > > That depends entirely in which function you are / which call path you are
> > > in at present. Taking minix as an example, tell me the call path where
> > > you end up wanting to do the above and I will tell you where to get the bh
> > > from... (-:
> >
> > I told about 2.6.15.
> >
> > in fs/minix/bitmap.c there is minix_new_block we come in it from get_block in
> > fs/minix/itree_common.c.
> >
> > After analizing blocks<->file I want move some blocks to another location
> > and update page cache correspondingly, what should I do?
>
> <Argh I just spent ages writing an email and it got lost when the internet
> connection died... I only have what was visible on the terminal screen,
> so starting again on the rest...>
>
> You cannot do what you want from such a low level because the upper layers
> hold locks that you need. For example a readpage/writepage/prepare_write
> can be running concurrently with get_block() and even other instances of
> get_block() can be running at the same time and it would then be unsafe to
> do any sort of reallocation. So you have to scrap that idea.
>
> You could do it in higher up levels, i.e. in file ->write itself but again
> this introduces a lot of complexity to your file system.
>
> Basically what you are trying to so is much harder than you think and
> involves a lot of work...
>
> There is a possible alternative however.
> Your get_block function could
> take a reference on the inode (i_count), set a flag in the file system
> specific inode "need realloc" and add the inode to a queue of a
> "realloc-demon" for your fs which is just a kernel thread which will run
> periodically, say every five seconds, and it will take inodes one after
> the other from its queue, then take all necessary locks so you can do this
> (e.g i_mutex on the inode as well as i_alloc_sem/i_alloc_mutex - whatever
> it is called now) - note you will probably need an extra lock to prevent
> entry into readpage/writepage whilst this is happening and your
> readpage/writepage will need to take that lock for reading whilst your
> daemon takes it for writing so multiple read/writepage can run
> simultaneously but your deamon runs exclusive.
>
> Then, if the inode is marked "need realloc" it will allocate a contiguous
> chunk of space equal to the file size, clear the "need-realloc" bit, do
> the reallocation by starting at the first page (index 0) and working
> upwards, getting it (warning: deadlock possible with a read or writepage
> holding that page's lock and blocked on your "realloc lock" so maybe
> trylock and if fails abort and requeue the inode to the daemon at the end
> of the queue), then when you have a page, loop around its buffers and for
> each buffer move it from the old allocation to the new one as I described
> earlier (i.e. just change b_blocknr, invalidate underlying metadata, mark
> the buffer dirty).
>
> That or something simillar should work with minimal impact on your
> existing fs code. And it has the huge benefit or performing the reallocs
> in the back ground. Otherwise your original idea would be disastrous to
> performance. Imagine a 8G file that you are appending data to.
> Every time you append a new block you may end up having to reallocate the file
> from inside your get_block (you don't know that more writes are coming in
> a second) and each time it will take a few minutes so each little write
> will hang the system for a few minutes - hardly what you want...
>
> And the daemon at least batches things in 5 second intervals so multiple
> "need realloc" settings on an inode will be done in one go every 5
> seconds.
>
> You know, if it was that easy to keep fragmentation close or even equal to
> zero at all times without impact on performance, all file systems would be
> already doing that. (-;

Well, the above is a reasonable solution, but if you were willing to put
up with more allocation and flush complexity, you could try a strict
allocate-on-flush design. Just read in a page and promptly unmap it;
then you don't have to worry about the CPU overhead until flush time,
when you map all the pages and write them out. That would give the
lowest amount of fragmentation you can get without a repacker of some
sort.

It's not even all that hard unless you try supporting file holes,
transactions, non-4k blocks, or other complexities. There are also
potential OOM issues if you are using something as old-fashioned as
bitmaps in your allocation code and need to read them in under memory
pressure...

NATE
* Re: fragmentation && blocks "realloc"
  2006-01-20 15:46   ` Jan Koss
  2006-01-20 19:22     ` Jan Koss
@ 2006-01-20 20:04     ` Anton Altaparmakov
  1 sibling, 0 replies; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-20 20:04 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel

On Fri, 20 Jan 2006, Jan Koss wrote:
> In fact, I expected yes for the first abbility and no for the second :)
>
> Now code looks like:
> bh = sb_bread(sb, oldblock);
> if (!bh)
> goto err;
> bh->b_blocknr = newblk;
> mark_buffer_dirty (bh);
> unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);

No, no, no!!! You cannot do this. You are not using the page cache for
your file system (why not?), so you cannot remap buffers like I
suggested. Even if you could, your code is wrong: you need the
unmap_underlying_metadata() _before_ mark_buffer_dirty().

> Let's suppose such test case:
> after situation, which I described in the first email,
> user resize file and new size 5 blocks,
> and there are no free blocks except 2 blocks which we deallocated in
> the frist email,
> so we have to allocate them.
>
> When I reproduced this test case, I got such messages from kernel:
> __find_get_block_slow failed block=oldblock...
>
> So as I can see I missed something in "art of changing b_blocknr".
>
> Error in __find_get_block_slow may happen only if all buffers on page mapped.
>
> May be this is because of buffer_head change b_blocknr, but didn't
> change b_this_page?

You cannot touch b_this_page on buffers you access via sb_bread(). The
correct solution for a file system like yours would be to copy the
buffer data to the new buffer, write that, and release the old one -
i.e. your first suggestion - and do not touch b_blocknr or b_this_page.
And you do not need to call unmap_underlying_metadata() either, or
invalidate any pages. You are working with the block device directly,
bypassing the per-file page cache, thus you cannot do anything to the
buffers at all other than read/write them.
It would be far better if you started using the page cache (via
->readpage, ->writepage, and probably ->prepare_write and ->commit_write
as well); then, from inside
readpage/writepage/prepare_write/commit_write, you can do with the
buffers as I suggested...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
end of thread, other threads:[~2006-02-23 21:47 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-20 11:47 fragmentation && blocks "realloc" Jan Koss
2006-01-20 13:34 ` Anton Altaparmakov
2006-01-20 15:46   ` Jan Koss
2006-01-20 19:22     ` Jan Koss
2006-01-20 20:11       ` Anton Altaparmakov
2006-01-21  9:42         ` Jan Koss
2006-01-21 20:28           ` Anton Altaparmakov
2006-01-22 20:58             ` Jan Koss
2006-01-22 21:32               ` Anton Altaparmakov
2006-01-22 22:05                 ` Jan Koss
2006-01-24 10:37                   ` Anton Altaparmakov
2006-02-23 21:47                     ` Nate Diller
2006-01-20 20:04     ` Anton Altaparmakov