* fragmentation && blocks "realloc"
@ 2006-01-20 11:47 Jan Koss
2006-01-20 13:34 ` Anton Altaparmakov
0 siblings, 1 reply; 13+ messages in thread
From: Jan Koss @ 2006-01-20 11:47 UTC (permalink / raw)
To: kernelnewbies; +Cc: linux-fsdevel
Hello.
Let's suppose we have a file which consists of two blocks,
the user resizes the file, and now we need 4 blocks.
There are no 2 free blocks next to these two blocks,
so instead of allocating 2 additional blocks somewhere else,
I want to allocate a chunk of 4 consecutive blocks.
The main problem is choosing a way to invalidate the "old" blocks
and copy their data to the new buffers.
How is this possible on Linux?
Something like:
struct buffer_head *oldbh, *newbh;
memcpy(newbh->b_data, oldbh->b_data, sb->s_blocksize);
block_invalidatepage(oldbh->b_this_page, ...);
[ is block_invalidatepage the right choice? ]
Or is it possible to just change b_blocknr?
* Re: fragmentation && blocks "realloc"
2006-01-20 11:47 fragmentation && blocks "realloc" Jan Koss
@ 2006-01-20 13:34 ` Anton Altaparmakov
2006-01-20 15:46 ` Jan Koss
0 siblings, 1 reply; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-20 13:34 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel
On Fri, 2006-01-20 at 14:47 +0300, Jan Koss wrote:
> Hello.
>
> Let's suppose we have a file which consists of two blocks,
> the user resizes the file, and now we need 4 blocks.
>
> There are no 2 free blocks next to these two blocks,
> so instead of allocating 2 additional blocks somewhere else,
> I want to allocate a chunk of 4 consecutive blocks.
>
> The main problem is choosing a way to invalidate the "old" blocks
> and copy their data to the new buffers.
>
> How is this possible on Linux?
>
> Something like:
> struct buffer_head *oldbh, *newbh;
> memcpy(newbh->b_data, oldbh->b_data, sb->s_blocksize);
> block_invalidatepage(oldbh->b_this_page, ...);
No need to invalidate or copy anything as long as you are working inside
a file system driver and those buffers are attached to the page cache of
a file.
> Or is it possible to just change b_blocknr?
Yes, just change b_blocknr and mark the buffer dirty so it gets written
out to the new location, or indeed you can do the write (or the
submission thereof) yourself if you want.
Note that since you are effectively "allocating" the buffer(s), after
you have done the block allocation on your file system and updated
bh->b_blocknr, you need to call unmap_underlying_metadata(bh->b_bdev,
bh->b_blocknr); for each block before you write it out, or your write
could get trampled on by a different write.
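As a userspace toy model of that sequence (nothing here is kernel code; the struct and helper names are invented for illustration), the point is that any stale cached alias of the destination block must be discarded before the relocated buffer is dirtied:

```c
#include <stddef.h>

/* Toy "buffer": a chunk of cached data tagged with its destination block. */
struct toy_bh {
    long b_blocknr;   /* disk block this buffer will be written to */
    int  b_dirty;     /* 1 if writeback still has to send it to disk */
};

/* Stand-in for unmap_underlying_metadata(): discard any stale cached
 * alias of `block` so an old dirty buffer cannot later overwrite the
 * data we are about to put there. */
static void toy_unmap_alias(struct toy_bh *cache, size_t n, long block)
{
    for (size_t i = 0; i < n; i++)
        if (cache[i].b_blocknr == block)
            cache[i].b_dirty = 0;
}

/* Relocate a buffer: retarget it, kill stale aliases of the new block,
 * then dirty it so writeback sends it to the new location. */
static void toy_relocate(struct toy_bh *bh, long newblk,
                         struct toy_bh *bdev_cache, size_t n)
{
    bh->b_blocknr = newblk;
    toy_unmap_alias(bdev_cache, n, newblk);
    bh->b_dirty = 1;
}
```

With real, asynchronous writeback, a stale dirty alias left in place could win the race and overwrite the relocated data; that is the "trampled on" scenario described above.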
And of course do not forget to deallocate the two blocks you just freed
in your fs... (-:
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
2006-01-20 13:34 ` Anton Altaparmakov
@ 2006-01-20 15:46 ` Jan Koss
2006-01-20 19:22 ` Jan Koss
2006-01-20 20:04 ` Anton Altaparmakov
0 siblings, 2 replies; 13+ messages in thread
From: Jan Koss @ 2006-01-20 15:46 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: kernelnewbies, linux-fsdevel
In fact, I expected "yes" for the first possibility and "no" for the second :)
Now the code looks like:
bh = sb_bread(sb, oldblock);
if (!bh)
goto err;
bh->b_blocknr = newblk;
mark_buffer_dirty(bh);
unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
Let's consider this test case: after the situation I described in the
first email, the user resizes the file so the new size is 5 blocks,
and there are no free blocks except the 2 blocks we deallocated in the
first email, so we have to allocate them.
When I reproduced this test case, I got messages like this from the kernel:
__find_get_block_slow failed block=oldblock...
So as far as I can see I missed something in the "art of changing b_blocknr".
The error in __find_get_block_slow can happen only if all buffers on
the page are mapped.
Maybe this is because the buffer_head changed b_blocknr but didn't
change b_this_page?
On 1/20/06, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> On Fri, 2006-01-20 at 14:47 +0300, Jan Koss wrote:
> > Hello.
> >
> > Let's suppose we have a file which consists of two blocks,
> > the user resizes the file, and now we need 4 blocks.
> >
> > There are no 2 free blocks next to these two blocks,
> > so instead of allocating 2 additional blocks somewhere else,
> > I want to allocate a chunk of 4 consecutive blocks.
> >
> > The main problem is choosing a way to invalidate the "old" blocks
> > and copy their data to the new buffers.
> >
> > How is this possible on Linux?
> >
> > Something like:
> > struct buffer_head *oldbh, *newbh;
> > memcpy(newbh->b_data, oldbh->b_data, sb->s_blocksize);
> > block_invalidatepage(oldbh->b_this_page, ...);
>
> No need to invalidate or copy anything as long as you are working
> inside a file system driver and those buffers are attached to the page
> cache of a file.
>
> > Or is it possible to just change b_blocknr?
>
> Yes, just change b_blocknr and mark the buffer dirty so it gets
> written out to the new location, or indeed you can do the write (or
> the submission thereof) yourself if you want.
>
> Note that since you are effectively "allocating" the buffer(s), after
> you have done the block allocation on your file system and updated
> bh->b_blocknr, you need to call unmap_underlying_metadata(bh->b_bdev,
> bh->b_blocknr); for each block before you write it out, or your write
> could get trampled on by a different write.
>
> And of course do not forget to deallocate the two blocks you just
> freed in your fs... (-:
>
> Best regards,
>
> Anton
> --
> Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
> Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
> Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
> WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
>
>
* Re: fragmentation && blocks "realloc"
2006-01-20 15:46 ` Jan Koss
@ 2006-01-20 19:22 ` Jan Koss
2006-01-20 20:11 ` Anton Altaparmakov
2006-01-20 20:04 ` Anton Altaparmakov
1 sibling, 1 reply; 13+ messages in thread
From: Jan Koss @ 2006-01-20 19:22 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: kernelnewbies, linux-fsdevel
In comparison with this:
> bh = sb_bread(sb, oldblock);
> if (!bh)
> goto err;
> bh->b_blocknr = newblk;
> mark_buffer_dirty(bh);
> unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
>
code like this didn't cause any "warnings":
struct buffer_head *newbh;
bh = sb_bread(sb, oldblock);
newbh = sb_bread(sb, newblock);
if (!bh || !newbh)
goto err;
memcpy(newbh->b_data, bh->b_data, sb->s_blocksize);
mark_buffer_dirty(newbh);
brelse(bh);
brelse(newbh);
invalidate_inode_buffers(inode);
but it isn't optimal;
what you suggest is much better, but...
* Re: fragmentation && blocks "realloc"
2006-01-20 15:46 ` Jan Koss
2006-01-20 19:22 ` Jan Koss
@ 2006-01-20 20:04 ` Anton Altaparmakov
1 sibling, 0 replies; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-20 20:04 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel
On Fri, 20 Jan 2006, Jan Koss wrote:
> In fact, I expected "yes" for the first possibility and "no" for the second :)
>
> Now the code looks like:
> bh = sb_bread(sb, oldblock);
> if (!bh)
> goto err;
> bh->b_blocknr = newblk;
> mark_buffer_dirty(bh);
> unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
No, no, no!!! You cannot do this. You are not using the page cache for
your file system (why not?), so you cannot remap buffers as I suggested.
Even if you could, your code is wrong: you need the
unmap_underlying_metadata() _before_ the mark_buffer_dirty().
> Let's consider this test case: after the situation I described in the
> first email, the user resizes the file so the new size is 5 blocks,
> and there are no free blocks except the 2 blocks we deallocated in the
> first email, so we have to allocate them.
>
> When I reproduced this test case, I got messages like this from the kernel:
> __find_get_block_slow failed block=oldblock...
>
> So as far as I can see I missed something in the "art of changing b_blocknr".
>
> The error in __find_get_block_slow can happen only if all buffers on
> the page are mapped.
>
> Maybe this is because the buffer_head changed b_blocknr but didn't
> change b_this_page?
You cannot touch b_this_page on buffers you access via sb_bread(). The
correct solution for a file system like yours would be to copy the
buffer data to the correct buffer, write that, and release the old one,
i.e. your first suggestion: do not touch b_blocknr or b_this_page. And
you do not need to call unmap_underlying_metadata() either, or
invalidate any pages.
You are working with the block device directly, bypassing the per-file
page cache, so you cannot do anything with the buffers at all other
than read/write them.
It would be far better if you started using the page cache (via
->readpage, ->writepage, and probably ->prepare_write and
->commit_write as well); then from inside
readpage/writepage/prepare_write/commit_write you can do with the
buffers as I suggested...
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
2006-01-20 19:22 ` Jan Koss
@ 2006-01-20 20:11 ` Anton Altaparmakov
2006-01-21 9:42 ` Jan Koss
0 siblings, 1 reply; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-20 20:11 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel
On Fri, 20 Jan 2006, Jan Koss wrote:
> In comparison with this:
> > bh = sb_bread(sb, oldblock);
> > if (!bh)
> > goto err;
> > bh->b_blocknr = newblk;
> > mark_buffer_dirty(bh);
> > unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
>
> code like this didn't cause any "warnings":
> struct buffer_head *newbh;
>
> bh = sb_bread(sb, oldblock);
> newbh = sb_bread(sb, newblock);
> if (!bh || !newbh)
> goto err;
>
> memcpy(newbh->b_data, bh->b_data, sb->s_blocksize);
> mark_buffer_dirty(newbh);
> brelse(bh);
> brelse(newbh);
> invalidate_inode_buffers(inode);
Yes, that is almost correct. Although it is wrong. (-;
You do not want the invalidate_inode_buffers() call. It makes no sense
for your fs at all, given how you are dealing with the buffers via
sb_bread()/brelse()... Your method never attaches buffers to the inode,
so there is no point in trying to invalidate anything. It will all just
work fine. (Unless you have omitted to say things about your fs that
are important. Why don't you show all your code rather than just those
snippets, and then proper advice can be given...)
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
2006-01-20 20:11 ` Anton Altaparmakov
@ 2006-01-21 9:42 ` Jan Koss
2006-01-21 20:28 ` Anton Altaparmakov
0 siblings, 1 reply; 13+ messages in thread
From: Jan Koss @ 2006-01-21 9:42 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: kernelnewbies, linux-fsdevel
> It will all just work fine. (Unless you have omitted to say things
> about your fs that are important. Why don't you show all your code
> rather than just those snippets, and then proper advice can be
> given...)
The fs is just a simple analog of ufs/ext2/minix/sysv. It is block
oriented, and I suppose that working with pages instead of blocks
makes it more complicated than it should be.
> You do not want the invalidate_inode_buffers() call. It makes no sense
Great, we have reached the point.
Yes, my file system is based on sb_bread/brelse usage.
Like any ordinary file system, mine implements readpage and writepage,
similar to:
static int sysv_writepage(struct page *page, struct writeback_control *wbc)
{
	return block_write_full_page(page, get_block, wbc);
}
static int sysv_readpage(struct file *file, struct page *page)
{
	return block_read_full_page(page, get_block);
}
get_block does something like:
map_bh(...)
So, when we "realloc" blocks, what happens to these "old" mapped (or
otherwise used) blocks?
How can I prevent the "old" blocks from being used instead of the
"new" ones? Or, if I mark both old and new blocks as dirty, will
everything be all right?
--
Kernelnewbies: Help each other learn about the Linux kernel.
Archive: http://mail.nl.linux.org/kernelnewbies/
FAQ: http://kernelnewbies.org/faq/
* Re: fragmentation && blocks "realloc"
2006-01-21 9:42 ` Jan Koss
@ 2006-01-21 20:28 ` Anton Altaparmakov
2006-01-22 20:58 ` Jan Koss
0 siblings, 1 reply; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-21 20:28 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel
On Sat, 21 Jan 2006, Jan Koss wrote:
> > It will all just work fine. (Unless you have omitted to say things
> > about your fs that are important. Why don't you show all your code
> > rather than just those snippets, and then proper advice can be
> > given...)
>
> The fs is just a simple analog of ufs/ext2/minix/sysv. It is block
All of the above are page cache users, not block device oriented at all.
> oriented, and I suppose that working with pages instead of blocks
> makes it more complicated than it should be.
It also makes it very slow not to use the page cache...
> > You do not want the invalidate_inode_buffers() call. It makes no sense
>
> Great, we have reached the point.
> Yes, my file system is based on sb_bread/brelse usage.
Right, so nothing like the other file systems you compare yourself to,
then. None of them is sb_bread/brelse based.
> Like any ordinary file system, mine implements readpage and writepage,
> similar to:
> static int sysv_writepage(struct page *page, struct writeback_control *wbc)
> {
> 	return block_write_full_page(page, get_block, wbc);
> }
> static int sysv_readpage(struct file *file, struct page *page)
> {
> 	return block_read_full_page(page, get_block);
> }
>
> get_block does something like:
> map_bh(...)
Er, so you are page cache based and not sb_bread/brelse based at all.
I think you are confused... (-;
> So, when we "realloc" blocks, what happens to these "old" mapped (or
> otherwise used) blocks?
>
> How can I prevent the "old" blocks from being used instead of the
> "new" ones? Or, if I mark both old and new blocks as dirty, will
> everything be all right?
You cannot do the reallocation using your method if the above page
cache functions are used like that by your fs. You need to do it the
way I first showed you, i.e. without sb_bread/brelse, as those make no
sense whatsoever for you. (They access the block device directly,
completely bypassing the page cache, so you are breaking cache
coherency and are 100% broken by design.)
You seem to be extremely confused, I am afraid. The only way to help
you is to see your whole file system code, unless you start becoming
clearer about what you are really doing, so that you don't keep making
contradictory statements in two successive sentences...
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
2006-01-21 20:28 ` Anton Altaparmakov
@ 2006-01-22 20:58 ` Jan Koss
2006-01-22 21:32 ` Anton Altaparmakov
0 siblings, 1 reply; 13+ messages in thread
From: Jan Koss @ 2006-01-22 20:58 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: kernelnewbies, linux-fsdevel
Hello.
> (They access the block device directly, completely bypassing the page
> cache, so you are breaking cache coherency and are 100% broken by
> design.)
Oh... I thought that starting from 2.4.x there is no separate
implementation for working with blocks and pages: when you read a
block, the kernel reads the whole page. Am I wrong?
> The only way to help you is to see your whole file system code
If we need something concrete for discussion, let's talk about minix v1
(my file system derives from that code).
Let's suppose I want to make the block allocation algorithm in
fs/minix/bitmap.c:minix_new_block() more intelligent.
I should say that the minix code uses both sb_bread/brelse and pages
(for example fs/minix/dir.c).
So instead of allocating one additional block, I want to "realloc"
blocks, so that the whole file occupies consecutive blocks.
And we stopped at code like this:
bh->b_blocknr = newblk;
unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
mark_buffer_dirty(bh);
The question is: how should I get this _bh_, if I cannot use sb_bread?
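The "more intelligent minix_new_block" part of the question reduces to scanning the free-block bitmap for a run of N consecutive free blocks. A plain userspace sketch of that scan (not the actual minix code; the function name and the bit convention, set bit meaning "in use", are assumptions):

```c
#include <stddef.h>

/* Return the first index of a run of `want` consecutive free blocks in
 * a bitmap where a set bit marks a used block, or -1 if no run exists. */
static long find_free_run(const unsigned char *bitmap, size_t nbits,
                          size_t want)
{
    size_t run = 0;
    for (size_t i = 0; i < nbits; i++) {
        int used = (bitmap[i / 8] >> (i % 8)) & 1;
        run = used ? 0 : run + 1;
        if (run == want)
            return (long)(i - want + 1);   /* start of the run */
    }
    return -1;   /* caller falls back to scattered allocation */
}
```

With blocks 0, 2 and 3 in use in an 8-block bitmap (0x0D), a request for a run of 3 lands at block 4; a request for 5 fails and the caller would fall back to single-block allocation.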
* Re: fragmentation && blocks "realloc"
2006-01-22 20:58 ` Jan Koss
@ 2006-01-22 21:32 ` Anton Altaparmakov
2006-01-22 22:05 ` Jan Koss
0 siblings, 1 reply; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-22 21:32 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel
Hi,
On Sun, 22 Jan 2006, Jan Koss wrote:
> > (They access the block device directly, completely bypassing the
> > page cache, so you are breaking cache coherency and are 100% broken
> > by design.)
>
> Oh... I thought that starting from 2.4.x there is no separate
> implementation for working with blocks and pages: when you read a
> block, the kernel reads the whole page. Am I wrong?
There is a very big difference. If you do sb_bread() you are reading a
block from the block device. And yes, this block is attached to a page,
but it is a page belonging to the block device's address space mapping.
You cannot do anything to this block other than read/write it.
If you use the page cache to access the contents of a file, then that
file (or more precisely the inode of that file) will have an address
space mapping of its own, completely independent of the address space
mapping of the block device inode. Those pages will (or will not) have
buffers attached to them (your get_block() callback is there exactly to
allow the buffers to be created and mapped if they are not there).
Those buffers will be part of the file's page cache page, thus part of
the inode's address space mapping, and those buffers have no meaning
other than to say "the data in this part of the page belongs to block
device so-and-so and to block number so-and-so on that block device".
So you can change the b_blocknr on those buffers to your heart's
content (well, you need to observe the necessary locking so buffers
under i/o don't get screwed up) and that is no problem.
Note that the buffers from the block device address space mapping are
COMPLETELY separate from the buffers from a file inode address space
mapping. So writes from one are NOT seen in the other, and you can
NEVER mix the two forms of i/o and expect to have a working file
system. You will get random results and tons of weird data corruption
that way.
> > The only way to help you is to see your whole file system code
>
> If we need something concrete for discussion, let's talk about minix
> v1 (my file system derives from that code).
> Let's suppose I want to make the block allocation algorithm in
> fs/minix/bitmap.c:minix_new_block() more intelligent.
>
> I should say that the minix code uses both sb_bread/brelse and pages
> (for example fs/minix/dir.c).
Er, not on current kernels:
$ grep bread linux-2.6/fs/minix/*
bitmap.c: *bh = sb_bread(sb, block);
bitmap.c: *bh = sb_bread(sb, block);
inode.c: if (!(bh = sb_bread(s, 1)))
inode.c: if (!(sbi->s_imap[i]=sb_bread(s, block)))
inode.c: if (!(sbi->s_zmap[i]=sb_bread(s, block)))
itree_common.c: bh = sb_bread(sb, block_to_cpu(p->key));
itree_common.c: bh = sb_bread(inode->i_sb, nr);
Are you working on 2.4 by any chance? If you are writing a new fs I
would strongly recommend working on 2.6 kernels, otherwise you are
writing something that is already out of date...
The only thing minix in the current 2.6 kernel uses bread for is to
read the on-disk inodes themselves. It never uses it to access file
data at all, and I very much doubt that even old 2.4 kernels ever used
bread for anything that is not strictly metadata rather than file data.
> So instead of allocating one additional block, I want to "realloc"
> blocks, so that the whole file occupies consecutive blocks.
>
> And we stopped at code like this:
> bh->b_blocknr = newblk;
> unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
> mark_buffer_dirty(bh);
>
> The question is: how should I get this _bh_, if I cannot use sb_bread?
That depends entirely on which function / which call path you are in at
present. Taking minix as an example, tell me the call path where you
end up wanting to do the above and I will tell you where to get the bh
from... (-:
Btw, don't think this is all that easy. If you want to keep whole files
rather than whole pages of buffers in consecutive blocks, you are in
for some very serious fun with multi-page locking and/or complete i/o
serialisation, i.e. while a write is happening all other writes on the
same file will just block...
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
2006-01-22 21:32 ` Anton Altaparmakov
@ 2006-01-22 22:05 ` Jan Koss
2006-01-24 10:37 ` Anton Altaparmakov
0 siblings, 1 reply; 13+ messages in thread
From: Jan Koss @ 2006-01-22 22:05 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: kernelnewbies, linux-fsdevel
On 1/23/06, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
...
> Note that the buffers from the block device address space mapping are
> COMPLETELY separate from the buffers from a file inode address space
> mapping. So writes from one are NOT seen in the other, and you can
> NEVER mix the two forms of i/o and expect to have a working file
> system. You will get random results and tons of weird data corruption
> that way.
>
Thanks a lot, this clears up several important things for me.
> That depends entirely on which function / which call path you are in
> at present. Taking minix as an example, tell me the call path where
> you end up wanting to do the above and I will tell you where to get
> the bh from... (-:
>
I was talking about 2.6.15.
In fs/minix/bitmap.c there is minix_new_block(); we come into it from
get_block in fs/minix/itree_common.c.
After analyzing the block<->file mapping, I want to move some blocks to
another location and update the page cache correspondingly. What should
I do?
* Re: fragmentation && blocks "realloc"
2006-01-22 22:05 ` Jan Koss
@ 2006-01-24 10:37 ` Anton Altaparmakov
2006-02-23 21:47 ` Nate Diller
0 siblings, 1 reply; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-24 10:37 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel
On Mon, 23 Jan 2006, Jan Koss wrote:
> On 1/23/06, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > That depends entirely on which function / which call path you are
> > in at present. Taking minix as an example, tell me the call path
> > where you end up wanting to do the above and I will tell you where
> > to get the bh from... (-:
>
> I was talking about 2.6.15.
>
> In fs/minix/bitmap.c there is minix_new_block(); we come into it from
> get_block in fs/minix/itree_common.c.
>
> After analyzing the block<->file mapping, I want to move some blocks
> to another location and update the page cache correspondingly. What
> should I do?
<Argh, I just spent ages writing an email and it got lost when the
internet connection died... I only have what was visible on the
terminal screen, so starting again on the rest...>
You cannot do what you want from such a low level, because the upper
layers hold locks that you need. For example, a
readpage/writepage/prepare_write can be running concurrently with
get_block(), and even other instances of get_block() can be running at
the same time, and it would then be unsafe to do any sort of
reallocation. So you have to scrap that idea.
You could do it at higher levels, i.e. in the file ->write itself, but
again this introduces a lot of complexity to your file system.
Basically, what you are trying to do is much harder than you think and
involves a lot of work...
There is a possible alternative, however. Your get_block function could
take a reference on the inode (i_count), set a "need realloc" flag in
the file system specific inode, and add the inode to the queue of a
"realloc daemon" for your fs, which is just a kernel thread that runs
periodically, say every five seconds, and takes inodes one after the
other from its queue, then takes all the necessary locks so you can do
this (e.g. i_mutex on the inode as well as i_alloc_sem/i_alloc_mutex -
whatever it is called now). Note you will probably need an extra lock
to prevent entry into readpage/writepage whilst this is happening: your
readpage/writepage will need to take that lock for reading whilst your
daemon takes it for writing, so multiple read/writepages can run
simultaneously but your daemon runs exclusively.
Then, if the inode is marked "need realloc", it will allocate a
contiguous chunk of space equal to the file size, clear the
"need realloc" bit, and do the reallocation by starting at the first
page (index 0) and working upwards, getting each page (warning:
deadlock is possible with a readpage or writepage holding that page's
lock and blocked on your "realloc lock", so maybe trylock and, if that
fails, abort and requeue the inode at the end of the daemon's queue).
Then, when you have a page, loop around its buffers, and for each
buffer move it from the old allocation to the new one as I described
earlier (i.e. just change b_blocknr, unmap the underlying metadata,
mark the buffer dirty).
That or something similar should work with minimal impact on your
existing fs code. And it has the huge benefit of performing the
reallocations in the background. Otherwise your original idea would be
disastrous for performance. Imagine an 8G file that you are appending
data to. Every time you append a new block you may end up having to
reallocate the file from inside your get_block (you don't know that
more writes are coming in a second), and each time it will take a few
minutes, so each little write will hang the system for a few minutes -
hardly what you want...
And the daemon at least batches things in 5 second intervals, so
multiple "need realloc" settings on an inode will be done in one go
every 5 seconds.
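The flag-and-queue shape of that daemon can be modeled in a few lines of userspace C (a single-threaded toy; every name is invented, and the locking described above is exactly what this sketch leaves out):

```c
#include <stddef.h>

struct toy_inode {
    int need_realloc;           /* set from the get_block path */
    int realloc_count;          /* how many compactions actually ran */
    struct toy_inode *next;     /* daemon queue linkage */
};

/* get_block side: flag the inode and queue it at most once, however
 * many blocks get allocated before the next daemon pass. */
static void toy_mark_for_realloc(struct toy_inode *in, struct toy_inode **q)
{
    if (!in->need_realloc) {
        in->need_realloc = 1;
        in->next = *q;
        *q = in;
    }
}

/* Daemon side: one periodic pass drains the queue; repeated flaggings
 * between passes collapse into a single reallocation. */
static void toy_realloc_daemon_pass(struct toy_inode **q)
{
    while (*q) {
        struct toy_inode *in = *q;
        *q = in->next;
        in->need_realloc = 0;
        in->realloc_count++;    /* real code would move the blocks here */
    }
}
```

The design point the toy captures is the batching: flagging an already-flagged inode is a no-op, so a burst of appends between two daemon passes costs one reallocation, not one per write.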
You know, if it were that easy to keep fragmentation close or even
equal to zero at all times without impacting performance, all file
systems would already be doing it. (-;
Hope this gives you a starting point if nothing else.
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
2006-01-24 10:37 ` Anton Altaparmakov
@ 2006-02-23 21:47 ` Nate Diller
0 siblings, 0 replies; 13+ messages in thread
From: Nate Diller @ 2006-02-23 21:47 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: Jan Koss, kernelnewbies, linux-fsdevel
On 1/24/06, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> On Mon, 23 Jan 2006, Jan Koss wrote:
> > On 1/23/06, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > > That depends entirely on which function / which call path you are
> > > in at present. Taking minix as an example, tell me the call path
> > > where you end up wanting to do the above and I will tell you where
> > > to get the bh from... (-:
> >
> > I was talking about 2.6.15.
> >
> > In fs/minix/bitmap.c there is minix_new_block(); we come into it
> > from get_block in fs/minix/itree_common.c.
> >
> > After analyzing the block<->file mapping, I want to move some blocks
> > to another location and update the page cache correspondingly. What
> > should I do?
>
> <Argh, I just spent ages writing an email and it got lost when the
> internet connection died... I only have what was visible on the
> terminal screen, so starting again on the rest...>
>
> You cannot do what you want from such a low level, because the upper
> layers hold locks that you need. For example, a
> readpage/writepage/prepare_write can be running concurrently with
> get_block(), and even other instances of get_block() can be running
> at the same time, and it would then be unsafe to do any sort of
> reallocation. So you have to scrap that idea.
>
> You could do it at higher levels, i.e. in the file ->write itself,
> but again this introduces a lot of complexity to your file system.
>
> Basically, what you are trying to do is much harder than you think
> and involves a lot of work...
>
> There is a possible alternative, however. Your get_block function
> could take a reference on the inode (i_count), set a "need realloc"
> flag in the file system specific inode, and add the inode to the
> queue of a "realloc daemon" for your fs, which is just a kernel
> thread that runs periodically, say every five seconds, and takes
> inodes one after the other from its queue, then takes all the
> necessary locks so you can do this (e.g. i_mutex on the inode as
> well as i_alloc_sem/i_alloc_mutex - whatever it is called now). Note
> you will probably need an extra lock to prevent entry into
> readpage/writepage whilst this is happening: your readpage/writepage
> will need to take that lock for reading whilst your daemon takes it
> for writing, so multiple read/writepages can run simultaneously but
> your daemon runs exclusively.
>
> Then, if the inode is marked "need realloc", it will allocate a
> contiguous chunk of space equal to the file size, clear the
> "need realloc" bit, and do the reallocation by starting at the first
> page (index 0) and working upwards, getting each page (warning:
> deadlock is possible with a readpage or writepage holding that
> page's lock and blocked on your "realloc lock", so maybe trylock
> and, if that fails, abort and requeue the inode at the end of the
> daemon's queue). Then, when you have a page, loop around its
> buffers, and for each buffer move it from the old allocation to the
> new one as I described earlier (i.e. just change b_blocknr, unmap
> the underlying metadata, mark the buffer dirty).
>
> That or something similar should work with minimal impact on your
> existing fs code. And it has the huge benefit of performing the
> reallocations in the background. Otherwise your original idea would
> be disastrous for performance. Imagine an 8G file that you are
> appending data to. Every time you append a new block you may end up
> having to reallocate the file from inside your get_block (you don't
> know that more writes are coming in a second), and each time it will
> take a few minutes, so each little write will hang the system for a
> few minutes - hardly what you want...
>
> And the daemon at least batches things in 5 second intervals, so
> multiple "need realloc" settings on an inode will be done in one go
> every 5 seconds.
>
> You know, if it were that easy to keep fragmentation close or even
> equal to zero at all times without impacting performance, all file
> systems would already be doing it. (-;
Well, the above is a reasonable solution, but if you were willing to
put up with more allocation and flush complexity, you could try a
strict allocate-on-flush design. Just read in a page and promptly
unmap it; then you don't have to worry about CPU overhead until flush
time, when you map all the pages and write them out. That would
result in the lowest amount of fragmentation you can get without a
repacker of some sort. It's not even all that hard, unless you try
supporting file holes, transactions, non-4k blocks, or other
complexities. There are also potential OOM issues if you are using
something as old-fashioned as bitmaps in your allocation code and
need to read them in under memory pressure...
NATE
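The allocate-on-flush idea fits in a few lines as a userspace toy (invented names; a real implementation is where the complexity mentioned above lives): dirty pages stay unmapped until flush, and the flush maps them all into one consecutive run.

```c
#define TOY_UNMAPPED (-1L)

/* Write path: dirty the page but defer deciding its disk block. */
static void toy_write_page(long *page_block, int idx)
{
    page_block[idx] = TOY_UNMAPPED;
}

/* Flush path: hand every unmapped page the next block in one run.
 * `*next_free` is a toy bump allocator standing in for a real search
 * for contiguous free space (the part with the OOM caveats above). */
static void toy_flush(long *page_block, int npages, long *next_free)
{
    for (int i = 0; i < npages; i++)
        if (page_block[i] == TOY_UNMAPPED)
            page_block[i] = (*next_free)++;
}
```

Because no page is mapped before toy_flush() runs, the run is decided once, with the whole file's size known, which is exactly why fragmentation stays low without a repacker.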
end of thread, other threads:[~2006-02-23 21:47 UTC | newest]
Thread overview: 13+ messages
2006-01-20 11:47 fragmentation && blocks "realloc" Jan Koss
2006-01-20 13:34 ` Anton Altaparmakov
2006-01-20 15:46 ` Jan Koss
2006-01-20 19:22 ` Jan Koss
2006-01-20 20:11 ` Anton Altaparmakov
2006-01-21 9:42 ` Jan Koss
2006-01-21 20:28 ` Anton Altaparmakov
2006-01-22 20:58 ` Jan Koss
2006-01-22 21:32 ` Anton Altaparmakov
2006-01-22 22:05 ` Jan Koss
2006-01-24 10:37 ` Anton Altaparmakov
2006-02-23 21:47 ` Nate Diller
2006-01-20 20:04 ` Anton Altaparmakov