* fragmentation && blocks "realloc"
@ 2006-01-20 11:47 Jan Koss
  2006-01-20 13:34 ` Anton Altaparmakov
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Koss @ 2006-01-20 11:47 UTC (permalink / raw)
To: kernelnewbies; +Cc: linux-fsdevel

Hello.

Suppose we have a file that consists of two blocks, and the user resizes
the file so that it now needs 4 blocks. There are no 2 free blocks near
these two blocks, so instead of allocating 2 additional blocks somewhere
else, I want to allocate a chunk of 4 blocks.

The main problem is choosing a way to invalidate the "old" blocks and
copy their data to new buffers. How is this possible on Linux? Something
like:

	struct buffer_head *oldbh, *newbh;
	memcpy(newbh->b_data, oldbh->b_data, blocksize);
	block_invalidatepage(oldbh->b_this_page, ...);

[is block_invalidatepage the right choice?]

Or is it possible to just change b_blocknr?

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: fragmentation && blocks "realloc"
  2006-01-20 11:47 fragmentation && blocks "realloc" Jan Koss
@ 2006-01-20 13:34 ` Anton Altaparmakov
  2006-01-20 15:46   ` Jan Koss
  0 siblings, 1 reply; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-20 13:34 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel

On Fri, 2006-01-20 at 14:47 +0300, Jan Koss wrote:
> Hello.
>
> Let's suppose that we have file which consist of two blocks
> and user resizing file and now we need 4 blocks.
>
> Near this two blocks there are no 2 free blocks,
> and instead of allocating 2 additional blocks somewhere,
> I want allocate chunk of 4 blocks.
>
> The main problem is choose way of invalidate "old" blocks and copy
> data to new buffers,
>
> how it possible on linux?
>
> something like
> struct buffer_head *oldbh, *newbh;
> memcpy(newbh->b_data, oldbh->b_data);
> block_invalidatepage(oldbh->b_this_page,...)

No need to invalidate or copy anything as long as you are working inside
a file system driver and those buffers are attached to the page cache of
a file.

> or it is possible just change b_blocknr?

Yes, just change b_blocknr and mark the buffer dirty so it gets written
out to the new location, or indeed you can do the write (or the
submission thereof) yourself if you want.

Note that since you are effectively "allocating" the buffer(s), after
you have done the block allocation on your file system and updated
bh->b_blocknr, you need to call unmap_underlying_metadata(bh->b_bdev,
bh->b_blocknr); for each block before you write it out, or your write
could get trampled on by a different write.

And of course do not forget to deallocate the two blocks you just freed
in your fs...
(-:

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
  2006-01-20 13:34 ` Anton Altaparmakov
@ 2006-01-20 15:46   ` Jan Koss
  2006-01-20 19:22     ` Jan Koss
  2006-01-20 20:04     ` Anton Altaparmakov
  0 siblings, 2 replies; 13+ messages in thread
From: Jan Koss @ 2006-01-20 15:46 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: kernelnewbies, linux-fsdevel

In fact, I expected "yes" for the first possibility and "no" for the
second :)

Now the code looks like:

	bh = sb_bread(sb, oldblock);
	if (!bh)
		goto err;
	bh->b_blocknr = newblk;
	mark_buffer_dirty(bh);
	unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);

Let's suppose this test case: after the situation I described in the
first email, the user resizes the file again and the new size is 5
blocks, and there are no free blocks except the 2 blocks which we
deallocated in the first email, so we have to allocate them.

When I reproduced this test case, I got messages like this from the
kernel:

	__find_get_block_slow failed block=oldblock...

So as far as I can see, I have missed something in the "art of changing
b_blocknr". The error in __find_get_block_slow can only happen if all
buffers on the page are mapped. Maybe this is because I changed the
buffer_head's b_blocknr but did not change b_this_page?

On 1/20/06, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> On Fri, 2006-01-20 at 14:47 +0300, Jan Koss wrote:
> > Hello.
> >
> > Let's suppose that we have file which consist of two blocks
> > and user resizing file and now we need 4 blocks.
> >
> > Near this two blocks there are no 2 free blocks,
> > and instead of allocating 2 additional blocks somewhere,
> > I want allocate chunk of 4 blocks.
> >
> > The main problem is choose way of invalidate "old" blocks and copy
> > data to new buffers,
> >
> > how it possible on linux?
> >
> > something like
> > struct buffer_head *oldbh, *newbh;
> > memcpy(newbh->b_data, oldbh->b_data);
> > block_invalidatepage(oldbh->b_this_page,...)
>
> No need to invalidate or copy anything as long as you are working inside
> a file system driver and those buffers are attached to page cache of a
> file.
>
> > or it is possible just change b_blocknr?
>
> Yes, just change b_blocknr, and mark the buffer dirty so it gets written
> out to the new location or indeed you can do the write (or submission
> thereof) yourself if you want.
>
> Note since you are effectively "allocating" the buffer(s), after you
> have done the block allocation on your file system and updated
> bh->b_blocknr, you need to call unmap_underlying_metadata(bh->b_bdev,
> bh->b_blocknr); for each block before you write it out or your write
> could get trampled on by a different write.
>
> And of course do not forget to deallocate the two blocks you just freed
> in your fs... (-:
>
> Best regards,
>
> Anton
> --
> Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
> Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
> Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
> WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
  2006-01-20 15:46   ` Jan Koss
@ 2006-01-20 19:22     ` Jan Koss
  2006-01-20 20:11       ` Anton Altaparmakov
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Koss @ 2006-01-20 19:22 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: kernelnewbies, linux-fsdevel

In comparison with this:

> bh = sb_bread(sb, oldblock);
> if (!bh)
> 	goto err;
> bh->b_blocknr = newblk;
> mark_buffer_dirty (bh);
> unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);

code like this did not cause any warnings:

	struct buffer_head *newbh;

	bh = sb_bread(sb, oldblock);
	newbh = sb_bread(sb, newblock);
	if (!bh || !newbh)
		goto err;

	memcpy(newbh->b_data, bh->b_data, sb->s_blocksize);
	mark_buffer_dirty(newbh);
	brelse(bh);
	brelse(newbh);
	invalidate_inode_buffers(inode);

But it is not optimal; what you suggest is much better, but...
* Re: fragmentation && blocks "realloc"
  2006-01-20 19:22     ` Jan Koss
@ 2006-01-20 20:11       ` Anton Altaparmakov
  2006-01-21  9:42         ` Jan Koss
  0 siblings, 1 reply; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-20 20:11 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel

On Fri, 20 Jan 2006, Jan Koss wrote:
> In comparison with this
> > bh = sb_bread(sb, oldblock);
> > if (!bh)
> > goto err;
> > bh->b_blocknr = newblk;
> > mark_buffer_dirty (bh);
> > unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
>
> this code like this didn't cause any "warrnings":
> struct buffer_head *newbh;
>
> bh = sb_bread(sb, oldblock);
> newbh = sb_bread(sb, newblock);
> if (!(bh || newbh))
> goto err;
>
> memcpy(newbh->b_data, bh->b_data, sb->s_blocksize);
> mark_buffer_dirty(newbh);
> brelse(bh);
> brelse(newbh);
> invalidate_inode_buffers(inode);

Yes, that is almost correct. Although it is wrong. (-;

You do not want the invalidate_inode_buffers() call. It makes no sense
for your fs at all, given how you are dealing with the buffers with
sb_bread()/brelse()... Your method never attaches buffers to the inode,
so there is no point in trying to invalidate anything. It will all just
work fine. (Unless you have omitted to say things about your fs that are
important. Why don't you show all your code rather than just those
snippets, and then proper advice can be given...)

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
  2006-01-20 20:11       ` Anton Altaparmakov
@ 2006-01-21  9:42         ` Jan Koss
  2006-01-21 20:28           ` Anton Altaparmakov
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Koss @ 2006-01-21 9:42 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: kernelnewbies, linux-fsdevel

> It will all just work
> fine. (Unless you have omitted to say things about your fs that are
> important. Why don't you show all your code rather than just those
> snippets and then proper advice can be given...)

The fs is just a simple analog of ufs/ext2/minix/sysv. It is block
oriented, and I suppose that working with pages instead of blocks makes
it more complicated than it should be.

> You do not want the invalidate_inode_buffers() call. It makes no sense

Great, we have reached the point. Yes, my file system is based on using
sb_bread/brelse. Like an ordinary file system, my file system implements
readpage and writepage, similar to:

	static int sysv_writepage(struct page *page, struct writeback_control *wbc)
	{
		return block_write_full_page(page, get_block, wbc);
	}

	static int sysv_readpage(struct file *file, struct page *page)
	{
		return block_read_full_page(page, get_block);
	}

and get_block does something like:

	map_bh(...);

So, when we "realloc" blocks, what happens to these "old" mapped (or
otherwise used) blocks? How can I prevent use of the "old" blocks
instead of the "new" blocks? Or if I mark both the old and the new
blocks dirty, will everything be all right?

--
Kernelnewbies: Help each other learn about the Linux kernel.
Archive: http://mail.nl.linux.org/kernelnewbies/
FAQ: http://kernelnewbies.org/faq/
* Re: fragmentation && blocks "realloc"
  2006-01-21  9:42         ` Jan Koss
@ 2006-01-21 20:28           ` Anton Altaparmakov
  2006-01-22 20:58             ` Jan Koss
  0 siblings, 1 reply; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-21 20:28 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel

On Sat, 21 Jan 2006, Jan Koss wrote:
> >It will all just work
> >fine. (Unless you have omitted to say things about your fs that are
> >important. Why don't you show all your code rather than just those
> >snippets and then proper advice can be given...)
>
> fs is just a simple analog of ufs/ext2/minix/sysv. It is block

All of the above are page cache users, not block device oriented at all.

> oriented, and I suppose that working with pages, instead of blocks
> make it more complicated, then it should to be.

It also makes it very slow not to use the page cache...

> >You do not want the invalidate_inode_buffers() call. It makes no sense
>
> Great, we reached the point.
> Yes, my file system based on usage sb_bread/brelse.

Right, so nothing like the other file systems you compare yourself to,
then. None of them are sb_bread/brelse based.

> As ordinary file system my file system implements readpage and writepage,
> it is similar to
> static int sysv_writepage(struct page *page, struct writeback_control *wbc)
> {
> return block_write_full_page(page,get_block,wbc);
> }
> static int sysv_readpage(struct file *file, struct page *page)
> {
> return block_read_full_page(page,get_block);
> }
>
> get_block make such thing
> map_bh(...)

Err, so you are page cache based and not sb_bread/brelse based at all. I
think you are confused... (-;

> So, when we "realloc" blocks, what happen with these "old" mapped (or
> used in some other way) blocks?
>
> How can I prevent usage "old" blocks instead of "new" blocks?
> Or if I mark old and new blocks as dirty all will be right?

You cannot do the reallocation using your method if the above page cache
functions are used like that by your fs.
You need to do it the way I showed you first, i.e. without
sb_bread/brelse, as those make no sense whatsoever for you. (They access
the block device directly, completely bypassing the page cache, so you
are breaking cache coherency and are 100% broken by design.)

You seem to be extremely confused, I am afraid. The only way to help you
is to see your whole file system code, unless you start becoming clearer
about what you are really doing, so that you don't keep making
contradictory statements in two successive sentences...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
  2006-01-21 20:28           ` Anton Altaparmakov
@ 2006-01-22 20:58             ` Jan Koss
  2006-01-22 21:32               ` Anton Altaparmakov
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Koss @ 2006-01-22 20:58 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: kernelnewbies, linux-fsdevel

Hello.

> (They access the block device directly, completely
> bypassing the page cache so you are breaking cache coherency and are 100%
> broken by design.)

Oh... I thought that starting from 2.4.x there is no separate
implementation for working with blocks versus pages: when you read a
block, the kernel reads the whole page. Am I wrong?

> They only way to help you
> is to see your whole file system code

If we need something concrete for the discussion, let's talk about minix
v1 (my file system derives from this code). Let's suppose I want to make
the block allocation algorithm in fs/minix/bitmap.c:minix_new_block()
more intelligent. I should say that the minix code uses sb_bread/brelse
and also works with pages (for example fs/minix/dir.c).

So instead of allocating one additional block, I want to "realloc" the
blocks, so that the whole file occupies several consecutive blocks. And
we stop at code like this:

	bh->b_blocknr = newblk;
	unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
	mark_buffer_dirty(bh);

And the question is: how should I get this _bh_ if I cannot use
sb_bread?
* Re: fragmentation && blocks "realloc"
  2006-01-22 20:58             ` Jan Koss
@ 2006-01-22 21:32               ` Anton Altaparmakov
  2006-01-22 22:05                 ` Jan Koss
  0 siblings, 1 reply; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-22 21:32 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel

Hi,

On Sun, 22 Jan 2006, Jan Koss wrote:
> >(They access the block device directly, completely
> >bypassing the page cache so you are breaking cache coherency and are 100%
> >broken by design.)
>
> Oh... I thought that start from 2.4.x there are no separate implementation
> of working with blocks and pages, when you read block, kernel read whole page,
> am I wrong?

There is a very big difference. If you do sb_bread() you are reading a
block from the block device. And yes, this block is attached to a page,
but it is a page belonging to the block device address space mapping.
You cannot do anything to this block other than read/write it.

If you use the page cache to access the contents of a file, then that
file (or more precisely the inode of that file) will have an address
space mapping of its own, completely independent of the address space
mapping of the block device inode. Those pages will (or will not) have
buffers attached to them (your get_block() callback is there exactly to
allow the buffers to be created and mapped if they are not there). Those
buffers will be part of the file page cache page, thus part of the
inode's address space mapping, and those buffers have no meaning other
than to say "the data in this part of the page belongs to block device
so-and-so and to block number so-and-so on that block device". So you
can change b_blocknr on those buffers to your heart's content (well, you
need to observe the necessary locking so buffers under i/o don't get
screwed) and that is no problem.

Note that the buffers from the block device address space mapping are
COMPLETELY separate from the buffers from a file inode address space
mapping.
So writes from one are NOT seen in the other, and you can NEVER mix the
two forms of i/o and expect to have a working file system. You will get
random results and tons of weird data corruption that way.

> > They only way to help you
> > is to see your whole file system code
>
> If we need some handhold for discussion, lets talk about minix v.1
> (my file system derive from this code).
> Lets suppose I want make algorigth of allocation blocks in
> fs/minix/bitmap.c: minix_new_block more inteligent.
>
> I should say that minix code use sb_bread/brelse and work with pages (for
> example fs/minix/dir.c).

Er, not on current kernels:

	$ grep bread linux-2.6/fs/minix/*
	bitmap.c:	*bh = sb_bread(sb, block);
	bitmap.c:	*bh = sb_bread(sb, block);
	inode.c:	if (!(bh = sb_bread(s, 1)))
	inode.c:	if (!(sbi->s_imap[i]=sb_bread(s, block)))
	inode.c:	if (!(sbi->s_zmap[i]=sb_bread(s, block)))
	itree_common.c:	bh = sb_bread(sb, block_to_cpu(p->key));
	itree_common.c:	bh = sb_bread(inode->i_sb, nr);

Are you working on 2.4 by any chance? If you are writing a new fs I
would strongly recommend working against 2.6 kernels, otherwise you are
writing something that is already out of date...

The only thing minix in the current 2.6 kernel uses bread for is to read
the on-disk inodes themselves. It never uses it to access file data at
all, and I very much doubt that even old 2.4 kernels ever used bread for
anything that is not strictly metadata rather than file data.

> So instead of allocation one additional block,
> I want "realloc" blocks, so all file will occupy several consecutive blocks.
>
> And we stop on such code
> bh->b_blocknr = newblk;
> unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
> mark_buffer_dirty (bh);
>
> And question how should I get this _bh_, if I can not use sb_bread?

That depends entirely on which function / which call path you are in at
present.
Taking minix as an example, tell me the call path where you end up
wanting to do the above and I will tell you where to get the bh
from... (-:

Btw., don't think this is all that easy. If you want to keep whole files
rather than whole pages of buffers in consecutive blocks, you are in for
some very serious fun with multi-page locking and/or complete i/o
serialisation, i.e. when a write is happening, all other writes on the
same file will just block...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
  2006-01-22 21:32               ` Anton Altaparmakov
@ 2006-01-22 22:05                 ` Jan Koss
  2006-01-24 10:37                   ` Anton Altaparmakov
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Koss @ 2006-01-22 22:05 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: kernelnewbies, linux-fsdevel

On 1/23/06, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
...
> Note that the buffers from the block device address space mapping are
> COMPLETELY separate from the buffers from a file inode address space
> mapping. So writes from one are NOT seen in the other and you NEVER can
> mix the two forms of i/o and expect to have a working file system. You
> will get random results and tons of weird data corruption that way.

Thanks a lot, this has made several important things clear to me.

> That depends entirely in which function you are / which call path you are
> in at present. Taking minix as an example, tell me the call path where
> you end up wanting to do the above and I will tell you where to get the bh
> from... (-:

I was talking about 2.6.15. In fs/minix/bitmap.c there is
minix_new_block; we reach it from get_block in fs/minix/itree_common.c.

After analyzing the blocks<->file mapping, I want to move some blocks to
another location and update the page cache correspondingly. What should
I do?
* Re: fragmentation && blocks "realloc"
  2006-01-22 22:05                 ` Jan Koss
@ 2006-01-24 10:37                   ` Anton Altaparmakov
  2006-02-23 21:47                     ` Nate Diller
  0 siblings, 1 reply; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-24 10:37 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel

On Mon, 23 Jan 2006, Jan Koss wrote:
> On 1/23/06, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > That depends entirely in which function you are / which call path you are
> > in at present. Taking minix as an example, tell me the call path where
> > you end up wanting to do the above and I will tell you where to get the bh
> > from... (-:
>
> I told about 2.6.15.
>
> in fs/minix/bitmap.c there is minix_new_block we come in it from get_block in
> fs/minix/itree_common.c.
>
> After analizing blocks<->file I want move some blocks to another location
> and update page cache correspondingly, what should I do?

<Argh, I just spent ages writing an email and it got lost when the
internet connection died... I only have what was visible on the terminal
screen, so starting again on the rest...>

You cannot do what you want from such a low level because the upper
layers hold locks that you need. For example, a
readpage/writepage/prepare_write can be running concurrently with
get_block(), and even other instances of get_block() can be running at
the same time, so it would be unsafe to do any sort of reallocation
there. So you have to scrap that idea.

You could do it at a higher level, i.e. in the file ->write itself, but
again this introduces a lot of complexity into your file system.

Basically, what you are trying to do is much harder than you think and
involves a lot of work...

There is a possible alternative, however.
Your get_block function could take a reference on the inode (i_count),
set a flag in the file system specific inode ("need realloc"), and add
the inode to the queue of a "realloc daemon" for your fs, which is just
a kernel thread that runs periodically, say every five seconds. It takes
inodes one after the other from its queue, then takes all the locks
necessary to do this (e.g. i_mutex on the inode as well as
i_alloc_sem/i_alloc_mutex - whatever it is called now). Note you will
probably need an extra lock to prevent entry into readpage/writepage
whilst this is happening: your readpage/writepage takes that lock for
reading whilst your daemon takes it for writing, so multiple
read/writepages can run simultaneously but your daemon runs exclusive.

Then, if the inode is marked "need realloc", the daemon allocates a
contiguous chunk of space equal to the file size, clears the
"need realloc" bit, and does the reallocation by starting at the first
page (index 0) and working upwards, getting each page (warning: deadlock
is possible with a readpage or writepage holding that page's lock while
blocked on your "realloc lock", so maybe trylock, and if that fails,
abort and requeue the inode at the end of the daemon's queue). Then,
when you have a page, loop over its buffers, and for each buffer move it
from the old allocation to the new one as I described earlier (i.e. just
change b_blocknr, invalidate the underlying metadata, mark the buffer
dirty).

That or something similar should work with minimal impact on your
existing fs code. And it has the huge benefit of performing the reallocs
in the background. Otherwise your original idea would be disastrous for
performance. Imagine an 8G file that you are appending data to.
Every time you append a new block you may end up having to reallocate
the file from inside your get_block (you don't know that more writes are
coming in a second), and each time it will take a few minutes, so each
little write will hang the system for a few minutes - hardly what you
want... And the daemon at least batches things in 5 second intervals, so
multiple "need realloc" settings on an inode will be done in one go
every 5 seconds.

You know, if it were that easy to keep fragmentation close or even equal
to zero at all times without impact on performance, all file systems
would already be doing it. (-;

Hope this gives you a starting point if nothing else.

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fragmentation && blocks "realloc"
  2006-01-24 10:37                   ` Anton Altaparmakov
@ 2006-02-23 21:47                     ` Nate Diller
  0 siblings, 0 replies; 13+ messages in thread
From: Nate Diller @ 2006-02-23 21:47 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: Jan Koss, kernelnewbies, linux-fsdevel

On 1/24/06, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> On Mon, 23 Jan 2006, Jan Koss wrote:
> > On 1/23/06, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > > That depends entirely in which function you are / which call path you are
> > > in at present. Taking minix as an example, tell me the call path where
> > > you end up wanting to do the above and I will tell you where to get the bh
> > > from... (-:
> >
> > I told about 2.6.15.
> >
> > in fs/minix/bitmap.c there is minix_new_block we come in it from get_block in
> > fs/minix/itree_common.c.
> >
> > After analizing blocks<->file I want move some blocks to another location
> > and update page cache correspondingly, what should I do?
>
> <Argh I just spent ages writing an email and it got lost when the internet
> connection died... I only have what was visible on the terminal screen,
> so starting again on the rest...>
>
> You cannot do what you want from such a low level because the upper layers
> hold locks that you need. For example a readpage/writepage/prepare_write
> can be running concurrently with get_block() and even other instances of
> get_block() can be running at the same time and it would then be unsafe to
> do any sort of reallocation. So you have to scrap that idea.
>
> You could do it in higher up levels, i.e. in file ->write itself but again
> this introduces a lot of complexity to your file system.
>
> Basically what you are trying to so is much harder than you think and
> involves a lot of work...
>
> There is a possible alternative however.
> Your get_block function could
> take a reference on the inode (i_count), set a flag in the file system
> specific inode "need realloc" and add the inode to a queue of a
> "realloc-demon" for your fs which is just a kernel thread which will run
> periodically, say every five seconds, and it will take inodes one after
> the other from its queue, then take all necessary locks so you can do this
> (e.g i_mutex on the inode as well as i_alloc_sem/i_alloc_mutex - whatever
> it is called now) - note you will probably need an extra lock to prevent
> entry into readpage/writepage whilst this is happening and your
> readpage/writepage will need to take that lock for reading whilst your
> daemon takes it for writing so multiple read/writepage can run
> simultaneously but your deamon runs exclusive.
>
> Then, if the inode is marked "need realloc" it will allocate a contiguous
> chunk of space equal to the file size, clear the "need-realloc" bit, do
> the reallocation by starting at the first page (index 0) and working
> upwards, getting it (warning: deadlock possible with a read or writepage
> holding that page's lock and blocked on your "realloc lock" so maybe
> trylock and if fails abort and requeue the inode to the daemon at the end
> of the queue), then when you have a page, loop around its buffers and for
> each buffer move it from the old allocation to the new one as I described
> earlier (i.e. just change b_blocknr, invalidate underlying metadata, mark
> the buffer dirty).
>
> That or something simillar should work with minimal impact on your
> existing fs code. And it has the huge benefit or performing the reallocs
> in the back ground. Otherwise your original idea would be disastrous to
> performance. Imagine a 8G file that you are appending data to.
> Every time you append a new block you may end up having to reallocate the file
> from inside your get_block (you don't know that more writes are coming in
> a second) and each time it will take a few minutes so each little write
> will hang the system for a few minutes - hardly what you want...
>
> And the daemon at least batches things in 5 second intervals so multiple
> "need realloc" settings on an inode will be done in one go every 5
> seconds.
>
> You know, if it was that easy to keep fragmentation close or even equal to
> zero at all times without impact on performance, all file systems would be
> already doing that. (-;

Well, the above is a reasonable solution, but if you were willing to put
up with more allocation and flush complexity, you could try a strict
allocate-on-flush design. Just read in a page and promptly unmap it;
then you don't have to worry about the CPU overhead until flush time,
when you map all the pages and write them out. That would give the
lowest amount of fragmentation you can get without a repacker of some
sort.

It's not even all that hard unless you try supporting file holes,
transactions, non-4k blocks, or other complexities. There are also
potential OOM issues if you are using something as old-fashioned as
bitmaps in your allocation code and need to read them in under memory
pressure...

NATE
* Re: fragmentation && blocks "realloc"
  2006-01-20 15:46   ` Jan Koss
  2006-01-20 19:22     ` Jan Koss
@ 2006-01-20 20:04     ` Anton Altaparmakov
  1 sibling, 0 replies; 13+ messages in thread
From: Anton Altaparmakov @ 2006-01-20 20:04 UTC (permalink / raw)
To: Jan Koss; +Cc: kernelnewbies, linux-fsdevel

On Fri, 20 Jan 2006, Jan Koss wrote:
> In fact, I expected yes for the first abbility and no for the second :)
>
> Now code looks like:
> bh = sb_bread(sb, oldblock);
> if (!bh)
> goto err;
> bh->b_blocknr = newblk;
> mark_buffer_dirty (bh);
> unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);

No, no, no!!! You cannot do this. You are not using the page cache for
your file system (why not?), so you cannot remap buffers like I
suggested. Even if you could, your code is wrong: you need the
unmap_underlying_metadata() _before_ mark_buffer_dirty().

> Let's suppose such test case:
> after situation, which I described in the first email,
> user resize file and new size 5 blocks,
> and there are no free blocks except 2 blocks which we deallocated in
> the frist email,
> so we have to allocate them.
>
> When I reproduced this test case, I got such messages from kernel:
> __find_get_block_slow failed block=oldblock...
>
> So as I can see I missed something in "art of changing b_blocknr".
>
> Error in __find_get_block_slow may happen only if all buffers on page mapped.
>
> May be this is because of buffer_head change b_blocknr, but didn't
> change b_this_page?

You cannot touch b_this_page on buffers you access via sb_bread(). The
correct solution for a file system like yours would be to copy the
buffer data to the new buffer, write that, and release the old one -
i.e. your first suggestion - and do not touch b_blocknr or b_this_page.
And you do not need to call unmap_underlying_metadata() either, or
invalidate any pages. You are working with the block device directly,
bypassing the per-file page cache, thus you cannot do anything to the
buffers at all other than read/write them.
It would be far better if you started using the page cache (via
->readpage, ->writepage, and probably ->prepare_write and ->commit_write
as well); then, from inside
readpage/writepage/prepare_write/commit_write, you can do with the
buffers as I suggested...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
end of thread, other threads:[~2006-02-23 21:47 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-20 11:47 fragmentation && blocks "realloc" Jan Koss
2006-01-20 13:34 ` Anton Altaparmakov
2006-01-20 15:46   ` Jan Koss
2006-01-20 19:22     ` Jan Koss
2006-01-20 20:11       ` Anton Altaparmakov
2006-01-21  9:42         ` Jan Koss
2006-01-21 20:28           ` Anton Altaparmakov
2006-01-22 20:58             ` Jan Koss
2006-01-22 21:32               ` Anton Altaparmakov
2006-01-22 22:05                 ` Jan Koss
2006-01-24 10:37                   ` Anton Altaparmakov
2006-02-23 21:47                     ` Nate Diller
2006-01-20 20:04     ` Anton Altaparmakov