* Problem with delayed allocation
@ 2008-08-02 20:07 Theodore Ts'o
2008-08-02 22:40 ` Theodore Tso
` (3 more replies)
0 siblings, 4 replies; 17+ messages in thread
From: Theodore Ts'o @ 2008-08-02 20:07 UTC (permalink / raw)
To: Aneesh Kumar K.V; +Cc: linux-ext4
Apparently __fsync_super(), which is called right before remounting a
filesystem read-only, isn't working correctly. To reproduce, create a
script which does this:
#!/bin/sh
DEVICE=/dev/closure/test
mke2fs -t ext4dev /dev/closure/test
mount $DEVICE /mnt
cd /mnt
tar xfj /var/tmp/linux-2.6.26.tar.gz <----- or some really big file
du -s
cd ..
mount -o remount,ro /mnt
sync
dmesg > /tmp/dmesg.out <----- note all of the ext4_da_writepages error messages
umount /mnt
du -s /mnt
sync
mount $DEVICE /mnt
du -s /mnt <--- note that size of the unpacked hierarcy is much smaller
This doesn't happen if the ext4 filesystem is mounted with nodelalloc,
so I assume the problem is in ext4_da_writepages().
Aneesh, can you look at this? I've tried going through the code paths
starting with __fsync_super(), going down through __sync_single_inode(),
and I can't see anything obvious.
I've checked and we've had this problem for a while. I don't think this
is a recent regression. The "sync" command does seem to force file data
out, but it looks like we're not properly waiting for writes to complete
before __fsync_super() returns. There is a call filemap_fdatawait() in
__sync_single_inode(), but it's apparently not doing the right thing.
Aneesh, can you try to find whatever it is that I missed? Thanks!!
- Ted
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: Problem with delayed allocation 2008-08-02 20:07 Problem with delayed allocation Theodore Ts'o @ 2008-08-02 22:40 ` Theodore Tso 2008-08-04 3:16 ` Aneesh Kumar K.V ` (2 subsequent siblings) 3 siblings, 0 replies; 17+ messages in thread From: Theodore Tso @ 2008-08-02 22:40 UTC (permalink / raw) To: Aneesh Kumar K.V, cmm; +Cc: linux-ext4 One update on the delayed allocation writes not getting pushed problem; the problem existed with the patch queue as of July 10th, but it is *much* worse with recent kernels. I found out why; the V2 version of the journal writepages patch had this comment: 1) fix writepages() inefficency issue. sync() will invoke writepages() twice (not sure exactly why), the second time all the pages are clean so it waste the cpu time to walk though all pages and find they are not dirty . But it's simple to workaround by skip writepages() if there is no dirty pages pointed by the mapping. It is quite deliberate for sync to invoke writepages() twice. __fsync_super() calls sync_inodes_sb() twice, once with wait=0, and once with want=1. This results in sync_inodes_sb() calling sync_sb_inodes(), first with wbc.sync_mode set to WB_SYNC_HOLD (in the wait==0 case), and then later with wbc.sync_mode set to WB_SYNC_ALL (in the wait==1 case). sync_sb_inodes() calls generic_sync_sb_inodes(), which iterates over all inodes in sb->s_io, and calls __writeback_single(inode, wbc), which will write out the dirty pages by calling __sync_single_inode(). __sync_single_inode is responsible for writing out a single inode's dirty pages and inode data to disk. If wbc.sync_mode is set to WB_SYNC_ALL, it will wait for the writeout by calling filemap_fdatawait() after it calls do_writepages(). It is do_writepages(mapping, wbc), which calls the filesystem-specific writepages (i.e., the ext4_da_writepages) or generic_writepages(). When writing out a 300 megabytes of a Linux kernel source, approximately 100 megabytes of data don't make it to disk if the filesystem is remounted read-only right after the untar finishes. In an earlier version of the patch series, without the journal credits patch, only 20 megabytes of data is lost. So the journal_credits patch seems to make things worse. For this reason, I'm not going to push the journal credits patch to Linus right away, because I suspect it is implicated in a regression. I'm also going to suggest that we split up the patch into simpler pieces, so we can audit the obvious bits and get them upstream more quickly. The data writeback paths are *complicated* (possibly unduly complicated), and I'm not sure I understand them completely. So we're going to need to back off and study this carefully. This is why I think we really are going to have to break up this patch. - Ted ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Problem with delayed allocation 2008-08-02 20:07 Problem with delayed allocation Theodore Ts'o 2008-08-02 22:40 ` Theodore Tso @ 2008-08-04 3:16 ` Aneesh Kumar K.V 2008-08-04 14:08 ` Theodore Tso 2008-08-04 14:52 ` Aneesh Kumar K.V 2008-08-04 16:35 ` Aneesh Kumar K.V 3 siblings, 1 reply; 17+ messages in thread From: Aneesh Kumar K.V @ 2008-08-04 3:16 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4 On Sat, Aug 02, 2008 at 04:07:19PM -0400, Theodore Ts'o wrote: > > Apparently __fsync_super(), which is called right before remounting a > filesystem read-only, isn't working correctly. To reproduce, create a > script which does this: > > #!/bin/sh > DEVICE=/dev/closure/test > mke2fs -t ext4dev /dev/closure/test > mount $DEVICE /mnt > cd /mnt > tar xfj /var/tmp/linux-2.6.26.tar.gz <----- or some really big file > du -s > cd .. > mount -o remount,ro /mnt > sync > dmesg > /tmp/dmesg.out <----- note all of the ext4_da_writepages error messages > umount /mnt > du -s /mnt > sync > mount $DEVICE /mnt > du -s /mnt <--- note that size of the unpacked hierarcy is much smaller > > This doesn't happen if the ext4 filesystem is mounted with nodelalloc, > so I assume the problem is in ext4_da_writepages(). > > Aneesh, can you look at this? I've tried going through the code paths > starting with __fsync_super(), going down through __sync_single_inode(), > and I can't see anything obvious. > > I've checked and we've had this problem for a while. I don't think this > is a recent regression. The "sync" command does seem to force file data > out, but it looks like we're not properly waiting for writes to complete > before __fsync_super() returns. There is a call filemap_fdatawait() in > __sync_single_inode(), but it's apparently not doing the right thing. > Aneesh, can you try to find whatever it is that I missed? Thanks!! > __fsync_super use filemap_fdatawait(mapping) for waiting on writeback pages. But all the dirty pages of the inode are not in writeback because we might have had block allocation failures. Also with the current code base I am seeing buffer_heads which are unmapped, non delay and dirty That means writepages won't allocate block for them and writepage cannot write them. -aneesh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Problem with delayed allocation 2008-08-04 3:16 ` Aneesh Kumar K.V @ 2008-08-04 14:08 ` Theodore Tso 0 siblings, 0 replies; 17+ messages in thread From: Theodore Tso @ 2008-08-04 14:08 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: linux-ext4 On Mon, Aug 04, 2008 at 08:46:52AM +0530, Aneesh Kumar K.V wrote: > __fsync_super use filemap_fdatawait(mapping) for waiting on writeback > pages. But all the dirty pages of the inode are not in writeback because > we might have had block allocation failures. Yes, but that should only happen if the filesystem is full or a user's quota is overrun, correct? > Also with the current code base I am seeing buffer_heads which are > unmapped, non delay and dirty That means writepages won't allocate > block for them and writepage cannot write them. I thought all writes went through the page cache? Are you saying that the *pages* are clean but the buffer_heads are marked dirty? In that case, ext4_da_writepages, if wbc.sync_mode is not WB_SYNC_NONE, *must* wait on them and not return until the buffers are safely on disk, since filemap_fdatawait(mapping) won't in __sync_single_inode() won't do the waiting for the writepages routine. - Ted ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Problem with delayed allocation 2008-08-02 20:07 Problem with delayed allocation Theodore Ts'o 2008-08-02 22:40 ` Theodore Tso 2008-08-04 3:16 ` Aneesh Kumar K.V @ 2008-08-04 14:52 ` Aneesh Kumar K.V 2008-08-04 15:27 ` Aneesh Kumar K.V 2008-08-04 16:35 ` Aneesh Kumar K.V 3 siblings, 1 reply; 17+ messages in thread From: Aneesh Kumar K.V @ 2008-08-04 14:52 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4 On Sat, Aug 02, 2008 at 04:07:19PM -0400, Theodore Ts'o wrote: > > Apparently __fsync_super(), which is called right before remounting a > filesystem read-only, isn't working correctly. To reproduce, create a > script which does this: > > #!/bin/sh > DEVICE=/dev/closure/test > mke2fs -t ext4dev /dev/closure/test > mount $DEVICE /mnt > cd /mnt > tar xfj /var/tmp/linux-2.6.26.tar.gz <----- or some really big file > du -s > cd .. > mount -o remount,ro /mnt > sync > dmesg > /tmp/dmesg.out <----- note all of the ext4_da_writepages error messages > umount /mnt > du -s /mnt > sync > mount $DEVICE /mnt > du -s /mnt <--- note that size of the unpacked hierarcy is much smaller > > This doesn't happen if the ext4 filesystem is mounted with nodelalloc, > so I assume the problem is in ext4_da_writepages(). > Can you try this patch and see if it makes any difference ? diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 25adfc3..5a8a2d3 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -518,6 +518,7 @@ void generic_sync_sb_inodes(struct super_block *sb, spin_lock(&inode_lock); if (wbc->nr_to_write <= 0) { wbc->more_io = 1; + printk(KERN_CRIT "Breaking from the %s loop\n", __func__); break; } if (!list_empty(&sb->s_more_io)) @@ -611,6 +612,8 @@ void sync_inodes_sb(struct super_block *sb, int wait) (inodes_stat.nr_inodes - inodes_stat.nr_unused) + nr_dirty + nr_unstable; wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */ + wbc.nr_to_write = LONG_MAX; + sync_sb_inodes(sb, &wbc); } ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: Problem with delayed allocation 2008-08-04 14:52 ` Aneesh Kumar K.V @ 2008-08-04 15:27 ` Aneesh Kumar K.V 2008-08-04 15:33 ` Aneesh Kumar K.V 0 siblings, 1 reply; 17+ messages in thread From: Aneesh Kumar K.V @ 2008-08-04 15:27 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4, Mingming Cao On Mon, Aug 04, 2008 at 08:22:49PM +0530, Aneesh Kumar K.V wrote: > On Sat, Aug 02, 2008 at 04:07:19PM -0400, Theodore Ts'o wrote: > > > > Apparently __fsync_super(), which is called right before remounting a > > filesystem read-only, isn't working correctly. To reproduce, create a > > script which does this: > > > > #!/bin/sh > > DEVICE=/dev/closure/test > > mke2fs -t ext4dev /dev/closure/test > > mount $DEVICE /mnt > > cd /mnt > > tar xfj /var/tmp/linux-2.6.26.tar.gz <----- or some really big file > > du -s > > cd .. > > mount -o remount,ro /mnt > > sync > > dmesg > /tmp/dmesg.out <----- note all of the ext4_da_writepages error messages > > umount /mnt > > du -s /mnt > > sync > > mount $DEVICE /mnt > > du -s /mnt <--- note that size of the unpacked hierarcy is much smaller > > > > This doesn't happen if the ext4 filesystem is mounted with nodelalloc, > > so I assume the problem is in ext4_da_writepages(). > > > > Can you try this patch and see if it makes any difference ? > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 25adfc3..5a8a2d3 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -518,6 +518,7 @@ void generic_sync_sb_inodes(struct super_block *sb, > spin_lock(&inode_lock); > if (wbc->nr_to_write <= 0) { > wbc->more_io = 1; > + printk(KERN_CRIT "Breaking from the %s loop\n", __func__); > break; > } > if (!list_empty(&sb->s_more_io)) > @@ -611,6 +612,8 @@ void sync_inodes_sb(struct super_block *sb, int wait) > (inodes_stat.nr_inodes - inodes_stat.nr_unused) + > nr_dirty + nr_unstable; > wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */ > + wbc.nr_to_write = LONG_MAX; > + > sync_sb_inodes(sb, &wbc); > } > I guess this could be the reason. I am not hitting the error during remount, ro with this change. But I have other changes also accumulated as a part of rewrite. diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 4a50445..ecabe77 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2225,8 +2288,10 @@ static int ext4_da_writepages(struct address_space *mapping, if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) return 0; +#if 0 if (wbc->nr_to_write > mapping->nrpages) wbc->nr_to_write = mapping->nrpages; +#endif if (!wbc->range_cyclic) { ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: Problem with delayed allocation 2008-08-04 15:27 ` Aneesh Kumar K.V @ 2008-08-04 15:33 ` Aneesh Kumar K.V 0 siblings, 0 replies; 17+ messages in thread From: Aneesh Kumar K.V @ 2008-08-04 15:33 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4, Mingming Cao Hi Ted, On Mon, Aug 04, 2008 at 08:57:30PM +0530, Aneesh Kumar K.V wrote: > On Mon, Aug 04, 2008 at 08:22:49PM +0530, Aneesh Kumar K.V wrote: > > On Sat, Aug 02, 2008 at 04:07:19PM -0400, Theodore Ts'o wrote: > > > > > > Apparently __fsync_super(), which is called right before remounting a > > > filesystem read-only, isn't working correctly. To reproduce, create a > > > script which does this: > > > > > > #!/bin/sh > > > DEVICE=/dev/closure/test > > > mke2fs -t ext4dev /dev/closure/test > > > mount $DEVICE /mnt > > > cd /mnt > > > tar xfj /var/tmp/linux-2.6.26.tar.gz <----- or some really big file > > > du -s > > > cd .. > > > mount -o remount,ro /mnt > > > sync > > > dmesg > /tmp/dmesg.out <----- note all of the ext4_da_writepages error messages > > > umount /mnt > > > du -s /mnt > > > sync > > > mount $DEVICE /mnt > > > du -s /mnt <--- note that size of the unpacked hierarcy is much smaller > > > > > > This doesn't happen if the ext4 filesystem is mounted with nodelalloc, > > > so I assume the problem is in ext4_da_writepages(). > > > > > > > Can you try this patch and see if it makes any difference ? > > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > > index 25adfc3..5a8a2d3 100644 > > --- a/fs/fs-writeback.c > > +++ b/fs/fs-writeback.c > > @@ -518,6 +518,7 @@ void generic_sync_sb_inodes(struct super_block *sb, > > spin_lock(&inode_lock); > > if (wbc->nr_to_write <= 0) { > > wbc->more_io = 1; > > + printk(KERN_CRIT "Breaking from the %s loop\n", __func__); > > break; > > } > > if (!list_empty(&sb->s_more_io)) > > @@ -611,6 +612,8 @@ void sync_inodes_sb(struct super_block *sb, int wait) > > (inodes_stat.nr_inodes - inodes_stat.nr_unused) + > > nr_dirty + nr_unstable; > > wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */ > > + wbc.nr_to_write = LONG_MAX; > > + > > sync_sb_inodes(sb, &wbc); > > } > > > > > I guess this could be the reason. I am not hitting the error during > remount, ro with this change. But I have other changes also accumulated > as a part of rewrite. > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 4a50445..ecabe77 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -2225,8 +2288,10 @@ static int ext4_da_writepages(struct address_space *mapping, > if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) > return 0; > > +#if 0 > if (wbc->nr_to_write > mapping->nrpages) > wbc->nr_to_write = mapping->nrpages; > +#endif > > > if (!wbc->range_cyclic) { The reason why you are able to reproduce it with the linus tree is because of /* * set the max dirty pages could be write at a time * to fit into the reserved transaction credits */ if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES) wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES; -aneesh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Problem with delayed allocation 2008-08-02 20:07 Problem with delayed allocation Theodore Ts'o ` (2 preceding siblings ...) 2008-08-04 14:52 ` Aneesh Kumar K.V @ 2008-08-04 16:35 ` Aneesh Kumar K.V 2008-08-05 6:44 ` Theodore Tso 3 siblings, 1 reply; 17+ messages in thread From: Aneesh Kumar K.V @ 2008-08-04 16:35 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4 On Sat, Aug 02, 2008 at 04:07:19PM -0400, Theodore Ts'o wrote: > > Apparently __fsync_super(), which is called right before remounting a > filesystem read-only, isn't working correctly. To reproduce, create a > script which does this: > > #!/bin/sh > DEVICE=/dev/closure/test > mke2fs -t ext4dev /dev/closure/test > mount $DEVICE /mnt > cd /mnt > tar xfj /var/tmp/linux-2.6.26.tar.gz <----- or some really big file > du -s > cd .. > mount -o remount,ro /mnt > sync > dmesg > /tmp/dmesg.out <----- note all of the ext4_da_writepages error messages > umount /mnt > du -s /mnt > sync > mount $DEVICE /mnt > du -s /mnt <--- note that size of the unpacked hierarcy is much smaller > > This doesn't happen if the ext4 filesystem is mounted with nodelalloc, > so I assume the problem is in ext4_da_writepages(). > This is the complete patch that I have. I haven't fully tested it (right now waiting for the machine to be free). This should apply after stable-boundary-undo.patch Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5665bec..bfe27d9 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -41,6 +41,8 @@ #include "acl.h" #include "ext4_extents.h" +#define MPAGE_DA_EXTENT_TAIL 0x01 + static inline int ext4_begin_ordered_truncate(struct inode *inode, loff_t new_size) { @@ -1580,6 +1582,8 @@ static void ext4_da_page_release_reservation(struct page *page, unsigned long first_page, next_page; /* extent of pages */ get_block_t *get_block; struct writeback_control *wbc; + int io_done; + long pages_written; }; /* @@ -1599,12 +1603,6 @@ static void ext4_da_page_release_reservation(struct page *page, static int mpage_da_submit_io(struct mpage_da_data *mpd) { struct address_space *mapping = mpd->inode->i_mapping; - struct mpage_data mpd_pp = { - .bio = NULL, - .last_block_in_bio = 0, - .get_block = mpd->get_block, - .use_writepage = 1, - }; int ret = 0, err, nr_pages, i; unsigned long index, end; struct pagevec pvec; @@ -1628,7 +1626,9 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd) break; index++; - err = __mpage_writepage(page, mpd->wbc, &mpd_pp); + err = mapping->a_ops->writepage(page, mpd->wbc); + if (!err) + mpd->pages_written++; /* * In error case, we have to continue because @@ -1640,8 +1640,6 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd) } pagevec_release(&pvec); } - if (mpd_pp.bio) - mpage_bio_submit(WRITE, mpd_pp.bio); return ret; } @@ -1708,6 +1706,13 @@ static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd, sector_t logical, if (buffer_delay(bh)) { bh->b_blocknr = pblock; clear_buffer_delay(bh); + bh->b_bdev = inode->i_sb->s_bdev; + } else if (buffer_unwritten(bh)) { + bh->b_blocknr = pblock; + clear_buffer_unwritten(bh); + set_buffer_mapped(bh); + set_buffer_new(bh); + bh->b_bdev = inode->i_sb->s_bdev; } else if (buffer_mapped(bh)) BUG_ON(bh->b_blocknr != pblock); @@ -1748,8 +1753,8 @@ static inline void __unmap_underlying_blocks(struct inode *inode, */ static void mpage_da_map_blocks(struct mpage_da_data *mpd) { + int err = 0; struct buffer_head *lbh = &mpd->lbh; - int err = 0, remain = lbh->b_size; sector_t next = lbh->b_blocknr; struct buffer_head new; @@ -1759,38 +1764,28 @@ static void mpage_da_map_blocks(struct mpage_da_data *mpd) if (buffer_mapped(lbh) && !buffer_delay(lbh)) return; - while (remain) { - new.b_state = lbh->b_state; - new.b_blocknr = 0; - new.b_size = remain; - err = mpd->get_block(mpd->inode, next, &new, 1); - if (err) { - /* - * Rather than implement own error handling - * here, we just leave remaining blocks - * unallocated and try again with ->writepage() - */ - break; - } - BUG_ON(new.b_size == 0); + new.b_state = lbh->b_state; + new.b_blocknr = 0; + new.b_size = lbh->b_size; + err = mpd->get_block(mpd->inode, next, &new, 1); + if (err) + return; + BUG_ON(new.b_size == 0); - if (buffer_new(&new)) - __unmap_underlying_blocks(mpd->inode, &new); + if (buffer_new(&new)) + __unmap_underlying_blocks(mpd->inode, &new); - /* - * If blocks are delayed marked, we need to - * put actual blocknr and drop delayed bit - */ - if (buffer_delay(lbh)) - mpage_put_bnr_to_bhs(mpd, next, &new); + /* + * If blocks are delayed marked, we need to + * put actual blocknr and drop delayed bit + */ + if (buffer_delay(lbh) || buffer_unwritten(lbh)) + mpage_put_bnr_to_bhs(mpd, next, &new); - /* go for the remaining blocks */ - next += new.b_size >> mpd->inode->i_blkbits; - remain -= new.b_size; - } + return; } -#define BH_FLAGS ((1 << BH_Uptodate) | (1 << BH_Mapped) | (1 << BH_Delay)) +#define BH_FLAGS ((1 << BH_Uptodate) | (1 << BH_Mapped) | (1 << BH_Delay) | (1 << BH_Unwritten)) /* * mpage_add_bh_to_extent - try to add one more block to extent of blocks @@ -1832,13 +1827,9 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd, * need to flush current extent and start new one */ mpage_da_map_blocks(mpd); - - /* - * Now start a new extent - */ - lbh->b_size = bh->b_size; - lbh->b_state = bh->b_state & BH_FLAGS; - lbh->b_blocknr = logical; + mpage_da_submit_io(mpd); + mpd->io_done = 1; + return; } /* @@ -1858,6 +1849,17 @@ static int __mpage_da_writepage(struct page *page, struct buffer_head *bh, *head, fake; sector_t logical; + if (mpd->io_done) { + /* + * Rest of the page in the page_vec + * redirty then and skip then. We will + * try to to write them again after + * starting a new transaction + */ + redirty_page_for_writepage(wbc, page); + unlock_page(page); + return MPAGE_DA_EXTENT_TAIL; + } /* * Can we merge this page to current extent? */ @@ -1869,6 +1871,13 @@ static int __mpage_da_writepage(struct page *page, if (mpd->next_page != mpd->first_page) { mpage_da_map_blocks(mpd); mpage_da_submit_io(mpd); + /* + * skip rest of the page in the page_vec + */ + mpd->io_done = 1; + redirty_page_for_writepage(wbc, page); + unlock_page(page); + return MPAGE_DA_EXTENT_TAIL; } /* @@ -1899,6 +1908,8 @@ static int __mpage_da_writepage(struct page *page, set_buffer_dirty(bh); set_buffer_uptodate(bh); mpage_add_bh_to_extent(mpd, logical, bh); + if (mpd->io_done) + return MPAGE_DA_EXTENT_TAIL; } else { /* * Page with regular buffer heads, just add all dirty ones @@ -1907,8 +1918,11 @@ static int __mpage_da_writepage(struct page *page, bh = head; do { BUG_ON(buffer_locked(bh)); - if (buffer_dirty(bh)) + if (buffer_dirty(bh) && (!buffer_mapped(bh) || buffer_delay(bh))) { mpage_add_bh_to_extent(mpd, logical, bh); + if (mpd->io_done) + return MPAGE_DA_EXTENT_TAIL; + } logical++; } while ((bh = bh->b_this_page) != head); } @@ -1943,6 +1957,7 @@ static int mpage_da_writepages(struct address_space *mapping, get_block_t get_block) { struct mpage_da_data mpd; + long to_write; int ret; if (!get_block) @@ -1956,17 +1971,22 @@ static int mpage_da_writepages(struct address_space *mapping, mpd.first_page = 0; mpd.next_page = 0; mpd.get_block = get_block; + mpd.io_done = 0; + mpd.pages_written = 0; + + to_write = wbc->nr_to_write; ret = write_cache_pages(mapping, wbc, __mpage_da_writepage, &mpd); /* * Handle last extent of pages */ - if (mpd.next_page != mpd.first_page) { + if (!mpd.io_done && mpd.next_page != mpd.first_page) { mpage_da_map_blocks(&mpd); mpage_da_submit_io(&mpd); } + wbc->nr_to_write = to_write - mpd.pages_written; return ret; } @@ -2178,10 +2198,7 @@ static int ext4_da_writepages(struct address_space *mapping, int ret = 0; long to_write; loff_t range_start = 0; - int blocks_per_page = PAGE_CACHE_SIZE >> inode->i_blkbits; - int max_credit_blocks = ext4_journal_max_transaction_buffers(inode); - int need_credits_per_page = ext4_writepages_trans_blocks(inode, 1); - int max_writeback_pages = (max_credit_blocks / blocks_per_page) / need_credits_per_page; + long pages_skipped = 0; /* * No pages to write? This is mainly a kludge to avoid starting @@ -2191,11 +2208,6 @@ static int ext4_da_writepages(struct address_space *mapping, if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) return 0; - if (wbc->nr_to_write > mapping->nrpages) - wbc->nr_to_write = mapping->nrpages; - - to_write = wbc->nr_to_write; - if (!wbc->range_cyclic) { /* * If range_cyclic is not set force range_cont @@ -2204,32 +2216,27 @@ static int ext4_da_writepages(struct address_space *mapping, wbc->range_cont = 1; range_start = wbc->range_start; } + pages_skipped = wbc->pages_skipped; - while (!ret && to_write) { - /* - * set the max dirty pages could be write at a time - * to fit into the reserved transaction credits - */ - if (wbc->nr_to_write > max_writeback_pages) - wbc->nr_to_write = max_writeback_pages; +restart_loop: + to_write = wbc->nr_to_write; + while (!ret && to_write > 0) { /* - * Estimate the worse case needed credits to write out - * to_write pages + * we insert one extent at a time. So we need + * credit needed for single extent allocation. + * journalled mode is currently not supported + * by delalloc */ - needed_blocks = ext4_writepages_trans_blocks(inode, - wbc->nr_to_write); - while (needed_blocks > max_credit_blocks) { - wbc->nr_to_write --; - needed_blocks = ext4_writepages_trans_blocks(inode, - wbc->nr_to_write); - } + BUG_ON(ext4_should_journal_data(inode)); + needed_blocks = EXT4_DATA_TRANS_BLOCKS(inode->i_sb); + /* start a new transaction*/ handle = ext4_journal_start(inode, needed_blocks); if (IS_ERR(handle)) { ret = PTR_ERR(handle); - printk(KERN_EMERG "%s: Not enough credits to flush %ld pages\n", __func__, - wbc->nr_to_write); + printk(KERN_EMERG "%s: Not enough credits to flush %ld pages %d\n", + __func__, wbc->nr_to_write , ret); dump_stack(); goto out_writepages; } @@ -2251,7 +2258,14 @@ static int ext4_da_writepages(struct address_space *mapping, ret = mpage_da_writepages(mapping, wbc, ext4_da_get_block_write); ext4_journal_stop(handle); - if (wbc->nr_to_write) { + if (ret == MPAGE_DA_EXTENT_TAIL) { + /* + * got one extent now try with + * rest of the pages + */ + to_write += wbc->nr_to_write; + ret = 0; + } else if (wbc->nr_to_write) { /* * There is no more writeout needed * or we requested for a noblocking writeout @@ -2263,6 +2277,15 @@ static int ext4_da_writepages(struct address_space *mapping, wbc->nr_to_write = to_write; } + if (wbc->range_cont && (pages_skipped != wbc->pages_skipped)) { + /* We skipped pages in this loop */ + wbc->range_start = range_start; + wbc->nr_to_write = to_write + + wbc->pages_skipped - pages_skipped; + wbc->pages_skipped = pages_skipped; + goto restart_loop; + } + out_writepages: wbc->nr_to_write = to_write; if (range_start) ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: Problem with delayed allocation 2008-08-04 16:35 ` Aneesh Kumar K.V @ 2008-08-05 6:44 ` Theodore Tso 2008-08-05 6:52 ` Aneesh Kumar K.V 0 siblings, 1 reply; 17+ messages in thread From: Theodore Tso @ 2008-08-05 6:44 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: linux-ext4 On Mon, Aug 04, 2008 at 10:05:05PM +0530, Aneesh Kumar K.V wrote: > > This is the complete patch that I have. I haven't fully tested it > (right now waiting for the machine to be free). This should apply > after stable-boundary-undo.patch Umm... the patch doesn't apply right after the stable boundary udo patch. - Ted ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Problem with delayed allocation 2008-08-05 6:44 ` Theodore Tso @ 2008-08-05 6:52 ` Aneesh Kumar K.V 2008-08-05 13:21 ` Aneesh Kumar K.V 0 siblings, 1 reply; 17+ messages in thread From: Aneesh Kumar K.V @ 2008-08-05 6:52 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-ext4 [-- Attachment #1: Type: text/plain, Size: 661 bytes --] On Tue, Aug 05, 2008 at 02:44:28AM -0400, Theodore Tso wrote: > On Mon, Aug 04, 2008 at 10:05:05PM +0530, Aneesh Kumar K.V wrote: > > > > This is the complete patch that I have. I haven't fully tested it > > (right now waiting for the machine to be free). This should apply > > after stable-boundary-undo.patch > > Umm... the patch doesn't apply right after the stable boundary udo > patch. > > - Ted I did a fresh git pull and updated the patch. I also accumulated few changes after words while testing on ABAT. Attaching both the patches below. The patches apply after ext4_journal_credits_fix_for_writepages.patch in the patch queue. -aneesh [-- Attachment #2: 1.patch --] [-- Type: text/x-diff, Size: 10933 bytes --] commit ac92572cf2bc6739b3a7cc7037a044e493cc6b4e Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Date: Tue Aug 5 09:09:44 2008 +0530 ext4: Rework the ext4_da_writepages With the below changes we reserve credit needed to insert only one extent resulting from a call to single get_block. That make sure we don't take too much journal credits during writeout. We also don't limit the pages to write. That means we loop through the dirty pages building largest possible contiguous block request. Then we issue a single get_block request. We may get less block that we requested. If so we would end up not mapping some of the buffer_heads. That means those buffer_heads are still marked delay. Later in the writepage callback via __mpage_writepage we redirty those pages. We should also not limit/throttle wbc->nr_to_write in the filesystem writepages callback. That cause wrong behaviour in generic_sync_sb_inodes caused by wbc->nr_to_write being <= 0 Also make sure fallocated area is handled properly by delayed allocation. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index c96cc0b..fb5aa7b 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -41,6 +41,8 @@ #include "acl.h" #include "ext4_extents.h" +#define MPAGE_DA_EXTENT_TAIL 0x01 + static inline int ext4_begin_ordered_truncate(struct inode *inode, loff_t new_size) { @@ -1604,6 +1606,8 @@ static void ext4_da_page_release_reservation(struct page *page, unsigned long first_page, next_page; /* extent of pages */ get_block_t *get_block; struct writeback_control *wbc; + int io_done; + long pages_written; }; /* @@ -1623,12 +1627,6 @@ static void ext4_da_page_release_reservation(struct page *page, static int mpage_da_submit_io(struct mpage_da_data *mpd) { struct address_space *mapping = mpd->inode->i_mapping; - struct mpage_data mpd_pp = { - .bio = NULL, - .last_block_in_bio = 0, - .get_block = mpd->get_block, - .use_writepage = 1, - }; int ret = 0, err, nr_pages, i; unsigned long index, end; struct pagevec pvec; @@ -1652,7 +1650,9 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd) break; index++; - err = __mpage_writepage(page, mpd->wbc, &mpd_pp); + err = mapping->a_ops->writepage(page, mpd->wbc); + if (!err) + mpd->pages_written++; /* * In error case, we have to continue because @@ -1664,8 +1664,6 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd) } pagevec_release(&pvec); } - if (mpd_pp.bio) - mpage_bio_submit(WRITE, mpd_pp.bio); return ret; } @@ -1732,6 +1730,13 @@ static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd, sector_t logical, if (buffer_delay(bh)) { bh->b_blocknr = pblock; clear_buffer_delay(bh); + bh->b_bdev = inode->i_sb->s_bdev; + } else if (buffer_unwritten(bh)) { + bh->b_blocknr = pblock; + clear_buffer_unwritten(bh); + set_buffer_mapped(bh); + set_buffer_new(bh); + bh->b_bdev = inode->i_sb->s_bdev; } else if (buffer_mapped(bh)) BUG_ON(bh->b_blocknr != pblock); @@ -1772,8 +1777,8 @@ static inline void __unmap_underlying_blocks(struct inode *inode, */ static void mpage_da_map_blocks(struct mpage_da_data *mpd) { + int err = 0; struct buffer_head *lbh = &mpd->lbh; - int err = 0, remain = lbh->b_size; sector_t next = lbh->b_blocknr; struct buffer_head new; @@ -1783,38 +1788,29 @@ static void mpage_da_map_blocks(struct mpage_da_data *mpd) if (buffer_mapped(lbh) && !buffer_delay(lbh)) return; - while (remain) { - new.b_state = lbh->b_state; - new.b_blocknr = 0; - new.b_size = remain; - err = mpd->get_block(mpd->inode, next, &new, 1); - if (err) { - /* - * Rather than implement own error handling - * here, we just leave remaining blocks - * unallocated and try again with ->writepage() - */ - break; - } - BUG_ON(new.b_size == 0); + new.b_state = lbh->b_state; + new.b_blocknr = 0; + new.b_size = lbh->b_size; + err = mpd->get_block(mpd->inode, next, &new, 1); + if (err) + return; + BUG_ON(new.b_size == 0); - if (buffer_new(&new)) - __unmap_underlying_blocks(mpd->inode, &new); + if (buffer_new(&new)) + __unmap_underlying_blocks(mpd->inode, &new); - /* - * If blocks are delayed marked, we need to - * put actual blocknr and drop delayed bit - */ - if (buffer_delay(lbh)) - mpage_put_bnr_to_bhs(mpd, next, &new); + /* + * If blocks are delayed marked, we need to + * put actual blocknr and drop delayed bit + */ + if (buffer_delay(lbh) || buffer_unwritten(lbh)) + mpage_put_bnr_to_bhs(mpd, next, &new); - /* go for the remaining blocks */ - next += new.b_size >> mpd->inode->i_blkbits; - remain -= new.b_size; - } + return; } -#define BH_FLAGS ((1 << BH_Uptodate) | (1 << BH_Mapped) | (1 << BH_Delay)) +#define BH_FLAGS ((1 << BH_Uptodate) | (1 << BH_Mapped) | \ + (1 << BH_Delay) | (1 << BH_Unwritten)) /* * mpage_add_bh_to_extent - try to add one more block to extent of blocks @@ -1856,13 +1852,9 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd, * need to flush current extent and start new one */ mpage_da_map_blocks(mpd); - - /* - * Now start a new extent - */ - lbh->b_size = bh->b_size; - lbh->b_state = bh->b_state & BH_FLAGS; - lbh->b_blocknr = logical; + mpage_da_submit_io(mpd); + mpd->io_done = 1; + return; } /* @@ -1882,6 +1874,17 @@ static int __mpage_da_writepage(struct page *page, struct buffer_head *bh, *head, fake; sector_t logical; + if (mpd->io_done) { + /* + * Rest of the page in the page_vec + * redirty then and skip then. We will + * try to to write them again after + * starting a new transaction + */ + redirty_page_for_writepage(wbc, page); + unlock_page(page); + return MPAGE_DA_EXTENT_TAIL; + } /* * Can we merge this page to current extent? */ @@ -1893,6 +1896,13 @@ static int __mpage_da_writepage(struct page *page, if (mpd->next_page != mpd->first_page) { mpage_da_map_blocks(mpd); mpage_da_submit_io(mpd); + /* + * skip rest of the page in the page_vec + */ + mpd->io_done = 1; + redirty_page_for_writepage(wbc, page); + unlock_page(page); + return MPAGE_DA_EXTENT_TAIL; } /* @@ -1923,6 +1933,8 @@ static int __mpage_da_writepage(struct page *page, set_buffer_dirty(bh); set_buffer_uptodate(bh); mpage_add_bh_to_extent(mpd, logical, bh); + if (mpd->io_done) + return MPAGE_DA_EXTENT_TAIL; } else { /* * Page with regular buffer heads, just add all dirty ones @@ -1931,8 +1943,12 @@ static int __mpage_da_writepage(struct page *page, bh = head; do { BUG_ON(buffer_locked(bh)); - if (buffer_dirty(bh)) + if (buffer_dirty(bh) && + (!buffer_mapped(bh) || buffer_delay(bh))) { mpage_add_bh_to_extent(mpd, logical, bh); + if (mpd->io_done) + return MPAGE_DA_EXTENT_TAIL; + } logical++; } while ((bh = bh->b_this_page) != head); } @@ -1967,6 +1983,7 @@ static int mpage_da_writepages(struct address_space *mapping, get_block_t get_block) { struct mpage_da_data mpd; + long to_write; int ret; if (!get_block) @@ -1980,17 +1997,22 @@ static int mpage_da_writepages(struct address_space *mapping, mpd.first_page = 0; mpd.next_page = 0; mpd.get_block = get_block; + mpd.io_done = 0; + mpd.pages_written = 0; + + to_write = wbc->nr_to_write; ret = write_cache_pages(mapping, wbc, __mpage_da_writepage, &mpd); /* * Handle last extent of pages */ - if (mpd.next_page != mpd.first_page) { + if (!mpd.io_done && mpd.next_page != mpd.first_page) { mpage_da_map_blocks(&mpd); mpage_da_submit_io(&mpd); } + wbc->nr_to_write = to_write - mpd.pages_written; return ret; } @@ -2202,10 +2224,7 @@ static int ext4_da_writepages(struct address_space *mapping, int ret = 0; long to_write; loff_t range_start = 0; - int blocks_per_page = PAGE_CACHE_SIZE >> inode->i_blkbits; - int max_credit_blocks = ext4_journal_max_transaction_buffers(inode); - int need_credits_per_page = ext4_writepages_trans_blocks(inode, 1); - int max_writeback_pages = (max_credit_blocks / blocks_per_page) / need_credits_per_page; + long pages_skipped = 0; /* * No pages to write? This is mainly a kludge to avoid starting @@ -2215,11 +2234,6 @@ static int ext4_da_writepages(struct address_space *mapping, if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) return 0; - if (wbc->nr_to_write > mapping->nrpages) - wbc->nr_to_write = mapping->nrpages; - - to_write = wbc->nr_to_write; - if (!wbc->range_cyclic) { /* * If range_cyclic is not set force range_cont @@ -2228,26 +2242,21 @@ static int ext4_da_writepages(struct address_space *mapping, wbc->range_cont = 1; range_start = wbc->range_start; } + pages_skipped = wbc->pages_skipped; - while (!ret && to_write) { - /* - * set the max dirty pages could be write at a time - * to fit into the reserved transaction credits - */ - if (wbc->nr_to_write > max_writeback_pages) - wbc->nr_to_write = max_writeback_pages; +restart_loop: + to_write = wbc->nr_to_write; + while (!ret && to_write > 0) { /* - * Estimate the worse case needed credits to write out - * to_write pages + * we insert one extent at a time. So we need + * credit needed for single extent allocation. + * journalled mode is currently not supported + * by delalloc */ - needed_blocks = ext4_writepages_trans_blocks(inode, - wbc->nr_to_write); - while (needed_blocks > max_credit_blocks) { - wbc->nr_to_write--; - needed_blocks = ext4_writepages_trans_blocks(inode, - wbc->nr_to_write); - } + BUG_ON(ext4_should_journal_data(inode)); + needed_blocks = EXT4_DATA_TRANS_BLOCKS(inode->i_sb); + /* start a new transaction*/ handle = ext4_journal_start(inode, needed_blocks); if (IS_ERR(handle)) { @@ -2276,7 +2285,14 @@ static int ext4_da_writepages(struct address_space *mapping, ret = mpage_da_writepages(mapping, wbc, ext4_da_get_block_write); ext4_journal_stop(handle); - if (wbc->nr_to_write) { + if (ret == MPAGE_DA_EXTENT_TAIL) { + /* + * got one extent now try with + * rest of the pages + */ + to_write += wbc->nr_to_write; + ret = 0; + } else if (wbc->nr_to_write) { /* * There is no more writeout needed * or we requested for a noblocking writeout @@ -2288,6 +2304,15 @@ static int ext4_da_writepages(struct address_space *mapping, wbc->nr_to_write = to_write; } + if (wbc->range_cont && (pages_skipped != wbc->pages_skipped)) { + /* We skipped pages in this loop */ + wbc->range_start = range_start; + wbc->nr_to_write = to_write + + wbc->pages_skipped - pages_skipped; + wbc->pages_skipped = pages_skipped; + goto restart_loop; + } + out_writepages: wbc->nr_to_write = to_write; if (range_start) [-- Attachment #3: 2.patch --] [-- Type: text/x-diff, Size: 969 bytes --] diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 8d62200..023e1a8 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1790,6 +1790,13 @@ static void mpage_da_map_blocks(struct mpage_da_data *mpd) new.b_state = lbh->b_state; new.b_blocknr = 0; new.b_size = lbh->b_size; + + /* + * If we didn't accumulate anything + * to write simply return + */ + if (!new.b_size) + return; err = mpd->get_block(mpd->inode, next, &new, 1); if (err) return; diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 25adfc3..a7db10c 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -517,8 +517,12 @@ void generic_sync_sb_inodes(struct super_block *sb, cond_resched(); spin_lock(&inode_lock); if (wbc->nr_to_write <= 0) { - wbc->more_io = 1; - break; + if (wbc->sync_mode == WB_SYNC_ALL) { + wbc->nr_to_write = LONG_MAX; + } else { + wbc->more_io = 1; + break; + } } if (!list_empty(&sb->s_more_io)) wbc->more_io = 1; ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: Problem with delayed allocation 2008-08-05 6:52 ` Aneesh Kumar K.V @ 2008-08-05 13:21 ` Aneesh Kumar K.V 2008-08-05 13:47 ` Theodore Tso 2008-08-06 10:05 ` Aneesh Kumar K.V 0 siblings, 2 replies; 17+ messages in thread From: Aneesh Kumar K.V @ 2008-08-05 13:21 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-ext4 On Tue, Aug 05, 2008 at 12:22:17PM +0530, Aneesh Kumar K.V wrote: > On Tue, Aug 05, 2008 at 02:44:28AM -0400, Theodore Tso wrote: > > On Mon, Aug 04, 2008 at 10:05:05PM +0530, Aneesh Kumar K.V wrote: > > > > > > This is the complete patch that I have. I haven't fully tested it > > > (right now waiting for the machine to be free). This should apply > > > after stable-boundary-undo.patch > > > > Umm... the patch doesn't apply right after the stable boundary udo > > patch. > > > > - Ted > > I did a fresh git pull and updated the patch. I also accumulated few > changes after words while testing on ABAT. Attaching both the patches > below. The patches apply after ext4_journal_credits_fix_for_writepages.patch > in the patch queue. I still see the problem with the below changes. Now that i have read the writeback path more closely I am not sure how it will guarantee that all dirty pages of the inode are written back to disk before generic_sync_sb_inodes return. ..... .... > @@ -2202,10 +2224,7 @@ static int ext4_da_writepages(struct address_space *mapping, > int ret = 0; > long to_write; > loff_t range_start = 0; > - int blocks_per_page = PAGE_CACHE_SIZE >> inode->i_blkbits; > - int max_credit_blocks = ext4_journal_max_transaction_buffers(inode); > - int need_credits_per_page = ext4_writepages_trans_blocks(inode, 1); > - int max_writeback_pages = (max_credit_blocks / blocks_per_page) / need_credits_per_page; > + long pages_skipped = 0; > > /* > * No pages to write? This is mainly a kludge to avoid starting > @@ -2215,11 +2234,6 @@ static int ext4_da_writepages(struct address_space *mapping, > if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) > return 0; > > - if (wbc->nr_to_write > mapping->nrpages) > - wbc->nr_to_write = mapping->nrpages; > - > - to_write = wbc->nr_to_write; > - > if (!wbc->range_cyclic) { > /* > * If range_cyclic is not set force range_cont > @@ -2228,26 +2242,21 @@ static int ext4_da_writepages(struct address_space *mapping, > wbc->range_cont = 1; > range_start = wbc->range_start; > } > + pages_skipped = wbc->pages_skipped; > > - while (!ret && to_write) { > - /* > - * set the max dirty pages could be write at a time > - * to fit into the reserved transaction credits > - */ > - if (wbc->nr_to_write > max_writeback_pages) > - wbc->nr_to_write = max_writeback_pages; > +restart_loop: > + to_write = wbc->nr_to_write; > + while (!ret && to_write > 0) { > .... ..... > * or we requested for a noblocking writeout > @@ -2288,6 +2304,15 @@ static int ext4_da_writepages(struct address_space *mapping, > wbc->nr_to_write = to_write; > } > > + if (wbc->range_cont && (pages_skipped != wbc->pages_skipped)) { > + /* We skipped pages in this loop */ > + wbc->range_start = range_start; > + wbc->nr_to_write = to_write + > + wbc->pages_skipped - pages_skipped; > + wbc->pages_skipped = pages_skipped; > + goto restart_loop; > + } > + This should not be needed. I was trying to force the pages to writeback. generic_sync_sb_inodes actually move the inode to s_dirty if the pages_skipped differ after a writeback. But the confusing part is we are not looking at s_dirty list again. We move s_dirty and s_more_io to s_io only once in queue_io > out_writepages: > wbc->nr_to_write = to_write; > if (range_start) > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 8d62200..023e1a8 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -1790,6 +1790,13 @@ static void mpage_da_map_blocks(struct mpage_da_data *mpd) > new.b_state = lbh->b_state; > new.b_blocknr = 0; > new.b_size = lbh->b_size; > + > + /* > + * If we didn't accumulate anything > + * to write simply return > + */ > + if (!new.b_size) > + return; > err = mpd->get_block(mpd->inode, next, &new, 1); > if (err) > return; > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 25adfc3..a7db10c 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -517,8 +517,12 @@ void generic_sync_sb_inodes(struct super_block *sb, > cond_resched(); > spin_lock(&inode_lock); > if (wbc->nr_to_write <= 0) { > - wbc->more_io = 1; > - break; > + if (wbc->sync_mode == WB_SYNC_ALL) { > + wbc->nr_to_write = LONG_MAX; > + } else { > + wbc->more_io = 1; > + break; > + } > } > if (!list_empty(&sb->s_more_io)) > wbc->more_io = 1; This also should not be done. I guess we need to look at core writeback code more closely. -aneesh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Problem with delayed allocation 2008-08-05 13:21 ` Aneesh Kumar K.V @ 2008-08-05 13:47 ` Theodore Tso 2008-08-05 14:24 ` Aneesh Kumar K.V 2008-08-06 10:05 ` Aneesh Kumar K.V 1 sibling, 1 reply; 17+ messages in thread From: Theodore Tso @ 2008-08-05 13:47 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: linux-ext4 On Tue, Aug 05, 2008 at 06:51:33PM +0530, Aneesh Kumar K.V wrote: > This should not be needed. I was trying to force the pages to writeback. > generic_sync_sb_inodes actually move the inode to s_dirty if the > pages_skipped differ after a writeback. But the confusing part is we > are not looking at s_dirty list again. We move s_dirty and s_more_io to s_io > only once in queue_io Yes, but ext4_da_writepages() gets called twice in the __fsync_super() code path, right? Once with wbc->sync_mode set to WB_SYNC_HOLD, and once with wbc->sync_mode set to wbc->sync_mode set to WB_SYNC_ALL, corresponding to sync_inodes_sb() getting called twice, once with wait=0 and once with wait=1. - Ted ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Problem with delayed allocation 2008-08-05 13:47 ` Theodore Tso @ 2008-08-05 14:24 ` Aneesh Kumar K.V 2008-08-05 15:16 ` Theodore Tso 0 siblings, 1 reply; 17+ messages in thread From: Aneesh Kumar K.V @ 2008-08-05 14:24 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-ext4 On Tue, Aug 05, 2008 at 09:47:23AM -0400, Theodore Tso wrote: > On Tue, Aug 05, 2008 at 06:51:33PM +0530, Aneesh Kumar K.V wrote: > > This should not be needed. I was trying to force the pages to writeback. > > generic_sync_sb_inodes actually move the inode to s_dirty if the > > pages_skipped differ after a writeback. But the confusing part is we > > are not looking at s_dirty list again. We move s_dirty and s_more_io to s_io > > only once in queue_io > > Yes, but ext4_da_writepages() gets called twice in the __fsync_super() > code path, right? Once with wbc->sync_mode set to WB_SYNC_HOLD, and > once with wbc->sync_mode set to wbc->sync_mode set to WB_SYNC_ALL, > corresponding to sync_inodes_sb() getting called twice, once with > wait=0 and once with wait=1. > But we would still can have pages skipped in the second call to ext4_da_writepages(). But this make me wonder how xfs is doing delalloc. Also this should be possible in other file systems too. The delayed allocation logic is just exposing it much easily. sync_inodes_sb(sb, 0); generic_sync_sb_inodes write 10 pages and moves 10 to pages skipped. move the inode to s_dirty. sync_inodes_sb(sb, 1); generic_sync_sb_inodes move s_dirty to s_io write 10 pages and move 5 pages to skipped list move inode to s_dirty. I guess sync_inodes_sb() should ensure that all dirty pages are written to the disk. And currently i can see may ways in which generic_sync_sb_inodes fails to do that. generic_sync_sb_inodes is suitable for pdflush work function which get called periodically But for __fsync_super i guess we need a different API which ensures that all the dirty pages are synced to the disk. -aneesh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Problem with delayed allocation 2008-08-05 14:24 ` Aneesh Kumar K.V @ 2008-08-05 15:16 ` Theodore Tso 0 siblings, 0 replies; 17+ messages in thread From: Theodore Tso @ 2008-08-05 15:16 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: linux-ext4 On Tue, Aug 05, 2008 at 07:54:03PM +0530, Aneesh Kumar K.V wrote: > But we would still can have pages skipped in the second call to > ext4_da_writepages(). But this make me wonder how xfs is doing > delalloc. I just checked XFS, and it does the right thing. See below for my tests. The two interesting things of note is that XFS takes a lot longer (5 seconds vs 0.293 seconds) to do the unmount, so they are definitely doing something right to wait for the dellayed allocations to get mapped and written to disk. The second thing of note is that ext4 is currently beating XFS at the totally meaningless reiser4 benchmark (aka untar a kernel source tree :-), which we can do in 28 seconds versus XFS's 31 seconds. So for this test, we're 12% faster (20% faster if we include the time taken by the remount read-only step), but we're losing 9% of the data. :-/ - Ted <tytso.root@closure> {/}, level 2 270# /sbin/mkfs.xfs -f /dev/thunk/testbench; mount /dev/thunk/testbench /mnt; cd /mnt; time tar xjf /usr/projects/linux/linux-2.6.26-3495-gf303489.tar.bz2 ; time mount -o remount,ro /mnt; cd ..; du -s /mnt; umount /mnt; mount /dev/thunk/testbench /mnt; du -s /mnt; umount /mnt meta-data=/dev/thunk/testbench isize=256 agcount=8, agsize=163840 blks = sectsz=512 attr=0 data = bsize=4096 blocks=1310720, imaxpct=25 = sunit=0 swidth=0 blks, unwritten=1 naming =version 2 bsize=4096 log =internal log bsize=4096 blocks=2560, version=1 = sectsz=512 sunit=0 blks, lazy-count=0 realtime =none extsz=4096 blocks=0, rtextents=0 real 0m31.060s user 0m19.965s sys 0m8.323s real 0m5.263s user 0m0.000s sys 0m0.847s 320872 /mnt 320872 /mnt <tytso.root@closure> {/}, level 2 271# /sbin/mkfs.ext4dev /dev/thunk/testbench; mount /dev/thunk/testbench /mnt; cd /mnt; time tar xjf /usr/projects/linux/linux-2.6.26-3495-gf303489.tar.bz2 ; time mount -o remount,ro /mnt; cd ..; du -s /mnt; umount /mnt; mount /dev/thunk/testbench /mnt; du -s /mnt; umount /mnt mke2fs 1.41.0 (10-Jul-2008) Filesystem label= OS type: Linux Block size=4096 (log=2) Fragment size=4096 (log=2) 327680 inodes, 1310720 blocks 65536 blocks (5.00%) reserved for the super user First data block=0 Maximum filesystem blocks=1342177280 40 block groups 32768 blocks per group, 32768 fragments per group 8192 inodes per group Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736 Writing inode tables: done Creating journal (32768 blocks): done Writing superblocks and filesystem accounting information: done This filesystem will be automatically checked every 32 mounts or 180 days, whichever comes first. Use tune2fs -c or -i to override. real 0m28.125s user 0m18.545s sys 0m8.983s real 0m0.293s user 0m0.000s sys 0m0.093s 323736 /mnt 284332 /mnt ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Problem with delayed allocation 2008-08-05 13:21 ` Aneesh Kumar K.V 2008-08-05 13:47 ` Theodore Tso @ 2008-08-06 10:05 ` Aneesh Kumar K.V 2008-08-06 10:11 ` Aneesh Kumar K.V 1 sibling, 1 reply; 17+ messages in thread From: Aneesh Kumar K.V @ 2008-08-06 10:05 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-ext4 On Tue, Aug 05, 2008 at 06:51:33PM +0530, Aneesh Kumar K.V wrote: > On Tue, Aug 05, 2008 at 12:22:17PM +0530, Aneesh Kumar K.V wrote: > > On Tue, Aug 05, 2008 at 02:44:28AM -0400, Theodore Tso wrote: > > > On Mon, Aug 04, 2008 at 10:05:05PM +0530, Aneesh Kumar K.V wrote: > > > > > > > > This is the complete patch that I have. I haven't fully tested it > > > > (right now waiting for the machine to be free). This should apply > > > > after stable-boundary-undo.patch > > > > > > Umm... the patch doesn't apply right after the stable boundary udo > > > patch. > > > > > > - Ted > > > > I did a fresh git pull and updated the patch. I also accumulated few > > changes after words while testing on ABAT. Attaching both the patches > > below. The patches apply after ext4_journal_credits_fix_for_writepages.patch > > in the patch queue. > > I still see the problem with the below changes. Now that i have read > the writeback path more closely I am not sure how it will guarantee > that all dirty pages of the inode are written back to disk before > generic_sync_sb_inodes return. > Below changes fixed it for me. range_start was used in the loop in generic_sync_sb_inodes and since we were checking for !range_start we missed restoring wbc->range_start for range_start == 0. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 472a6a7..ea1a8db 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2241,14 +2241,14 @@ static int ext4_da_writepages(struct address_space *mapping, if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) return 0; - if (!wbc->range_cyclic) { + if (!wbc->range_cyclic) /* * If range_cyclic is not set force range_cont * and save the old writeback_index */ wbc->range_cont = 1; - range_start = wbc->range_start; - } + + range_start = wbc->range_start; pages_skipped = wbc->pages_skipped; restart_loop: @@ -2322,8 +2322,7 @@ static int ext4_da_writepages(struct address_space *mapping, out_writepages: wbc->nr_to_write = to_write; - if (range_start) - wbc->range_start = range_start; + wbc->range_start = range_start; return ret; } ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: Problem with delayed allocation 2008-08-06 10:05 ` Aneesh Kumar K.V @ 2008-08-06 10:11 ` Aneesh Kumar K.V 2008-08-07 0:49 ` Mingming Cao 0 siblings, 1 reply; 17+ messages in thread From: Aneesh Kumar K.V @ 2008-08-06 10:11 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-ext4 On Wed, Aug 06, 2008 at 03:35:53PM +0530, Aneesh Kumar K.V wrote: > On Tue, Aug 05, 2008 at 06:51:33PM +0530, Aneesh Kumar K.V wrote: > > On Tue, Aug 05, 2008 at 12:22:17PM +0530, Aneesh Kumar K.V wrote: > > > On Tue, Aug 05, 2008 at 02:44:28AM -0400, Theodore Tso wrote: > > > > On Mon, Aug 04, 2008 at 10:05:05PM +0530, Aneesh Kumar K.V wrote: > > > > > > > > > > This is the complete patch that I have. I haven't fully tested it > > > > > (right now waiting for the machine to be free). This should apply > > > > > after stable-boundary-undo.patch > > > > > > > > Umm... the patch doesn't apply right after the stable boundary udo > > > > patch. > > > > > > > > - Ted > > > > > > I did a fresh git pull and updated the patch. I also accumulated few > > > changes after words while testing on ABAT. Attaching both the patches > > > below. The patches apply after ext4_journal_credits_fix_for_writepages.patch > > > in the patch queue. > > > > I still see the problem with the below changes. Now that i have read > > the writeback path more closely I am not sure how it will guarantee > > that all dirty pages of the inode are written back to disk before > > generic_sync_sb_inodes return. > > > > Below changes fixed it for me. range_start was used in the loop in > generic_sync_sb_inodes and since we were checking for !range_start > we missed restoring wbc->range_start for range_start == 0. > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 472a6a7..ea1a8db 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -2241,14 +2241,14 @@ static int ext4_da_writepages(struct address_space *mapping, > if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) > return 0; > > - if (!wbc->range_cyclic) { > + if (!wbc->range_cyclic) > /* > * If range_cyclic is not set force range_cont > * and save the old writeback_index > */ > wbc->range_cont = 1; > - range_start = wbc->range_start; > - } > + > + range_start = wbc->range_start; > pages_skipped = wbc->pages_skipped; > > restart_loop: > @@ -2322,8 +2322,7 @@ static int ext4_da_writepages(struct address_space *mapping, > > out_writepages: > wbc->nr_to_write = to_write; > - if (range_start) > - wbc->range_start = range_start; > + wbc->range_start = range_start; > return ret; > } > complete patch below commit 912553855e83ed3bc9e9b3ee025d7df03c98848b Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Date: Wed Aug 6 15:36:49 2008 +0530 ext4: Rework the ext4_da_writepages With the below changes we reserve credit needed to insert only one extent resulting from a call to single get_block. That make sure we don't take too much journal credits during writeout. We also don't limit the pages to write. That means we loop through the dirty pages building largest possible contiguous block request. Then we issue a single get_block request. We may get less block that we requested. If so we would end up not mapping some of the buffer_heads. That means those buffer_heads are still marked delay. Later in the writepage callback via __mpage_writepage we redirty those pages. We should also not limit/throttle wbc->nr_to_write in the filesystem writepages callback. That cause wrong behaviour in generic_sync_sb_inodes caused by wbc->nr_to_write being <= 0 Also make sure fallocated area is handled properly by delayed allocation. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index c96cc0b..ea1a8db 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -41,6 +41,8 @@ #include "acl.h" #include "ext4_extents.h" +#define MPAGE_DA_EXTENT_TAIL 0x01 + static inline int ext4_begin_ordered_truncate(struct inode *inode, loff_t new_size) { @@ -1604,6 +1606,8 @@ static void ext4_da_page_release_reservation(struct page *page, unsigned long first_page, next_page; /* extent of pages */ get_block_t *get_block; struct writeback_control *wbc; + int io_done; + long pages_written; }; /* @@ -1623,12 +1627,6 @@ static void ext4_da_page_release_reservation(struct page *page, static int mpage_da_submit_io(struct mpage_da_data *mpd) { struct address_space *mapping = mpd->inode->i_mapping; - struct mpage_data mpd_pp = { - .bio = NULL, - .last_block_in_bio = 0, - .get_block = mpd->get_block, - .use_writepage = 1, - }; int ret = 0, err, nr_pages, i; unsigned long index, end; struct pagevec pvec; @@ -1652,7 +1650,9 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd) break; index++; - err = __mpage_writepage(page, mpd->wbc, &mpd_pp); + err = mapping->a_ops->writepage(page, mpd->wbc); + if (!err) + mpd->pages_written++; /* * In error case, we have to continue because @@ -1664,8 +1664,6 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd) } pagevec_release(&pvec); } - if (mpd_pp.bio) - mpage_bio_submit(WRITE, mpd_pp.bio); return ret; } @@ -1732,6 +1730,13 @@ static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd, sector_t logical, if (buffer_delay(bh)) { bh->b_blocknr = pblock; clear_buffer_delay(bh); + bh->b_bdev = inode->i_sb->s_bdev; + } else if (buffer_unwritten(bh)) { + bh->b_blocknr = pblock; + clear_buffer_unwritten(bh); + set_buffer_mapped(bh); + set_buffer_new(bh); + bh->b_bdev = inode->i_sb->s_bdev; } else if (buffer_mapped(bh)) BUG_ON(bh->b_blocknr != pblock); @@ -1772,8 +1777,8 @@ static inline void __unmap_underlying_blocks(struct inode *inode, */ static void mpage_da_map_blocks(struct mpage_da_data *mpd) { + int err = 0; struct buffer_head *lbh = &mpd->lbh; - int err = 0, remain = lbh->b_size; sector_t next = lbh->b_blocknr; struct buffer_head new; @@ -1783,38 +1788,36 @@ static void mpage_da_map_blocks(struct mpage_da_data *mpd) if (buffer_mapped(lbh) && !buffer_delay(lbh)) return; - while (remain) { - new.b_state = lbh->b_state; - new.b_blocknr = 0; - new.b_size = remain; - err = mpd->get_block(mpd->inode, next, &new, 1); - if (err) { - /* - * Rather than implement own error handling - * here, we just leave remaining blocks - * unallocated and try again with ->writepage() - */ - break; - } - BUG_ON(new.b_size == 0); + new.b_state = lbh->b_state; + new.b_blocknr = 0; + new.b_size = lbh->b_size; - if (buffer_new(&new)) - __unmap_underlying_blocks(mpd->inode, &new); + /* + * If we didn't accumulate anything + * to write simply return + */ + if (!new.b_size) + return; + err = mpd->get_block(mpd->inode, next, &new, 1); + if (err) + return; + BUG_ON(new.b_size == 0); - /* - * If blocks are delayed marked, we need to - * put actual blocknr and drop delayed bit - */ - if (buffer_delay(lbh)) - mpage_put_bnr_to_bhs(mpd, next, &new); + if (buffer_new(&new)) + __unmap_underlying_blocks(mpd->inode, &new); - /* go for the remaining blocks */ - next += new.b_size >> mpd->inode->i_blkbits; - remain -= new.b_size; - } + /* + * If blocks are delayed marked, we need to + * put actual blocknr and drop delayed bit + */ + if (buffer_delay(lbh) || buffer_unwritten(lbh)) + mpage_put_bnr_to_bhs(mpd, next, &new); + + return; } -#define BH_FLAGS ((1 << BH_Uptodate) | (1 << BH_Mapped) | (1 << BH_Delay)) +#define BH_FLAGS ((1 << BH_Uptodate) | (1 << BH_Mapped) | \ + (1 << BH_Delay) | (1 << BH_Unwritten)) /* * mpage_add_bh_to_extent - try to add one more block to extent of blocks @@ -1856,13 +1859,9 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd, * need to flush current extent and start new one */ mpage_da_map_blocks(mpd); - - /* - * Now start a new extent - */ - lbh->b_size = bh->b_size; - lbh->b_state = bh->b_state & BH_FLAGS; - lbh->b_blocknr = logical; + mpage_da_submit_io(mpd); + mpd->io_done = 1; + return; } /* @@ -1882,6 +1881,17 @@ static int __mpage_da_writepage(struct page *page, struct buffer_head *bh, *head, fake; sector_t logical; + if (mpd->io_done) { + /* + * Rest of the page in the page_vec + * redirty then and skip then. We will + * try to to write them again after + * starting a new transaction + */ + redirty_page_for_writepage(wbc, page); + unlock_page(page); + return MPAGE_DA_EXTENT_TAIL; + } /* * Can we merge this page to current extent? */ @@ -1893,6 +1903,13 @@ static int __mpage_da_writepage(struct page *page, if (mpd->next_page != mpd->first_page) { mpage_da_map_blocks(mpd); mpage_da_submit_io(mpd); + /* + * skip rest of the page in the page_vec + */ + mpd->io_done = 1; + redirty_page_for_writepage(wbc, page); + unlock_page(page); + return MPAGE_DA_EXTENT_TAIL; } /* @@ -1923,6 +1940,8 @@ static int __mpage_da_writepage(struct page *page, set_buffer_dirty(bh); set_buffer_uptodate(bh); mpage_add_bh_to_extent(mpd, logical, bh); + if (mpd->io_done) + return MPAGE_DA_EXTENT_TAIL; } else { /* * Page with regular buffer heads, just add all dirty ones @@ -1931,8 +1950,12 @@ static int __mpage_da_writepage(struct page *page, bh = head; do { BUG_ON(buffer_locked(bh)); - if (buffer_dirty(bh)) + if (buffer_dirty(bh) && + (!buffer_mapped(bh) || buffer_delay(bh))) { mpage_add_bh_to_extent(mpd, logical, bh); + if (mpd->io_done) + return MPAGE_DA_EXTENT_TAIL; + } logical++; } while ((bh = bh->b_this_page) != head); } @@ -1967,6 +1990,7 @@ static int mpage_da_writepages(struct address_space *mapping, get_block_t get_block) { struct mpage_da_data mpd; + long to_write; int ret; if (!get_block) @@ -1980,17 +2004,22 @@ static int mpage_da_writepages(struct address_space *mapping, mpd.first_page = 0; mpd.next_page = 0; mpd.get_block = get_block; + mpd.io_done = 0; + mpd.pages_written = 0; + + to_write = wbc->nr_to_write; ret = write_cache_pages(mapping, wbc, __mpage_da_writepage, &mpd); /* * Handle last extent of pages */ - if (mpd.next_page != mpd.first_page) { + if (!mpd.io_done && mpd.next_page != mpd.first_page) { mpage_da_map_blocks(&mpd); mpage_da_submit_io(&mpd); } + wbc->nr_to_write = to_write - mpd.pages_written; return ret; } @@ -2202,10 +2231,7 @@ static int ext4_da_writepages(struct address_space *mapping, int ret = 0; long to_write; loff_t range_start = 0; - int blocks_per_page = PAGE_CACHE_SIZE >> inode->i_blkbits; - int max_credit_blocks = ext4_journal_max_transaction_buffers(inode); - int need_credits_per_page = ext4_writepages_trans_blocks(inode, 1); - int max_writeback_pages = (max_credit_blocks / blocks_per_page) / need_credits_per_page; + long pages_skipped = 0; /* * No pages to write? This is mainly a kludge to avoid starting @@ -2215,39 +2241,29 @@ static int ext4_da_writepages(struct address_space *mapping, if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) return 0; - if (wbc->nr_to_write > mapping->nrpages) - wbc->nr_to_write = mapping->nrpages; - - to_write = wbc->nr_to_write; - - if (!wbc->range_cyclic) { + if (!wbc->range_cyclic) /* * If range_cyclic is not set force range_cont * and save the old writeback_index */ wbc->range_cont = 1; - range_start = wbc->range_start; - } - while (!ret && to_write) { - /* - * set the max dirty pages could be write at a time - * to fit into the reserved transaction credits - */ - if (wbc->nr_to_write > max_writeback_pages) - wbc->nr_to_write = max_writeback_pages; + range_start = wbc->range_start; + pages_skipped = wbc->pages_skipped; + +restart_loop: + to_write = wbc->nr_to_write; + while (!ret && to_write > 0) { /* - * Estimate the worse case needed credits to write out - * to_write pages + * we insert one extent at a time. So we need + * credit needed for single extent allocation. + * journalled mode is currently not supported + * by delalloc */ - needed_blocks = ext4_writepages_trans_blocks(inode, - wbc->nr_to_write); - while (needed_blocks > max_credit_blocks) { - wbc->nr_to_write--; - needed_blocks = ext4_writepages_trans_blocks(inode, - wbc->nr_to_write); - } + BUG_ON(ext4_should_journal_data(inode)); + needed_blocks = EXT4_DATA_TRANS_BLOCKS(inode->i_sb); + /* start a new transaction*/ handle = ext4_journal_start(inode, needed_blocks); if (IS_ERR(handle)) { @@ -2276,7 +2292,14 @@ static int ext4_da_writepages(struct address_space *mapping, ret = mpage_da_writepages(mapping, wbc, ext4_da_get_block_write); ext4_journal_stop(handle); - if (wbc->nr_to_write) { + if (ret == MPAGE_DA_EXTENT_TAIL) { + /* + * got one extent now try with + * rest of the pages + */ + to_write += wbc->nr_to_write; + ret = 0; + } else if (wbc->nr_to_write) { /* * There is no more writeout needed * or we requested for a noblocking writeout @@ -2288,10 +2311,18 @@ static int ext4_da_writepages(struct address_space *mapping, wbc->nr_to_write = to_write; } + if (wbc->range_cont && (pages_skipped != wbc->pages_skipped)) { + /* We skipped pages in this loop */ + wbc->range_start = range_start; + wbc->nr_to_write = to_write + + wbc->pages_skipped - pages_skipped; + wbc->pages_skipped = pages_skipped; + goto restart_loop; + } + out_writepages: wbc->nr_to_write = to_write; - if (range_start) - wbc->range_start = range_start; + wbc->range_start = range_start; return ret; } ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: Problem with delayed allocation 2008-08-06 10:11 ` Aneesh Kumar K.V @ 2008-08-07 0:49 ` Mingming Cao 0 siblings, 0 replies; 17+ messages in thread From: Mingming Cao @ 2008-08-07 0:49 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: Theodore Tso, linux-ext4 > > Below changes fixed it for me. range_start was used in the loop in > > generic_sync_sb_inodes and since we were checking for !range_start > > we missed restoring wbc->range_start for range_start == 0. > > I added this patch to patch queue for more testing. I will split the journal credit fix patch and drop some of the original fix for the writepages, that means this patch need to be update also ...but at least add to the queue for more testing Mingming ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2008-08-07 0:49 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-08-02 20:07 Problem with delayed allocation Theodore Ts'o 2008-08-02 22:40 ` Theodore Tso 2008-08-04 3:16 ` Aneesh Kumar K.V 2008-08-04 14:08 ` Theodore Tso 2008-08-04 14:52 ` Aneesh Kumar K.V 2008-08-04 15:27 ` Aneesh Kumar K.V 2008-08-04 15:33 ` Aneesh Kumar K.V 2008-08-04 16:35 ` Aneesh Kumar K.V 2008-08-05 6:44 ` Theodore Tso 2008-08-05 6:52 ` Aneesh Kumar K.V 2008-08-05 13:21 ` Aneesh Kumar K.V 2008-08-05 13:47 ` Theodore Tso 2008-08-05 14:24 ` Aneesh Kumar K.V 2008-08-05 15:16 ` Theodore Tso 2008-08-06 10:05 ` Aneesh Kumar K.V 2008-08-06 10:11 ` Aneesh Kumar K.V 2008-08-07 0:49 ` Mingming Cao
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox