* NILFS2 get stuck after bio_alloc() fail
@ 2009-06-14 1:32 Leandro Lucarella
2009-06-14 1:52 ` Alberto Bertogli
2009-06-14 3:45 ` Ryusuke Konishi
0 siblings, 2 replies; 14+ messages in thread
From: Leandro Lucarella @ 2009-06-14 1:32 UTC (permalink / raw)
To: linux-kernel; +Cc: Alberto Bertogli
Hi!
While testing nilfs2 (using 2.6.30) doing some "cp"s and "rm"s, I noticed
sometimes they got stucked in D state, and the kernel had said the
following message:
NILFS: IO error writing segment
A friend gave me a hand and after adding some printk()s we found out that
the problem seems to occur when bio_alloc()s inside nilfs_alloc_seg_bio()
fail, making it return NULL; but we don't know how that causes the
processes to get stucked.
PS: Please CC me.
--
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: NILFS2 get stuck after bio_alloc() fail 2009-06-14 1:32 NILFS2 get stuck after bio_alloc() fail Leandro Lucarella @ 2009-06-14 1:52 ` Alberto Bertogli 2009-06-14 6:30 ` Ryusuke Konishi 2009-06-14 3:45 ` Ryusuke Konishi 1 sibling, 1 reply; 14+ messages in thread From: Alberto Bertogli @ 2009-06-14 1:52 UTC (permalink / raw) To: Leandro Lucarella; +Cc: linux-kernel, konishi.ryusuke, users On Sat, Jun 13, 2009 at 10:32:11PM -0300, Leandro Lucarella wrote: > Hi! > > While testing nilfs2 (using 2.6.30) doing some "cp"s and "rm"s, I noticed > sometimes they got stucked in D state, and the kernel had said the > following message: > > NILFS: IO error writing segment > > A friend gave me a hand and after adding some printk()s we found out that > the problem seems to occur when bio_alloc()s inside nilfs_alloc_seg_bio() > fail, making it return NULL; but we don't know how that causes the > processes to get stucked. By the way, those bio_alloc()s are using GFP_NOWAIT but it looks like they could use at least GFP_NOIO or GFP_NOFS, since the caller can (and sometimes do) sleep. The only caller is nilfs_submit_bh(), which calls nilfs_submit_seg_bio() which can sleep calling wait_for_completion(). Is there something I'm missing? Thanks a lot, Alberto ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NILFS2 get stuck after bio_alloc() fail 2009-06-14 1:52 ` Alberto Bertogli @ 2009-06-14 6:30 ` Ryusuke Konishi 2009-06-14 7:00 ` Pekka Enberg 0 siblings, 1 reply; 14+ messages in thread From: Ryusuke Konishi @ 2009-06-14 6:30 UTC (permalink / raw) To: albertito; +Cc: llucax, linux-kernel, konishi.ryusuke, users Hi! On Sat, 13 Jun 2009 22:52:40 -0300, Alberto Bertogli wrote: > On Sat, Jun 13, 2009 at 10:32:11PM -0300, Leandro Lucarella wrote: >> Hi! >> >> While testing nilfs2 (using 2.6.30) doing some "cp"s and "rm"s, I noticed >> sometimes they got stucked in D state, and the kernel had said the >> following message: >> >> NILFS: IO error writing segment >> >> A friend gave me a hand and after adding some printk()s we found out that >> the problem seems to occur when bio_alloc()s inside nilfs_alloc_seg_bio() >> fail, making it return NULL; but we don't know how that causes the >> processes to get stucked. > > By the way, those bio_alloc()s are using GFP_NOWAIT but it looks like they > could use at least GFP_NOIO or GFP_NOFS, since the caller can (and sometimes > do) sleep. The only caller is nilfs_submit_bh(), which calls > nilfs_submit_seg_bio() which can sleep calling wait_for_completion(). Is there > something I'm missing? > > Thanks a lot, > Alberto The original GFP flag was GFP_NOIO, but replaced to GFP_NOWAIT at a preliminary release in February 2008. It was because a user experienced system memory shortage by the bio_alloc() call. Even though nilfs_alloc_seg_bio() repeatedly calls bio_alloc() reducing the number of bio vectors in case of failure, this fallback did not work well. I'm in two minds whether I should change it back to GFP_NOIO. Or should I switch the gfp as follows? Thanks, Ryusuke Konishi diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c index 1e68821..6b8f00a 100644 --- a/fs/nilfs2/segbuf.c +++ b/fs/nilfs2/segbuf.c @@ -306,6 +306,7 @@ static int nilfs_submit_seg_bio(struct nilfs_write_info *wi, int mode) * @sb: super block * @start: beginning disk block number of this BIO. * @nr_vecs: request size of page vector. + * @gfp_mask: gfp flags * * alloc_seg_bio() allocates a new BIO structure and initialize it. * @@ -313,14 +314,14 @@ static int nilfs_submit_seg_bio(struct nilfs_write_info *wi, int mode) * On error, NULL is returned. */ static struct bio *nilfs_alloc_seg_bio(struct super_block *sb, sector_t start, - int nr_vecs) + gfp_t gfp_mask, int nr_vecs) { struct bio *bio; - bio = bio_alloc(GFP_NOWAIT, nr_vecs); + bio = bio_alloc(gfp_mask, nr_vecs); if (bio == NULL) { while (!bio && (nr_vecs >>= 1)) - bio = bio_alloc(GFP_NOWAIT, nr_vecs); + bio = bio_alloc(gfp_mask, nr_vecs); } if (likely(bio)) { bio->bi_bdev = sb->s_bdev; @@ -353,9 +354,14 @@ static int nilfs_submit_bh(struct nilfs_write_info *wi, struct buffer_head *bh, repeat: if (!wi->bio) { wi->bio = nilfs_alloc_seg_bio(wi->sb, wi->blocknr + wi->end, - wi->nr_vecs); - if (unlikely(!wi->bio)) - return -ENOMEM; + GFP_NOWAIT, wi->nr_vecs); + if (unlikely(!wi->bio)) { + wi->bio = nilfs_alloc_seg_bio(wi->sb, + wi->blocknr + wi->end, + GFP_NOIO, 1); + if (!wi->bio) + return -ENOMEM; + } } len = bio_add_page(wi->bio, bh->b_page, bh->b_size, bh_offset(bh)); ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: NILFS2 get stuck after bio_alloc() fail 2009-06-14 6:30 ` Ryusuke Konishi @ 2009-06-14 7:00 ` Pekka Enberg 2009-06-14 12:34 ` Ryusuke Konishi 0 siblings, 1 reply; 14+ messages in thread From: Pekka Enberg @ 2009-06-14 7:00 UTC (permalink / raw) To: Ryusuke Konishi; +Cc: albertito, llucax, linux-kernel, users On Sun, Jun 14, 2009 at 9:30 AM, Ryusuke Konishi<konishi.ryusuke@lab.ntt.co.jp> wrote: > The original GFP flag was GFP_NOIO, but replaced to GFP_NOWAIT at a > preliminary release in February 2008. It was because a user > experienced system memory shortage by the bio_alloc() call. > > Even though nilfs_alloc_seg_bio() repeatedly calls bio_alloc() > reducing the number of bio vectors in case of failure, this fallback > did not work well. > > I'm in two minds whether I should change it back to GFP_NOIO. > Or should I switch the gfp as follows? As far as I can tell, the only difference with GFP_NOIO and GFP_NOWAIT here is that the former will trigger the mempool_alloc() page reclaim path. But I am not sure I understand why switching to GFP_NOWAIT helped with memory shortage. What exactly was the problem there? Pekka ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NILFS2 get stuck after bio_alloc() fail 2009-06-14 7:00 ` Pekka Enberg @ 2009-06-14 12:34 ` Ryusuke Konishi 0 siblings, 0 replies; 14+ messages in thread From: Ryusuke Konishi @ 2009-06-14 12:34 UTC (permalink / raw) To: penberg; +Cc: konishi.ryusuke, albertito, llucax, linux-kernel, users Hi, On Sun, 14 Jun 2009 10:00:06 +0300, Pekka Enberg <penberg@cs.helsinki.fi> wrote: > On Sun, Jun 14, 2009 at 9:30 AM, Ryusuke > Konishi<konishi.ryusuke@lab.ntt.co.jp> wrote: >> The original GFP flag was GFP_NOIO, but replaced to GFP_NOWAIT at a >> preliminary release in February 2008. It was because a user >> experienced system memory shortage by the bio_alloc() call. >> >> Even though nilfs_alloc_seg_bio() repeatedly calls bio_alloc() >> reducing the number of bio vectors in case of failure, this fallback >> did not work well. >> >> I'm in two minds whether I should change it back to GFP_NOIO. >> Or should I switch the gfp as follows? > > As far as I can tell, the only difference with GFP_NOIO and GFP_NOWAIT > here is that the former will trigger the mempool_alloc() page reclaim > path. But I am not sure I understand why switching to GFP_NOWAIT > helped with memory shortage. What exactly was the problem there? > > Pekka I will confirm the details later. I cannot dig into the record due to planned outage at office now. The version I replaced GFP_NOIO with GFP_NOWAIT is considerably old. A true problem might have been in other places somewhere. For reference, the retry loop in question is as follows. It is to allocate bio for log writing. bio = bio_alloc(GFP_NOWAIT, nr_vecs); if (bio == NULL) { while (!bio && (nr_vecs >>= 1)) bio = bio_alloc(GFP_NOWAIT, nr_vecs); } where nr_vecs is less than or equal to bio_get_nr_vecs(bdev). Thanks, Ryusuke Konishi ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NILFS2 get stuck after bio_alloc() fail 2009-06-14 1:32 NILFS2 get stuck after bio_alloc() fail Leandro Lucarella 2009-06-14 1:52 ` Alberto Bertogli @ 2009-06-14 3:45 ` Ryusuke Konishi 2009-06-14 15:32 ` Leandro Lucarella 1 sibling, 1 reply; 14+ messages in thread From: Ryusuke Konishi @ 2009-06-14 3:45 UTC (permalink / raw) To: llucax; +Cc: linux-kernel, albertito, users, llucax, konishi.ryusuke Hi, On Sat, 13 Jun 2009 22:32:11 -0300, Leandro Lucarella wrote: > Hi! > > While testing nilfs2 (using 2.6.30) doing some "cp"s and "rm"s, I noticed > sometimes they got stucked in D state, and the kernel had said the > following message: > > NILFS: IO error writing segment > > A friend gave me a hand and after adding some printk()s we found out that > the problem seems to occur when bio_alloc()s inside nilfs_alloc_seg_bio() > fail, making it return NULL; but we don't know how that causes the > processes to get stucked. > > PS: Please CC me. > -- > Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/ Thank you for reporting this issue. Could you get stack dump of the stuck nilfs task? It is acquirable as follows if you enabled magic sysrq feature: # echo t > /proc/sysrq-trigger I will dig into the process how it got stuck. P.S. nilfs.org and the nilfs mailing list is stopping due to planned power outage. They will be back on June 15 0:30 a.m. A backup site except the list is available on http://nilfs.sourceforge.net/en/ during the period. Regards, Ryusuke Konishi ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NILFS2 get stuck after bio_alloc() fail 2009-06-14 3:45 ` Ryusuke Konishi @ 2009-06-14 15:32 ` Leandro Lucarella 2009-06-14 18:02 ` Ryusuke Konishi 0 siblings, 1 reply; 14+ messages in thread From: Leandro Lucarella @ 2009-06-14 15:32 UTC (permalink / raw) To: Ryusuke Konishi; +Cc: linux-kernel, albertito, users Ryusuke Konishi, el 14 de junio a las 12:45 me escribiste: > Hi, > On Sat, 13 Jun 2009 22:32:11 -0300, Leandro Lucarella wrote: > > Hi! > > > > While testing nilfs2 (using 2.6.30) doing some "cp"s and "rm"s, I noticed > > sometimes they got stucked in D state, and the kernel had said the > > following message: > > > > NILFS: IO error writing segment > > > > A friend gave me a hand and after adding some printk()s we found out that > > the problem seems to occur when bio_alloc()s inside nilfs_alloc_seg_bio() > > fail, making it return NULL; but we don't know how that causes the > > processes to get stucked. > > Thank you for reporting this issue. > > Could you get stack dump of the stuck nilfs task? > It is acquirable as follows if you enabled magic sysrq feature: > > # echo t > /proc/sysrq-trigger > > I will dig into the process how it got stuck. Here is (what I thought it's) the important stuff: [...] kdmflush S dc5abf5c 0 1018 2 dc5abf84 00000046 dc60d780 dc5abf5c c01ad12e dd4d6ed0 dd4d7148 e3504d6e 00003c16 dc8b2560 dc5abf7c c040e24b dd846da0 dc60d7cc dd4d6ed0 dc5abf8c c040d628 dc5abfd0 c0131dbd dc7fe230 dd4d6ed0 dc5abfa8 dd4d6ed0 dd846da8 Call Trace: [<c01ad12e>] ? bio_fs_destructor+0xe/0x10 [<c040e24b>] ? down_write+0xb/0x30 [<c040d628>] schedule+0x8/0x20 [<c0131dbd>] worker_thread+0x16d/0x1e0 [<debcba30>] ? dm_wq_work+0x0/0x120 [dm_mod] [<c0135420>] ? autoremove_wake_function+0x0/0x50 [<c0131c50>] ? worker_thread+0x0/0x1e0 [<c0134fb3>] kthread+0x43/0x80 [<c0134f70>] ? kthread+0x0/0x80 [<c0103513>] kernel_thread_helper+0x7/0x14 [...] loop0 S dcc7bce0 0 15884 2 d7671f48 00000046 c01ad116 dcc7bce0 dcc7bca0 d4686590 d4686808 b50316ce 000003b8 dc7010a0 c01b0d4f c01b0cf0 dcc7bcec 0c7f3000 00000000 d7671f50 c040d628 d7671fd0 de85391c 00000000 00000000 00000000 dcbbd108 dcbbd000 Call Trace: [<c01ad116>] ? bio_free+0x46/0x50 [<c01b0d4f>] ? mpage_end_io_read+0x5f/0x70 [<c01b0cf0>] ? mpage_end_io_read+0x0/0x70 [<c040d628>] schedule+0x8/0x20 [<de85391c>] loop_thread+0x1cc/0x490 [loop] [<de853590>] ? do_lo_send_aops+0x0/0x1c0 [loop] [<c0135420>] ? autoremove_wake_function+0x0/0x50 [<de853750>] ? loop_thread+0x0/0x490 [loop] [<c0134fb3>] kthread+0x43/0x80 [<c0134f70>] ? kthread+0x0/0x80 [<c0103513>] kernel_thread_helper+0x7/0x14 segctord D 00000001 0 15886 2 d3847ef4 00000046 c011cefb 00000001 00000001 dcf48fd0 dcf49248 c052b9d0 d50962e4 dc701720 d46871dc d46871e4 c23f180c c23f180c d3847f28 d3847efc c040d628 d3847f20 c040ed3d c23f1810 dcf48fd0 d46871dc 00000000 c23f180c Call Trace: [<c011cefb>] ? dequeue_task_fair+0x27b/0x280 [<c040d628>] schedule+0x8/0x20 [<c040ed3d>] rwsem_down_failed_common+0x7d/0x180 [<c040ee5d>] rwsem_down_write_failed+0x1d/0x30 [<c040eeaa>] call_rwsem_down_write_failed+0x6/0x8 [<c040e25e>] ? down_write+0x1e/0x30 [<decb6299>] nilfs_transaction_lock+0x59/0x100 [nilfs2] [<decb6d5c>] nilfs_segctor_thread+0xcc/0x2e0 [nilfs2] [<decb6c80>] ? nilfs_construction_timeout+0x0/0x10 [nilfs2] [<decb6c90>] ? nilfs_segctor_thread+0x0/0x2e0 [nilfs2] [<c0134fb3>] kthread+0x43/0x80 [<c0134f70>] ? kthread+0x0/0x80 [<c0103513>] kernel_thread_helper+0x7/0x14 rm D d976bde0 0 16147 1 d976bdf0 00000086 003abc46 d976bde0 c013cc46 c18ad190 c18ad408 00000000 003abc46 dc789900 d976be38 d976bdf0 00000000 d976be30 d976be38 d976bdf8 c040d628 d976be00 c040d67a d976be08 c01668dd d976be24 c040dad7 c01668b0 Call Trace: [<c013cc46>] ? getnstimeofday+0x56/0x110 [<c040d628>] schedule+0x8/0x20 [<c040d67a>] io_schedule+0x3a/0x70 [<c01668dd>] sync_page+0x2d/0x60 [<c040dad7>] __wait_on_bit+0x47/0x70 [<c01668b0>] ? sync_page+0x0/0x60 [<c0166b08>] wait_on_page_bit+0x98/0xb0 [<c0135470>] ? wake_bit_function+0x0/0x60 [<c016f3e4>] truncate_inode_pages_range+0x244/0x360 [<c01a448c>] ? __mark_inode_dirty+0x2c/0x160 [<decb756c>] ? nilfs_transaction_commit+0x9c/0x170 [nilfs2] [<c040e27b>] ? down_read+0xb/0x20 [<c016f51a>] truncate_inode_pages+0x1a/0x20 [<deca3e9f>] nilfs_delete_inode+0x9f/0xd0 [nilfs2] [<deca3e00>] ? nilfs_delete_inode+0x0/0xd0 [nilfs2] [<c019c082>] generic_delete_inode+0x92/0x150 [<c019c1af>] generic_drop_inode+0x6f/0x1b0 [<c019b457>] iput+0x47/0x50 [<c0194763>] do_unlinkat+0xd3/0x160 [<c0197106>] ? vfs_readdir+0x66/0x90 [<c0196e00>] ? filldir64+0x0/0xf0 [<c01971c6>] ? sys_getdents64+0x96/0xb0 [<c0194913>] sys_unlinkat+0x23/0x50 [<c0102db5>] syscall_call+0x7/0xb umount D d06bbe6c 0 16727 1 d06bbe7c 00000086 d06bbe58 d06bbe6c c013cc46 dc5ef350 dc5ef5c8 00000000 022bb380 dc6503a0 d06bbec4 d06bbe7c 00000000 d06bbebc d06bbec4 d06bbe84 c040d628 d06bbe8c c040d67a d06bbe94 c01668dd d06bbeb0 c040dad7 c01668b0 Call Trace: [<c013cc46>] ? getnstimeofday+0x56/0x110 [<c040d628>] schedule+0x8/0x20 [<c040d67a>] io_schedule+0x3a/0x70 [<c01668dd>] sync_page+0x2d/0x60 [<c040dad7>] __wait_on_bit+0x47/0x70 [<c01668b0>] ? sync_page+0x0/0x60 [<c0166b08>] wait_on_page_bit+0x98/0xb0 [<c0135470>] ? wake_bit_function+0x0/0x60 [<c0167494>] wait_on_page_writeback_range+0xa4/0x110 [<c01675a0>] ? __filemap_fdatawrite_range+0x60/0x80 [<c0167534>] filemap_fdatawait+0x34/0x40 [<c016871b>] filemap_write_and_wait+0x3b/0x50 [<c01ae329>] sync_blockdev+0x19/0x20 [<c01a4365>] __sync_inodes+0x45/0x70 [<c01a439d>] sync_inodes+0xd/0x30 [<c01a70d7>] do_sync+0x17/0x70 [<c01a715d>] sys_sync+0xd/0x20 [<c0102db5>] syscall_call+0x7/0xb [...] 'rm' is the "original" stuck process, 'umount' got stuck after that, when I tried to umount the nilfs (it was mounted in a loop device). Here is the complete trace: http://pastebin.lugmen.org.ar/4931 Thank you. -- Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/ ---------------------------------------------------------------------------- GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05) ---------------------------------------------------------------------------- Don't take life to seriously, you won't get out alive ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NILFS2 get stuck after bio_alloc() fail 2009-06-14 15:32 ` Leandro Lucarella @ 2009-06-14 18:02 ` Ryusuke Konishi 2009-06-14 18:13 ` Leandro Lucarella 0 siblings, 1 reply; 14+ messages in thread From: Ryusuke Konishi @ 2009-06-14 18:02 UTC (permalink / raw) To: llucax; +Cc: konishi.ryusuke, linux-kernel, albertito, users, llucax Hi Leandro, On Sun, 14 Jun 2009 12:32:56 -0300, Leandro Lucarella wrote: > Ryusuke Konishi, el 14 de junio a las 12:45 me escribiste: >> Hi, >> On Sat, 13 Jun 2009 22:32:11 -0300, Leandro Lucarella wrote: >> > Hi! >> > >> > While testing nilfs2 (using 2.6.30) doing some "cp"s and "rm"s, I noticed >> > sometimes they got stucked in D state, and the kernel had said the >> > following message: >> > >> > NILFS: IO error writing segment >> > >> > A friend gave me a hand and after adding some printk()s we found out that >> > the problem seems to occur when bio_alloc()s inside nilfs_alloc_seg_bio() >> > fail, making it return NULL; but we don't know how that causes the >> > processes to get stucked. >> >> Thank you for reporting this issue. >> >> Could you get stack dump of the stuck nilfs task? >> It is acquirable as follows if you enabled magic sysrq feature: >> >> # echo t > /proc/sysrq-trigger >> >> I will dig into the process how it got stuck. > > Here is (what I thought it's) the important stuff: <snip> > 'rm' is the "original" stuck process, 'umount' got stuck after that, when I > tried to umount the nilfs (it was mounted in a loop device). > > Here is the complete trace: > http://pastebin.lugmen.org.ar/4931 Thank you for your help. According to your log, there seems to be a leakage in clear processing of the writeback flag on pages. I will review the error path of log writer to narrow down the cause. Regards, Ryusuke Konishi ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NILFS2 get stuck after bio_alloc() fail 2009-06-14 18:02 ` Ryusuke Konishi @ 2009-06-14 18:13 ` Leandro Lucarella 2009-06-18 17:34 ` [PATCH] nilfs2: fix hang problem after bio_alloc() failed Ryusuke Konishi 2009-06-23 4:40 ` NILFS2 get stuck after bio_alloc() fail Ryusuke Konishi 0 siblings, 2 replies; 14+ messages in thread From: Leandro Lucarella @ 2009-06-14 18:13 UTC (permalink / raw) To: Ryusuke Konishi; +Cc: konishi.ryusuke, linux-kernel, albertito, users Ryusuke Konishi, el 15 de junio a las 03:02 me escribiste: [snip] > > Here is the complete trace: > > http://pastebin.lugmen.org.ar/4931 > > Thank you for your help. > > According to your log, there seems to be a leakage in clear processing > of the writeback flag on pages. I will review the error path of log > writer to narrow down the cause. Oh! I forgot to tell you that the cleanerd process was not running. When I mounted the NILFS2 filesystem the cleanerd daemon said: nilfs_cleanerd[29534]: start nilfs_cleanerd[29534]: cannot create cleanerd on /dev/loop0 nilfs_cleanerd[29534]: shutdown I thought I might have an old utilities version and that could be because of that (since the nilfs website was down I couldn't check), but now I checked in the backup nilfs website that latest version is 2.0.12 and I have that (Debian package), so I don't know why the cleanerd failed to start in the first place. -- Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/ ---------------------------------------------------------------------------- GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05) ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH] nilfs2: fix hang problem after bio_alloc() failed 2009-06-14 18:13 ` Leandro Lucarella @ 2009-06-18 17:34 ` Ryusuke Konishi 2009-06-21 11:08 ` Andi Kleen 2009-06-23 4:40 ` NILFS2 get stuck after bio_alloc() fail Ryusuke Konishi 1 sibling, 1 reply; 14+ messages in thread From: Ryusuke Konishi @ 2009-06-18 17:34 UTC (permalink / raw) To: Leandro Lucarella; +Cc: konishi.ryusuke, linux-kernel, albertito, users Hi Leandro, On Sun, 14 Jun 2009 15:13:13 -0300, Leandro Lucarella wrote: > Ryusuke Konishi, el 15 de junio a las 03:02 me escribiste: > [snip] > > > Here is the complete trace: > > > http://pastebin.lugmen.org.ar/4931 > > > > Thank you for your help. > > > > According to your log, there seems to be a leakage in clear processing > > of the writeback flag on pages. I will review the error path of log > > writer to narrow down the cause. Here is a patch that hopefully fixes the hang problem after bio_alloc() failed. This is composed of three separate patches, but I now attach them together. If you still see problems with it, please let me know. > Oh! I forgot to tell you that the cleanerd process was not running. When > I mounted the NILFS2 filesystem the cleanerd daemon said: > > nilfs_cleanerd[29534]: start > nilfs_cleanerd[29534]: cannot create cleanerd on /dev/loop0 > nilfs_cleanerd[29534]: shutdown > > I thought I might have an old utilities version and that could be because > of that (since the nilfs website was down I couldn't check), but now > I checked in the backup nilfs website that latest version is 2.0.12 and > I have that (Debian package), so I don't know why the cleanerd failed to > start in the first place. The message indicates that initialization of the cleanerd failed. I don't know for sure, but another memory allocation failure might cause it. Thanks, Ryusuke Konishi --- Ryusuke Konishi (3): nilfs2: fix mis-conversion of error code by unlikely directive nilfs2: fix hang problem of log writer that follows on write failures nilfs2: remove incorrect warning in nilfs_dat_commit_start function fs/nilfs2/dat.c | 9 --------- fs/nilfs2/segment.c | 28 +++++++--------------------- 2 files changed, 7 insertions(+), 30 deletions(-) diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c index bb8a581..e2646c3 100644 --- a/fs/nilfs2/dat.c +++ b/fs/nilfs2/dat.c @@ -149,15 +149,6 @@ void nilfs_dat_commit_start(struct inode *dat, struct nilfs_palloc_req *req, entry = nilfs_palloc_block_get_entry(dat, req->pr_entry_nr, req->pr_entry_bh, kaddr); entry->de_start = cpu_to_le64(nilfs_mdt_cno(dat)); - if (entry->de_blocknr != cpu_to_le64(0) || - entry->de_end != cpu_to_le64(NILFS_CNO_MAX)) { - printk(KERN_CRIT - "%s: vbn = %llu, start = %llu, end = %llu, pbn = %llu\n", - __func__, (unsigned long long)req->pr_entry_nr, - (unsigned long long)le64_to_cpu(entry->de_start), - (unsigned long long)le64_to_cpu(entry->de_end), - (unsigned long long)le64_to_cpu(entry->de_blocknr)); - } entry->de_blocknr = cpu_to_le64(blocknr); kunmap_atomic(kaddr, KM_USER0); diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c index 22c7f65..e8f188b 100644 --- a/fs/nilfs2/segment.c +++ b/fs/nilfs2/segment.c @@ -1846,26 +1846,13 @@ static int nilfs_segctor_write(struct nilfs_sc_info *sci, err = nilfs_segbuf_write(segbuf, &wi); res = nilfs_segbuf_wait(segbuf, &wi); - err = unlikely(err) ? : res; + err = unlikely(err) ? err : res; if (unlikely(err)) return err; } return 0; } -static int nilfs_page_has_uncleared_buffer(struct page *page) -{ - struct buffer_head *head, *bh; - - head = bh = page_buffers(page); - do { - if (buffer_dirty(bh) && !list_empty(&bh->b_assoc_buffers)) - return 1; - bh = bh->b_this_page; - } while (bh != head); - return 0; -} - static void __nilfs_end_page_io(struct page *page, int err) { if (!err) { @@ -1889,12 +1876,11 @@ static void nilfs_end_page_io(struct page *page, int err) if (!page) return; - if (buffer_nilfs_node(page_buffers(page)) && - nilfs_page_has_uncleared_buffer(page)) - /* For b-tree node pages, this function may be called twice - or more because they might be split in a segment. - This check assures that cleanup has been done for all - buffers in a split btnode page. */ + if (buffer_nilfs_node(page_buffers(page)) && !PageWriteback(page)) + /* + * For b-tree node pages, this function may be called twice + * or more because they might be split in a segment. + */ return; __nilfs_end_page_io(page, err); @@ -1957,7 +1943,7 @@ static void nilfs_segctor_abort_write(struct nilfs_sc_info *sci, } if (bh->b_page != fs_page) { nilfs_end_page_io(fs_page, err); - if (unlikely(fs_page == failed_page)) + if (fs_page && fs_page == failed_page) goto done; fs_page = bh->b_page; } -- 1.6.2 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] nilfs2: fix hang problem after bio_alloc() failed 2009-06-18 17:34 ` [PATCH] nilfs2: fix hang problem after bio_alloc() failed Ryusuke Konishi @ 2009-06-21 11:08 ` Andi Kleen 2009-06-21 12:34 ` [NILFS users] " Ryusuke Konishi 0 siblings, 1 reply; 14+ messages in thread From: Andi Kleen @ 2009-06-21 11:08 UTC (permalink / raw) To: Ryusuke Konishi; +Cc: linux-kernel, users Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> writes: > > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c > index 22c7f65..e8f188b 100644 > --- a/fs/nilfs2/segment.c > +++ b/fs/nilfs2/segment.c > @@ -1846,26 +1846,13 @@ static int nilfs_segctor_write(struct nilfs_sc_info *sci, > err = nilfs_segbuf_write(segbuf, &wi); > > res = nilfs_segbuf_wait(segbuf, &wi); > - err = unlikely(err) ? : res; > + err = unlikely(err) ? err : res; It's very dubious gcc does anything with unlikely here anyways. They typically only work directly in conditions being tested. > if (unlikely(err)) > return err; Also gcc generally considers conditions to blocks that return unlikely, so it's actually superfluous. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [NILFS users] [PATCH] nilfs2: fix hang problem after bio_alloc() failed 2009-06-21 11:08 ` Andi Kleen @ 2009-06-21 12:34 ` Ryusuke Konishi 0 siblings, 0 replies; 14+ messages in thread From: Ryusuke Konishi @ 2009-06-21 12:34 UTC (permalink / raw) To: users, andi; +Cc: konishi.ryusuke, linux-kernel Hi Andi, On Sun, 21 Jun 2009 13:08:43 +0200, Andi Kleen <andi@firstfloor.org> wrote: > Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> writes: > > > > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c > > index 22c7f65..e8f188b 100644 > > --- a/fs/nilfs2/segment.c > > +++ b/fs/nilfs2/segment.c > > @@ -1846,26 +1846,13 @@ static int nilfs_segctor_write(struct nilfs_sc_info *sci, > > err = nilfs_segbuf_write(segbuf, &wi); > > > > res = nilfs_segbuf_wait(segbuf, &wi); > > - err = unlikely(err) ? : res; > > + err = unlikely(err) ? err : res; > > It's very dubious gcc does anything with unlikely here anyways. > They typically only work directly in conditions being tested. I got it. I'll rewrite the first bugfix so that it just removes the unlikely directive. - err = unlikely(err) ? : res; + err = err ? : res; > > if (unlikely(err)) > > return err; > > Also gcc generally considers conditions to blocks that > return unlikely, so it's actually superfluous. Thanks, I didn't know this. I'll take in this, too. > -Andi > -- > ak@linux.intel.com -- Speaking for myself only. Regards, Ryusuke Konishi ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NILFS2 get stuck after bio_alloc() fail 2009-06-14 18:13 ` Leandro Lucarella 2009-06-18 17:34 ` [PATCH] nilfs2: fix hang problem after bio_alloc() failed Ryusuke Konishi @ 2009-06-23 4:40 ` Ryusuke Konishi 2009-06-23 12:55 ` Leandro Lucarella 1 sibling, 1 reply; 14+ messages in thread From: Ryusuke Konishi @ 2009-06-23 4:40 UTC (permalink / raw) To: llucax; +Cc: konishi.ryusuke, linux-kernel, albertito, users Hi Leandro, On Sun, 14 Jun 2009 15:13:13 -0300, Leandro Lucarella <llucax@gmail.com> wrote: > Ryusuke Konishi, el 15 de junio a las 03:02 me escribiste: > [snip] > > > Here is the complete trace: > > > http://pastebin.lugmen.org.ar/4931 > > > > Thank you for your help. > > > > According to your log, there seems to be a leakage in clear processing > > of the writeback flag on pages. I will review the error path of log > > writer to narrow down the cause. > > Oh! I forgot to tell you that the cleanerd process was not running. When > I mounted the NILFS2 filesystem the cleanerd daemon said: > > nilfs_cleanerd[29534]: start > nilfs_cleanerd[29534]: cannot create cleanerd on /dev/loop0 > nilfs_cleanerd[29534]: shutdown > > I thought I might have an old utilities version and that could be because > of that (since the nilfs website was down I couldn't check), but now > I checked in the backup nilfs website that latest version is 2.0.12 and > I have that (Debian package), so I don't know why the cleanerd failed to > start in the first place. At the last weekend, several users gave me feedback on the same early failure of cleanerd process. This problem turned out to be caused by an elementaly bug in the userland library of nilfs; there was a lack of canonicalization of pathname in the library, and it caused the failure, for example, when user adds an extra slash symbol to the mount directory name. We have released nilfs2-utils-2.0.13 yesterday to resolve this mess. If you didn't know it, please try the updated package. Thanks, Ryusuke Konishi ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NILFS2 get stuck after bio_alloc() fail 2009-06-23 4:40 ` NILFS2 get stuck after bio_alloc() fail Ryusuke Konishi @ 2009-06-23 12:55 ` Leandro Lucarella 0 siblings, 0 replies; 14+ messages in thread From: Leandro Lucarella @ 2009-06-23 12:55 UTC (permalink / raw) To: Ryusuke Konishi; +Cc: linux-kernel, albertito, users Ryusuke Konishi, el 23 de junio a las 13:40 me escribiste: > Hi Leandro, > On Sun, 14 Jun 2009 15:13:13 -0300, Leandro Lucarella <llucax@gmail.com> wrote: > > Ryusuke Konishi, el 15 de junio a las 03:02 me escribiste: > > [snip] > > > > Here is the complete trace: > > > > http://pastebin.lugmen.org.ar/4931 > > > > > > Thank you for your help. > > > > > > According to your log, there seems to be a leakage in clear processing > > > of the writeback flag on pages. I will review the error path of log > > > writer to narrow down the cause. > > > > Oh! I forgot to tell you that the cleanerd process was not running. When > > I mounted the NILFS2 filesystem the cleanerd daemon said: > > > > nilfs_cleanerd[29534]: start > > nilfs_cleanerd[29534]: cannot create cleanerd on /dev/loop0 > > nilfs_cleanerd[29534]: shutdown > > > > I thought I might have an old utilities version and that could be because > > of that (since the nilfs website was down I couldn't check), but now > > I checked in the backup nilfs website that latest version is 2.0.12 and > > I have that (Debian package), so I don't know why the cleanerd failed to > > start in the first place. > > At the last weekend, several users gave me feedback on the same early > failure of cleanerd process. > > This problem turned out to be caused by an elementaly bug in the > userland library of nilfs; there was a lack of canonicalization of > pathname in the library, and it caused the failure, for example, when > user adds an extra slash symbol to the mount directory name. > > We have released nilfs2-utils-2.0.13 yesterday to resolve this mess. > If you didn't know it, please try the updated package. Ok, thanks for your help. I'll try the new nilfs-utils and the patch you sent when I have some time and report the results. -- Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/ ---------------------------------------------------------------------------- GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05) ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2009-06-23 13:37 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-06-14 1:32 NILFS2 get stuck after bio_alloc() fail Leandro Lucarella 2009-06-14 1:52 ` Alberto Bertogli 2009-06-14 6:30 ` Ryusuke Konishi 2009-06-14 7:00 ` Pekka Enberg 2009-06-14 12:34 ` Ryusuke Konishi 2009-06-14 3:45 ` Ryusuke Konishi 2009-06-14 15:32 ` Leandro Lucarella 2009-06-14 18:02 ` Ryusuke Konishi 2009-06-14 18:13 ` Leandro Lucarella 2009-06-18 17:34 ` [PATCH] nilfs2: fix hang problem after bio_alloc() failed Ryusuke Konishi 2009-06-21 11:08 ` Andi Kleen 2009-06-21 12:34 ` [NILFS users] " Ryusuke Konishi 2009-06-23 4:40 ` NILFS2 get stuck after bio_alloc() fail Ryusuke Konishi 2009-06-23 12:55 ` Leandro Lucarella
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox