* Re: Reporting a bug - Memory corruption in Linux kernel
From: Nilesh More @ 2014-03-06 20:09 UTC
To: linux-kernel

Hi all,

I am working on an Android bug in which directory entries of an ext4 file
system get corrupted when a USB drive is hotplugged (with auto-mount
support enabled).

The logs are below:

[ 413.607849] usb 2-1.1: USB disconnect, device number 12
[ 414.022630] EXT4-fs error (device mmcblk0p20): ext4_readdir:227: inode #81827: block 328308: comm installd: path /data/data/com.android.nfc/shared_prefs: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
[ 414.045204] Aborting journal on device mmcblk0p20-8.
[ 414.051217] Kernel panic - not syncing: EXT4-fs (device mmcblk0p20): panic forced after error
[ 414.051217]
[ 414.061199] CPU: 0 PID: 150 Comm: installd Not tainted 3.10.24-gfe0c16e-dirty #1
[ 414.068586] [<c0016ae8>] (unwind_backtrace+0x0/0x140) from [<c0012e94>] (show_stack+0x18/0x1c)
[ 414.077181] [<c0012e94>] (show_stack+0x18/0x1c) from [<c0853b7c>] (panic+0x94/0x1ec)
[ 414.084909] [<c0853b7c>] (panic+0x94/0x1ec) from [<c01eb634>] (ext4_handle_error+0x70/0xac)
[ 414.093241] [<c01eb634>] (ext4_handle_error+0x70/0xac) from [<c01eb7d4>] (ext4_error_file+0xc8/0x128)
[ 414.102443] [<c01eb7d4>] (ext4_error_file+0xc8/0x128) from [<c01cbcdc>] (__ext4_check_dir_entry+0xe8/0x188)
[ 414.112163] [<c01cbcdc>] (__ext4_check_dir_entry+0xe8/0x188) from [<c01cc130>] (ext4_readdir+0x3b4/0x800)
[ 414.121709] [<c01cc130>] (ext4_readdir+0x3b4/0x800) from [<c0155514>] (vfs_readdir+0x98/0xbc)
[ 414.130215] [<c0155514>] (vfs_readdir+0x98/0xbc) from [<c0155678>] (SyS_getdents64+0x6c/0xd4)
[ 414.138721] [<c0155678>] (SyS_getdents64+0x6c/0xd4) from [<c000ef80>] (ret_fast_syscall+0x0/0x30)

While trying to root-cause this issue, I see a lot of
block_invalidatepage calls at the time of the USB disk mount, through
this call stack:

add_disk->register_disk->blkdev_put->kill_bdev->truncate_inode_page->block_invalidatepage

If I prevent kill_bdev from invalidating pages, the bug no longer
reproduces. The "invalid access to FAT entry" prints, which are present
whenever the bug reproduces, also go away. Earlier, the bug also stopped
reproducing when we added delay(1) before _getblk.

This points to a loss of synchronization between _getblk and kill_bdev,
and it ALSO looks like kill_bdev inadvertently invalidates pages that
are owned by ext4.

I am going to debug further to get to the root cause. Before that, I
wanted to ask: is this A KNOWN ISSUE? If not, any suggestions that would
help me root-cause it quickly?

Thank you for your help,
Nilesh
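A note on the log above: "rec_len is smaller than minimal" is the first of several length checks that ext4 applies to each directory entry it reads. The sketch below is a simplified userspace model of those checks, not the kernel's code. The names toy_dirent, toy_check_dir_entry and DIR_REC_LEN are illustrative, and the struct is only a stand-in for the on-disk entry. It is meant to show why an entry read from an all-zero block (inode=0, rec_len=0, name_len=0, as in the log) would fail at that first check.

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest record that can hold a name of name_len bytes: 8 bytes of
 * fixed header plus the name, rounded up to a multiple of 4.
 * Modeled on the kernel's EXT4_DIR_REC_LEN() macro. */
#define DIR_REC_LEN(name_len) ((((size_t)(name_len)) + 8 + 3) & ~(size_t)3)

/* Illustrative stand-in for an ext4 directory entry header. */
struct toy_dirent {
    uint32_t inode;
    uint16_t rec_len;
    uint8_t  name_len;
    uint8_t  file_type;
};

/* Returns NULL if the entry passes the length checks modeled here,
 * otherwise a string describing the first check that failed. */
const char *toy_check_dir_entry(const struct toy_dirent *de)
{
    if (de->rec_len < DIR_REC_LEN(1))
        return "rec_len is smaller than minimal";
    if (de->rec_len % 4 != 0)
        return "rec_len % 4 != 0";
    if (de->rec_len < DIR_REC_LEN(de->name_len))
        return "rec_len is too small for name_len";
    return NULL;
}
```

Under this model, any entry with rec_len below 12 is rejected, so a zero-filled block fails on its very first entry at offset 0. That matches the values in the log, but it does not by itself say how the block came to be zeroed.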
* Re: Reporting a bug - Memory corruption in Linux kernel
From: Theodore Ts'o @ 2014-03-07 4:00 UTC
To: Nilesh More; +Cc: linux-kernel

On Fri, Mar 07, 2014 at 01:39:45AM +0530, Nilesh More wrote:
> Hi all,
>
> I am working on android bug wherein directory entries of ext4 file
> system get corrupted when USB is hotplugged (with auto mount support
> enabled).
>
> The logs as below:
> [ 413.607849] usb 2-1.1: USB disconnect, device number 12
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Hot plugged or hot unplugged? It looks like the problem is that the
block device disappeared out from under ext4. Maybe you have a flaky
SD/MMC drive (i.e., funky contacts, etc.)? Or maybe when you plug in
one USB device, the eMMC device where you have the mounted file system
disappeared?

> If I prevent kill_bdev from invalidating pages, I see a No-Repro for
> this bug. Also there are no prints saying invalid access to FAT
> entry(which were present when bug reproduces). Earlier we had no-repro
> when added delay(1) before _getblk.
>
> This points out to the loss of sync between _getblk and kill_bdev and
> ALSO looks like kill_bdev inadvertently invalidates pages which are
> Ext4 owned.

This looks like it's much more of a hardware issue than a software
issue. If you are plugging in a USB device, you should *not* be
getting a USB disconnect message. And the fact that the pages being
used by ext4 are getting invalidated would be consistent with the
theory that the USB device on which the ext4 file system was on is
somehow getting disconnected, per the message you've shown in the
logs.

- Ted
* Re: Reporting a bug - Memory corruption in Linux kernel
From: Nilesh More @ 2014-03-07 20:18 UTC
To: Theodore Ts'o, linux-kernel

Thanks, Theodore, for your quick reply.

To clarify a few things: the USB drive has a FAT file system on it, and
the ext4 file system is on the internal sdcard of the Android device.
The ext4 corruption in the /data partition occurs when the USB drive is
hotplugged or hot-unplugged. The bug may reproduce on the first hotplug
or after a couple of hotplug/unplug cycles.

The main concern here is that even if the USB drive is corrupted, that
should not result in corruption of the native file system.

Today I dug in further to see if I could get some more clues. My
findings are as follows:

1. When the USB drive is hotplugged, in the call stack of add_disk(),
blkdev_get(bdev, FMODE_READ, NULL) gets called while registering the
disk. I guess this scans the partition table, initializes the part
array, and registers the partitions in the driver model.

2. To release the ownership of bdev obtained in step #1,
blkdev_put(bdev, FMODE_READ) is called. This invalidates the pages
cached for bdev in the blkdev_get call above, after first writing these
pages back to disk.

3. If I prevent the invalidate-page call in step #2, the ext4 file
system remains intact without any corruption. That suggests that some of
the cached pages obtained in the step #1 blkdev_get call are already
being used by the ext4 file system, and once these pages are invalidated
we get corruption in the ext4 file system.

My question now is: has anybody seen a similar issue before? Could this
be a known bug?

Thank you,
Nilesh
* Re: Reporting a bug - Memory corruption in Linux kernel
From: Nilesh More @ 2014-03-07 20:55 UTC
To: Theodore Ts'o, linux-kernel

Adding three more findings:

4. The memory pages that get allocated in the blkdev_get call in step
#1 are allocated in the msdos_partition() (1 page) and efi_partition()
(2 pages) function calls. I traced
'bdev->bd_inode->i_mapping->nrpages' to track the page allocations, and
I could see it being updated in those two calls. The call stack, for
reference:

blkdev_get->__blkdev_get->rescan_partitions->check_partition->check_part[i++](state)->efi_partition/msdos_partition

5. If I prevent the page allocations in the efi_partition call by
returning early (the USB drive does not have an EFI-compliant partition
table anyway), the issue no longer reproduces. This suggests that the
page allocations from efi_partition are running into pages already
allocated to the ext4 fs.

6. One more thing I noticed: even with the page allocations in
efi_partition prevented, the "invalid access to FAT" prints are still
present. These prints do not appear if the pages are not invalidated.
This means that even the msdos_partition call does not allocate
correct/clean pages, and when these pages are written back while being
invalidated, incorrect data gets written to the FAT disk inodes
(directory inodes), which results in the "invalid access to FAT" error
prints.

I guess the next step would be to try to understand the page
allocations in the efi_partition and msdos_partition calls.

Thank you,
Nilesh
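A note on the check_part[i++](state) step in the call stack above: the partition scan walks a table of probe functions and stops at the first one that reports a match. The sketch below is a toy userspace model of that walk, under the assumption that the EFI probe is listed before the MS-DOS probe. The toy_ names and the toy_state struct are hypothetical, and the kernel's error handling is left out. It is meant only to illustrate why efi_partition would run, and read pages, even for a USB drive that turns out to carry a plain MS-DOS partition table.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's per-scan state. */
struct toy_state {
    int probes_run;   /* how many probe functions were called */
};

typedef int (*probe_fn)(struct toy_state *);

/* Each probe returns 1 if it recognized the table, 0 if not. */
int toy_efi_probe(struct toy_state *s)   { s->probes_run++; return 0; }
int toy_msdos_probe(struct toy_state *s) { s->probes_run++; return 1; }

/* NULL-terminated probe table; the order here is an assumption. */
probe_fn toy_check_part[] = { toy_efi_probe, toy_msdos_probe, NULL };

/* Call each probe in turn until one reports a match. */
int toy_check_partition(struct toy_state *s)
{
    int i = 0, res = 0;

    while (!res && toy_check_part[i])
        res = toy_check_part[i++](s);
    return res;
}
```

In this model both probes run for an MS-DOS-partitioned disk, which is consistent with finding 4, where pages were seen being allocated in both efi_partition and msdos_partition.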
* Re: Reporting a bug - Memory corruption in Linux kernel
From: Theodore Ts'o @ 2014-03-07 21:32 UTC
To: Nilesh More; +Cc: linux-kernel

On Sat, Mar 08, 2014 at 01:48:42AM +0530, Nilesh More wrote:
>
> 1. When the USB is hotplugged, in the call stack of add_disk( ),
> while registering disk blkdev_get(bdev, FMODE_READ, NULL) gets called
> which I guess scans the partition table, initializes part array and
> registers the partitions in the driver model.
>
> 2. To release the ownership of bdev obtained in step#1,
> blkdev_put(bdev, FMODE_READ) is called. This invalidates the pages
> cached for bdev in above blkdev_get call by first doing a writeback of
> these pages to disk.
>
> 3. Now if I prevent the invalidate page call in step# 2, I see that
> ext4 file system remains intact without any corruption. That suggests,
> some part of cached pages obtained in step#1 blkdev_get call is
> already being used by ext4 file system and once these pages are
> invalidated we have a corruption in ext4 file system.

Can you put in a WARN_ON(1) in blkdev_put() and blkdev_get(), so we
can see the exact call stack? Also, can you print out the value of
bdev->bd_dev and bdev->bd_openers at the beginning of blkdev_put() and
blkdev_get()?

I am not convinced that your analysis is correct, given the "USB
disconnect" message. So let's see the exact call stack for the calls
to blkdev_get() and blkdev_put(), and see exactly which device is
getting obtained and released.

> My query now is, has anybody seen similar kind of issue before ? Could
> this be a known bug ?

Nothing like this before, no. Note that the invalidate_pages() in
blkdev_put() only happens when bdev->bd_openers drops down to zero. If
the file system is mounted, then bd_openers will be one. So even if
someone is calling blkdev_get() and blkdev_put() on the file system,
bd_openers will not drop to zero. Also, the USB device would be a
different bdev than the one for the system disk.

So your theory simply doesn't make any sense to me. If you think that
is really what's going on, let's put in the debugging printk's that
show exactly which device and the bd_openers count for each call to
blkdev_put() and blkdev_get(), and then let's get the precise stack
trace used when the pages get invalidated.

This pattern:

[ 413.607849] usb 2-1.1: USB disconnect, device number 12
[ 414.022630] EXT4-fs error (device mmcblk0p20): ext4_readdir:227: inode #81827: block 328308: comm installd...

is the normal thing that one would expect if someone yanks the USB
device or a SD card containing a mounted file system from the system.
Any theory of what's going on that doesn't account for the "USB
disconnect" message is going to be fundamentally incomplete.

Cheers,

- Ted
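A note on the bd_openers argument above: it can be sketched as a small userspace model. This is only an illustration of the counting rule described in the message, not kernel code. The toy_bdev struct and the toy_blkdev_get/toy_blkdev_put functions are hypothetical, and the page counts are made-up numbers.

```c
/* Hypothetical per-device state, one instance per block device. */
struct toy_bdev {
    int openers;        /* models bdev->bd_openers */
    int cached_pages;   /* models bdev->bd_inode->i_mapping->nrpages */
    int invalidations;  /* how many times this device's cache was dropped */
};

/* Take a reference and pretend some pages were read into the cache. */
void toy_blkdev_get(struct toy_bdev *b, int pages_read)
{
    b->openers++;
    b->cached_pages += pages_read;
}

/* Drop a reference; the cache is dropped only when the last opener goes away. */
void toy_blkdev_put(struct toy_bdev *b)
{
    if (--b->openers == 0) {
        b->cached_pages = 0;
        b->invalidations++;
    }
}
```

In this model, a mounted device starts with openers at 1, so a transient get/put pair takes it to 2 and back to 1 without ever dropping its cache. A freshly added USB disk starts at 0, so the same pair drops only that disk's own pages. Each device has its own struct, which mirrors the point that the USB disk and the system disk are different bdevs.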
Thread overview: 5+ messages
     [not found] <CAMbOQaUW7K=FBESVeo=BOXfYU6cuqthjnmMR6jmeNFnx8PcvuQ@mail.gmail.com>
2014-03-06 20:09 ` Reporting a bug - Memory corruption in Linux kernel Nilesh More
2014-03-07  4:00   ` Theodore Ts'o
2014-03-07 20:18     ` Nilesh More
2014-03-07 20:55       ` Nilesh More
2014-03-07 21:32       ` Theodore Ts'o