Re: Reporting a bug - Memory corruption in Linux kernel

public inbox for linux-kernel@vger.kernel.org

* Re: Reporting a bug - Memory corruption in Linux kernel
       [not found] <CAMbOQaUW7K=FBESVeo=BOXfYU6cuqthjnmMR6jmeNFnx8PcvuQ@mail.gmail.com>
@ 2014-03-06 20:09 ` Nilesh More
  2014-03-07  4:00   ` Theodore Ts'o
  0 siblings, 1 reply; 5+ messages in thread
From: Nilesh More @ 2014-03-06 20:09 UTC (permalink / raw)
  To: linux-kernel

Hi all,

I am working on an Android bug in which directory entries of an ext4
file system get corrupted when a USB drive is hotplugged (with
auto-mount support enabled).

The logs are below:
[ 413.607849] usb 2-1.1: USB disconnect, device number 12
[ 414.022630] EXT4-fs error (device mmcblk0p20): ext4_readdir:227:
inode #81827: block 328308: comm installd: path
/data/data/com.android.nfc/shared_prefs: bad entry in directory:
rec_len is smaller than minimal- offset=0(0), inode=0, rec_len=0,
name_len=0
[ 414.045204] Aborting journal on device mmcblk0p20-8.
[ 414.051217] Kernel panic- not syncing: EXT4-fs (device mmcblk0p20):
panic forced after error
[ 414.051217]
[ 414.061199] CPU: 0 PID: 150 Comm: installd Not tainted
3.10.24-gfe0c16e-dirty #1
[ 414.068586] [<c0016ae8>] (unwind_backtrace+0x0/0x140) from
[<c0012e94>] (show_stack+0x18/0x1c)
[ 414.077181] [<c0012e94>] (show_stack+0x18/0x1c) from [<c0853b7c>]
(panic+0x94/0x1ec)
[ 414.084909] [<c0853b7c>] (panic+0x94/0x1ec) from [<c01eb634>]
(ext4_handle_error+0x70/0xac)
[ 414.093241] [<c01eb634>] (ext4_handle_error+0x70/0xac) from
[<c01eb7d4>] (ext4_error_file+0xc8/0x128)
[ 414.102443] [<c01eb7d4>] (ext4_error_file+0xc8/0x128) from
[<c01cbcdc>] (__ext4_check_dir_entry+0xe8/0x188)
[ 414.112163] [<c01cbcdc>] (__ext4_check_dir_entry+0xe8/0x188) from
[<c01cc130>] (ext4_readdir+0x3b4/0x800)
[ 414.121709] [<c01cc130>] (ext4_readdir+0x3b4/0x800) from
[<c0155514>] (vfs_readdir+0x98/0xbc)
[ 414.130215] [<c0155514>] (vfs_readdir+0x98/0xbc) from [<c0155678>]
(SyS_getdents64+0x6c/0xd4)
[ 414.138721] [<c0155678>] (SyS_getdents64+0x6c/0xd4) from
[<c000ef80>] (ret_fast_syscall+0x0/0x30)
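
For readers following the log above, here is a minimal userspace sketch of why an all-zero directory block produces the "rec_len is smaller than minimal" message. It is modeled on the first few sanity checks that ext4 applies to a directory entry, under simplifying assumptions: the struct, enum, and function names (toy_dirent, toy_dir_rec_len, toy_check_dir_entry) are hypothetical, the large-block rec_len encoding is ignored, and the range and inode-bounds checks are left out. It is not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a directory entry header (8 bytes). */
struct toy_dirent {
    uint32_t inode;     /* 0 means the entry is unused */
    uint16_t rec_len;   /* distance in bytes to the next entry */
    uint8_t  name_len;  /* length of the name that follows the header */
    uint8_t  file_type;
};

enum toy_dir_err {
    DIR_OK = 0,
    DIR_REC_LEN_BELOW_MIN,          /* "rec_len is smaller than minimal" */
    DIR_REC_LEN_UNALIGNED,          /* rec_len is not a multiple of 4 */
    DIR_REC_LEN_TOO_SMALL_FOR_NAME  /* rec_len cannot hold name_len bytes */
};

/* Space an entry needs: 8-byte header plus the name, rounded up to 4. */
unsigned int toy_dir_rec_len(unsigned int name_len)
{
    return (name_len + 8 + 3) & ~3u;
}

/* Apply the checks in order and report the first one that fails. */
enum toy_dir_err toy_check_dir_entry(const struct toy_dirent *de)
{
    assert(de != NULL);
    if (de->rec_len < toy_dir_rec_len(1))
        return DIR_REC_LEN_BELOW_MIN;
    if (de->rec_len % 4 != 0)
        return DIR_REC_LEN_UNALIGNED;
    if (de->rec_len < toy_dir_rec_len(de->name_len))
        return DIR_REC_LEN_TOO_SMALL_FOR_NAME;
    return DIR_OK;
}
```

An entry with inode=0, rec_len=0, name_len=0, as reported in the log, fails the very first check, because the smallest possible entry (a one-character name) already needs 12 bytes. That is what a directory block read back as zeroes would look like.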


While trying to root-cause this issue, I noticed that at the time of
the USB disk mount there are a lot of block_invalidatepage calls
through this call stack:
add_disk->register_disk->blkdev_put->kill_bdev->truncate_inode_page->block_invalidatepage

If I prevent kill_bdev from invalidating pages, the bug no longer
reproduces. The "invalid access to FAT entry" prints, which are
present when the bug reproduces, are also gone. Earlier, the bug also
stopped reproducing when we added a delay(1) before _getblk.

This points to a loss of synchronization between _getblk and
kill_bdev, and it ALSO looks like kill_bdev inadvertently invalidates
pages that are owned by ext4.

I am going to debug further to try to get to the root cause. Before
that, I wanted to ask: is this A KNOWN ISSUE? If not, any suggestions
that would help me root-cause it quickly?

Thank you for your help,
Nilesh



* Re: Reporting a bug - Memory corruption in Linux kernel
  2014-03-06 20:09 ` Reporting a bug - Memory corruption in Linux kernel Nilesh More
@ 2014-03-07  4:00   ` Theodore Ts'o
  2014-03-07 20:18     ` Nilesh More
  0 siblings, 1 reply; 5+ messages in thread
From: Theodore Ts'o @ 2014-03-07  4:00 UTC (permalink / raw)
  To: Nilesh More; +Cc: linux-kernel

On Fri, Mar 07, 2014 at 01:39:45AM +0530, Nilesh More wrote:
> Hi all,
> 
> I am working on android bug wherein directory entries of ext4 file
> system get corrupted when USB is hotplugged (with auto mount support
> enabled).
> 
> The logs as below:
> [ 413.607849] usb 2-1.1: USB disconnect, device number 12
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Hot plugged or hot unplugged?  It looks like the problem is that the
block device disappeared out from under ext4.  Maybe you have a flaky
SD/MMC drive (i.e., funky contacts, etc.)?  Or maybe when you plug in
one USB device, the eMMC device where you have the mounted file system
disappeared?

> If I prevent kill_bdev from invalidating pages, I see a No-Repro for
> this bug. Also there are no prints saying invalid access to FAT
> entry(which were present when bug reproduces). Earlier we had no-repro
> when added delay(1) before _getblk.
> 
> This points out to the loss of sync between _getblk and kill_bdev and
> ALSO looks like kill_bdev inadvertently invalidates pages which are
> Ext4 owned.

This looks like it's much more of a hardware issue than a software
issue.  If you are plugging in a USB device, you should *not* be
getting a USB disconnect message.  And the fact that the pages being
used by ext4 are getting invalidated would be consistent with the
theory that the USB device on which the ext4 file system resides is
somehow getting disconnected, per the message you've shown in the
logs.

						- Ted


* Re: Reporting a bug - Memory corruption in Linux kernel
  2014-03-07  4:00   ` Theodore Ts'o
@ 2014-03-07 20:18     ` Nilesh More
  2014-03-07 20:55       ` Nilesh More
  2014-03-07 21:32       ` Theodore Ts'o
  0 siblings, 2 replies; 5+ messages in thread
From: Nilesh More @ 2014-03-07 20:18 UTC (permalink / raw)
  To: Theodore Ts'o, linux-kernel

Thanks Theodore for your quick reply.

To make a few things clear: the USB drive has a FAT file system on it,
and the ext4 file system is on the internal SD card of the Android
device. The ext4 corruption in the /data partition occurs when the USB
drive is hotplugged or hot-unplugged. The bug may reproduce on the
first hotplug or after a couple of hotplug/unplug cycles.

The main concern here is that even if the USB drive is corrupted, that
should not result in corruption of the native file system.

Today I dug in further to see if I could get some more clues.
Here are my findings:

1. When the USB drive is hotplugged, blkdev_get(bdev, FMODE_READ, NULL)
gets called in the call stack of add_disk() while the disk is being
registered. I guess this scans the partition table, initializes the
part array, and registers the partitions in the driver model.

2. To release the ownership of the bdev obtained in step#1,
blkdev_put(bdev, FMODE_READ) is called. This invalidates the pages
cached for the bdev during the blkdev_get call above, after first
writing these pages back to disk.

3. If I prevent the invalidate-page call in step#2, I see that the
ext4 file system remains intact, without any corruption. That suggests
that some of the cached pages obtained in the step#1 blkdev_get call
are already being used by the ext4 file system, and once these pages
are invalidated we get corruption in the ext4 file system.

My question now is: has anybody seen a similar issue before? Could
this be a known bug?

Thank you,
Nilesh



* Re: Reporting a bug - Memory corruption in Linux kernel
  2014-03-07 20:18     ` Nilesh More
@ 2014-03-07 20:55       ` Nilesh More
  2014-03-07 21:32       ` Theodore Ts'o
  1 sibling, 0 replies; 5+ messages in thread
From: Nilesh More @ 2014-03-07 20:55 UTC (permalink / raw)
  To: Theodore Ts'o, linux-kernel

Adding three more findings:

4. The memory pages that get allocated during the blkdev_get call in
step#1 are allocated in msdos_partition() (1 page) and in
efi_partition() (2 pages). I traced
'bdev->bd_inode->i_mapping->nrpages' to track the page allocations and
could see it being updated in those two function calls. The call stack
for reference:
blkdev_get->__blkdev_get->rescan_partitions->check_partition->
check_part[i++](state)->efi_partition/msdos_partition

5. If I prevent the page allocations in efi_partition by returning
early (the USB drive does not have an EFI-compliant partition table
anyway), the issue no longer reproduces. This suggests that the page
allocations made from efi_partition are running into pages already
allocated to the ext4 fs.

6. One more thing I noticed: even with the page allocations in
efi_partition prevented, the "invalid access to FAT" prints are still
present. These prints are not there if the pages are not invalidated.
This means that even the msdos_partition call does not allocate
correct/clean pages, and when these pages are written back while
being invalidated, incorrect data gets written to the FAT disk
inodes (directory inodes), which results in the "invalid access to
FAT" error prints.


I guess the next step would be to try to understand the page
allocations in the efi_partition and msdos_partition calls.

Thank you,
Nilesh



* Re: Reporting a bug - Memory corruption in Linux kernel
  2014-03-07 20:18     ` Nilesh More
  2014-03-07 20:55       ` Nilesh More
@ 2014-03-07 21:32       ` Theodore Ts'o
  1 sibling, 0 replies; 5+ messages in thread
From: Theodore Ts'o @ 2014-03-07 21:32 UTC (permalink / raw)
  To: Nilesh More; +Cc: linux-kernel

On Sat, Mar 08, 2014 at 01:48:42AM +0530, Nilesh More wrote:
> 
> 1. When the USB is hotplugged, in the call stack of add_disk( ),
> while registering disk blkdev_get(bdev, FMODE_READ, NULL) gets called
> which I guess scans the partition table, initializes part array and
> registers the partitions in the driver model.
> 
> 2. To release the ownership of bdev obtained in step#1,
> blkdev_put(bdev, FMODE_READ) is called. This invalidates the pages
> cached for bdev in above blkdev_get call by first doing a writeback of
> these pages to disk.
> 
> 3. Now if I prevent the invalidate page call in step# 2, I see that
> ext4 file system remains intact without any corruption. That suggests,
> some part of cached pages obtained in step#1 blkdev_get call is
> already being used by ext4 file system and once these pages are
> invalidated we have a corruption in ext4 file system.

Can you put in a WARN_ON(1) in blkdev_put() and blkdev_get(), so we
can see the exact call stack?  Also, can you print out the value of
the bdev->bd_dev and bdev->bd_openers at the beginning of blkdev_put()
and blkdev_get()?

I am not convinced that your analysis is correct, given the "USB
disconnect" message.  So let's see the exact call stack for the calls
to blkdev_get() and blkdev_put(), and see exactly which device is
getting obtained and released.

> My query now is, has anybody seen similar kind of issue before ? Could
> this be a known bug ?

Nothing like this before, no.  Note that the invalidate_pages() in
blkdev_put() only happens when bdev->bd_openers drops down to zero.
If the file system is mounted, then bd_openers will be one.  So even
if someone is calling blkdev_get() and blkdev_put() on the file
system, bd_openers will not drop to zero.

Also, the USB device would be a different bdev than the one for the
system disk.  So your theory simply doesn't make any sense to me.  If
you think that is really what's going on, let's put in the debugging
printk's that show exactly which device and the bd_openers count for
each call to blkdev_put() and blkdev_get(), and then let's get the
precise stack trace used when the pages get invalidated.
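
The two points above (invalidation happens only when the opener count reaches zero, and each device has its own page cache) can be sketched as a small userspace model. This is a toy under stated assumptions, not kernel code: struct toy_bdev and the toy_* functions are hypothetical names, the counters only imitate bd_openers and nrpages, and the trace line merely stands in for the kind of debugging printk requested above.

```c
#include <assert.h>
#include <stdio.h>

/* Toy stand-in for a block device; not the kernel's struct block_device. */
struct toy_bdev {
    const char *name;
    int openers;        /* imitates bdev->bd_openers */
    int cached_pages;   /* imitates bdev->bd_inode->i_mapping->nrpages */
    int invalidations;  /* how many times this device's cache was dropped */
};

/* Log which device is involved and its opener count, at function entry. */
void toy_trace(const char *fn, const struct toy_bdev *b)
{
    fprintf(stderr, "%s: dev=%s openers=%d nrpages=%d\n",
            fn, b->name, b->openers, b->cached_pages);
}

void toy_blkdev_get(struct toy_bdev *b)
{
    toy_trace("toy_blkdev_get", b);
    b->openers++;
}

/* Reading through an open device (e.g. a partition scan) fills its cache. */
void toy_read_pages(struct toy_bdev *b, int n)
{
    assert(b->openers > 0);
    b->cached_pages += n;
}

/* Only the last closer drops the cache, and only this device's cache. */
void toy_blkdev_put(struct toy_bdev *b)
{
    toy_trace("toy_blkdev_put", b);
    assert(b->openers > 0);
    if (--b->openers == 0) {
        b->cached_pages = 0;
        b->invalidations++;
    }
}
```

In this model, a get/put pair on a device that is held open by a mount leaves its cached pages alone, and a get/put pair on a second device never touches the first device's counters. If the real traces showed otherwise, that would be the interesting data point.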

This pattern:

[ 413.607849] usb 2-1.1: USB disconnect, device number 12
[ 414.022630] EXT4-fs error (device mmcblk0p20): ext4_readdir:227: inode #81827: block 328308: comm installd...

is the normal thing that one would expect if someone yanks the USB
device or a SD card containing a mounted file system from the system.
Any theory of what's going on that doesn't account for the "USB
disconnect" message is going to be fundamentally incomplete.

Cheers,

						- Ted

