* ext4 filesystem corruption across partitions
@ 2014-04-17 15:05 Devrin Talen
2014-04-17 16:12 ` Theodore Ts'o
0 siblings, 1 reply; 5+ messages in thread
From: Devrin Talen @ 2014-04-17 15:05 UTC
To: linux-ext4
Hi all,
I'm debugging an issue on my platform. In short, I can corrupt an ext4
filesystem on one partition by writing a file on a different one. I'm
suspecting something is off either with my partition table or filesystem
parameters, but I'm such an ext4 beginner that I thought I'd start here
to get some help in where to look.
If I run this (which writes a relatively large file to partition 12):
dd if=/dev/zero of=/cache/goingtodie bs=4096 count=120000
Then (after rebooting) I'll get an ext4 error like this on partition 13:
EXT4-fs error (device mmcblk0p13): ext4_readdir:214: inode
#102545: block 426479: comm er.ServerThread:
path /data/app-private: bad entry in directory: rec_len is
smaller than minimal - offset=0(0), inode=0, rec_len=0, na0
My Android system runs a slightly modified 3.0.31 kernel with 4GB eMMC
as the block device. My partition table is set up as such (located in
the first 34 blocks):
lba size = 512
lba_start partition_size name
========= ====================== ==============
34 97280( 95K) environment
224 16384( 16K) crypto
256 393216( 384K) xloader
1024 524288( 512K) bootloader
2048 524288( 512K) device_info
3072 524288( 512K) bootloader2
4096 524288( 512K) misc
5120 8388608( 8M) recovery
21504 8388608( 8M) boot
37888 16777216( 16M) efs
70656 1073741824( 1024M) system
2167808 536870912( 512M) cache
3216384 2195193856( 2093M) userdata
========= ====================== ==============
The two filesystems in question are the cache and userdata partitions.
These are created with the following make_ext4fs[1] commands and then
flashed to eMMC:
% make_ext4fs -s -L cache -l 536870912 cache.img cache
Creating filesystem with parameters:
Size: 536870912
Block size: 4096
Blocks per group: 32768
Inodes per group: 8192
Inode size: 256
Journal blocks: 2048
Label: cache
Blocks: 131072
Block groups: 4
Reserved block group size: 31
Created filesystem with 11/32768 inodes and 4206/131072 blocks
% make_ext4fs -s -l 2143744K -a data userdata.img data/
Creating filesystem with parameters:
Size: 2195193856
Block size: 4096
Blocks per group: 32768
Inodes per group: 7888
Inode size: 256
Journal blocks: 8374
Label:
Blocks: 535936
Block groups: 17
Reserved block group size: 135
Created filesystem with 11/134096 inodes and 17614/535936 blocks
Any help is very much appreciated. Does anyone see anything amiss, or
that I should try looking into? If there's any more information that's
needed just let me know. Or if you think there's a better mailing list
for me to take this to, I can do that too.
[1]:
https://android.googlesource.com/platform/system/extras/+/fb109b894a5fc2891e49ec8e81c0dda171b45b7f/ext4_utils/make_ext4fs_main.c
--
Devrin Talen <dct23@cornell.edu>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: ext4 filesystem corruption across partitions
From: Theodore Ts'o @ 2014-04-17 16:12 UTC
To: Devrin Talen; +Cc: linux-ext4
On Thu, Apr 17, 2014 at 11:05:23AM -0400, Devrin Talen wrote:
> Hi all,
>
> I'm debugging an issue on my platform. In short, I can corrupt an ext4
> filesystem on one partition by writing a file on a different one. I'm
> suspecting something is off either with my partition table or filesystem
> parameters, but I'm such an ext4 beginner that I thought I'd start here
> to get some help in where to look.
The partition table looks fine. (I took the lba_start and
partition_size fields from your table, imported them into a
spreadsheet, and verified that "lba_start + partition_size/512" for
each partition equals the lba_start of the next partition. So there
is no overlap between partitions.)
The kernel is supposed to make sure that writes in one partition can't
affect another partition, so either you have a kernel bug in the block
device layer or driver, or you have a hardware problem.
I hate to ask this, but are you sure you have a quality 4GB SD card?
There are fraudulent cards out there that are marked as having X GB
but really only have Y GB, or even Y MB, worth of flash. The people
making these fraudulent cards rely on the fact that people very often
don't actually fill up their flash cards, so as long as they don't
write to more than Y GB worth of disk sectors, they won't notice
anything wrong. But if the card really holds only N sectors of flash
and you write to more than N unique sectors, then the (N+I)th unique
disk sector write ends up going to the Ith disk sector that had been
written.
- Ted
* Re: ext4 filesystem corruption across partitions
From: Devrin Talen @ 2014-05-06 2:01 UTC
To: Theodore Ts'o; +Cc: linux-ext4
On Thu, 17 Apr 2014 12:12:49 -0400
"Theodore Ts'o" <tytso@mit.edu> wrote:
> On Thu, Apr 17, 2014 at 11:05:23AM -0400, Devrin Talen wrote:
> > Hi all,
> >
> > I'm debugging an issue on my platform. In short, I can corrupt an
> > ext4 filesystem on one partition by writing a file on a different
> > one. I'm suspecting something is off either with my partition
> > table or filesystem parameters, but I'm such an ext4 beginner that
> > I thought I'd start here to get some help in where to look.
>
> The partition table looks fine. (What I did was to take the lba_start
> and partition_size fields from your table, imported them into a
> spreadsheet, and then verified that "lba_start + partition_size/512"
> for each partition was the same as the lba_start of the next
> partition. Obviously, there is no partition table overlap.)
Ted, thanks for the response. I wanted to reply sooner but I had to
make sure I had a good way to reproduce the filesystem corruption
before getting back.
As for the partition table, that's what I thought too, but it helps to
have a second pair of eyes on it. Thanks!
> The kernel is supposed to make sure that writes in one partition can't
> affect another partition, so either you have a kernel bug in the block
> device layer or driver, or you have a hardware problem.
That could be. We're fairly certain it's not electrical, just because
of how simple the hookups are to our CPU, but it wouldn't be surprising
if there's some setting on the eMMC part that we're missing. Anyway,
here's how we've been able to get this to reproduce fairly reliably:
1. Run `ls -R *` in a loop from the root directory. The root is
mounted from partition 11 (system) on the eMMC and the ls will read
the /cache (partition 12) and /data (partition 13) filesystems as well.
2. Write data to partition 12 via ADB (using `adb push ... /cache/`)
Doing these two things, we'll get ext4 errors reported on partition
13. I'll get the exact error messages when I'm back at my desk
tomorrow.
Fortunately, we managed to capture the failure while printing out the
trace of eMMC commands from the block driver. It's a large file, but
if someone would find that useful I think I can make it available
somehow.
> I hate to ask this, but are you sure you have a quality 4GB sd card?
> There are fraudulent cards out there where a card will be marked as
> having X GB, but it only really has Y GB, or even Y MB worth of flash.
> The people making these fraudulent cards rely on the fact that very
> often people don't actually fill up their flash cards, so as long as
> they don't write to more than Y GB worth of disk sectors, they won't
> notice anything wrong. But if you do write to more sectors than there
> is flash, then the N+Ith unique disk sector write ends up going to the
> Ith disk sector that had been written.
That's a good point, but we're actually using a Micron eMMC part
soldered to our board, so it had better be as big as they advertise :).
Again, thanks for the help so far and hope that we can track this down.
--
Devrin Talen <dct23@cornell.edu>
* Re: ext4 filesystem corruption across partitions
From: Theodore Ts'o @ 2014-05-06 19:40 UTC
To: Devrin Talen; +Cc: linux-ext4
On Mon, May 05, 2014 at 10:01:30PM -0400, Devrin Talen wrote:
>
> 1. Run `ls -R *` in a loop from the root directory. The root is
> mounted from partition 11 (system) on the eMMC and the ls will read
> the /cache (partition 12) and /data (partition 13) filesystems as well.
Try mounting /data read-only. That should all but guarantee that
nothing can write to it. You can also use blktrace to capture block
I/O traces for the device, and use them to confirm that nothing was
actually writing to it.
> 2. Write data to partition 12 via ADB (using `adb push ... /cache/`)
Instead of using ADB, I would suggest writing a test program that
writes a series of 512-byte sectors to a single large file in /cache.
At the beginning of each 512-byte sector, include a 4-byte serial
number (incremented by one for each sector), a 4-byte test ID that is
different for each run of your test program, a timestamp, and a CRC
of these fields, and then fill the rest of the sector with some text
string to make the pattern easy to recognize. It can be anything
from 0xDEADBEEF to a string such as "DEBUGGING RANDOM HW BUGS REALLY
SUCKS". :-)
Now try to reproduce the problem with this write load. If you can
reproduce it, check whether the corrupted file system block in /data
shows evidence of the string that was supposed to be written into
/cache. You can also check that the large file written in /cache has
the expected serial numbers and checksums.
This will let you see whether the block writes are just going to the
wrong place on the SSD, or whether something stranger is going on.
Depending on the pattern of which blocks end up where they shouldn't,
it might point towards different possible causes (e.g., a flaky
solder joint, a buggy flash translation layer in the eMMC chip,
etc.)
Cheers,
- Ted
* Re: ext4 filesystem corruption across partitions
From: Andreas Dilger @ 2014-05-06 22:38 UTC
To: Devrin Talen; +Cc: Theodore Ts'o, Ext4 Developers List
On May 6, 2014, at 1:40 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, May 05, 2014 at 10:01:30PM -0400, Devrin Talen wrote:
>> 2. Write data to partition 12 via ADB (using `adb push ... /cache/`)
>
> Instead of using ADB, I would suggest writing a test program which
> writes a series of 512 byte sectors to a single large file in /cache.
> At the beginning of each 512 byte sector include a 4 byte serial
> number (which is incremented by one for each sector), a 4 byte testID
> which is different for each run of your test program, a time stamp, a
> CRC of these fields, and then fill the rest of the sector with some
> text string to make it easy to recognize this pattern. It can be
> anything from 0xDEADBEEF, to a string such as "DEBUGGING RANDOM HW
> BUGS REALLY SUCKS". :-)
We wrote a tool "llverfs" to do this years ago, for debugging problems
with >16TB LUN sizes and other 64/32-bit address truncation problems:
http://git.hpdd.intel.com/?p=fs/lustre-release.git;a=blob;f=lustre/utils/llverfs.c
It either partially fills the filesystem (writing all files, then
reading all files, with one write per MB) to do a fast test of the
system, or it can optionally fill the filesystem completely, writing
every 4kB block and then reading it back to verify the data.
Each block contains the inode number, block offset, and a timestamp
(to distinguish between separate runs) so that it can detect where
badly written data is coming from.
There is a companion tool for block-device testing:
http://git.hpdd.intel.com/?p=fs/lustre-release.git;a=blob;f=lustre/utils/llverdev.c
Caveat: we only ever use llverfs on disposable filesystems, and while
I don't _think_ it will clobber other files that already exist, I've
never tested it that way. llverdev, of course, overwrites the whole
block device, so it will erase all data on the device it is pointed at.
Cheers, Andreas
> Now try to reproduce the problem with this write load. If you can
> reproduce the problem, check and see if the corrupted file system
> block in /data shows evidence of the string that was supposed to be
> written into /cache. You can also check that the
> large file being written in /cache has the expected serial number
> and checksum.
>
> This will allow you to see if the block writes are just going to the
> wrong place on the SSD, or something else more strange might be going
> on. Depending on the pattern of what blocks are ending up where they
> shouldn't, it might point towards different possible causes (i.e., a
> flaky solder joint, a buggy flash translation layer in the eMMC chip,
> etc.)
>
> Cheers,
>
> - Ted