* file corruption issue
From: Patrick Shirkey @ 2012-05-11 1:27 UTC (permalink / raw)
To: xfs
Hi,
I have some HP machines running CentOS:
kernel 2.6.32-042stab049.6
AMD Opteron(tm) Processor 6180 SE
RAM: 528 GB
RAID bus controller: Hewlett-Packard Company Smart Array G6 controllers
We have experienced some kernel crashes, caused by a kernel bug with
RAM interleaving on this hardware, which require a hard reset of the
machines.
After reboot we are finding severe file corruption on the XFS filesystem:
TBs of read-only databases are getting partially or fully truncated.
Has anyone come across this or anything similar?
We don't think it is related to the drive write cache, given the amount of
data being corrupted.
rgds,
--
Patrick Shirkey
Boost Hardware Ltd
Kernel trace below:
======
May 10 20:49:42 h4 kernel: [586068.444002] BUG: soft lockup - CPU#0 stuck
for 67s! [python:173511]
May 10 20:49:42 h4 kernel: [586068.444002] Modules linked in: vzethdev
simfs vzrst vzcpt nfs lockd fscache nfs_acl auth_rpcgss vzdquota
ip6table_mangle xt_length xt_hl xt_tcpmss xt_TCPMSS xt_multiport xt_limit
xt_dscp vzevent mptctl mptbase autofs4 sunrpc vznetdev vzmon vzdev bonding
ipt_REJECT iptable_filter iptable_mangle ipt_MASQUERADE iptable_nat nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables ip6t_REJECT nf_conntrack_ipv6
nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xfs
exportfs power_meter hpilo hpwdt netxen_nic microcode sg serio_raw k10temp
amd64_edac_mod edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2
sd_mod crc_t10dif sr_mod cdrom hpsa ata_generic pata_acpi pata_atiixp ahci
radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror
dm_region_hash dm_log dm_mod [last unloaded: mperf]
May 10 20:49:42 h4 kernel: [586068.444002] CPU 0
May 10 20:49:42 h4 kernel: [586068.444002] Modules linked in: vzethdev
simfs vzrst vzcpt nfs lockd fscache nfs_acl auth_rpcgss vzdquota
ip6table_mangle xt_length xt_hl xt_tcpmss xt_TCPMSS xt_multiport xt_limit
xt_dscp vzevent mptctl mptbase autofs4 sunrpc vznetdev vzmon vzdev bonding
ipt_REJECT iptable_filter iptable_mangle ipt_MASQUERADE iptable_nat nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables ip6t_REJECT nf_conntrack_ipv6
nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xfs
exportfs power_meter hpilo hpwdt netxen_nic microcode sg serio_raw k10temp
amd64_edac_mod edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2
sd_mod crc_t10dif sr_mod cdrom hpsa ata_generic pata_acpi pata_atiixp ahci
radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror
dm_region_hash dm_log dm_mod [last unloaded: mperf]
May 10 20:49:42 h4 kernel: [586068.444002]
May 10 20:49:42 h4 kernel: [586068.444002] Pid: 173511, comm: python veid:
430 Not tainted 2.6.32-042stab049.6 #1 042stab049_6 HP ProLiant DL585 G7
May 10 20:49:42 h4 kernel: [586068.444002] RIP: 0010:[<ffffffff8114033e>]
[<ffffffff8114033e>] shrink_zone+0x21e/0x9a0
May 10 20:49:42 h4 kernel: [586068.444002] RSP: 0000:ffff8818fecab9a8
EFLAGS: 00000286
May 10 20:49:42 h4 kernel: [586068.444002] RAX: ffff8850400192a8 RBX:
ffff8818fecaba68 RCX: ffff8818fecaba10
May 10 20:49:42 h4 kernel: [586068.444002] RDX: 0000000000000000 RSI:
28f5c28f5c28f5c3 RDI: ffff8850400192a8
May 10 20:49:42 h4 kernel: [586068.444002] RBP: ffffffff8100bcce R08:
0000000000000000 R09: 0000000000000000
May 10 20:49:42 h4 kernel: [586068.444002] R10: 0000000000000001 R11:
0000000000000020 R12: ffff8818fecab998
May 10 20:49:42 h4 kernel: [586068.444002] R13: ffffffff8100bcce R14:
ffffffff81137d77 R15: ffff8818fecab958
May 10 20:49:42 h4 kernel: [586068.444002] FS: 00007f6c25324700(0000)
GS:ffff880028200000(0000) knlGS:00000000b77e76c0
May 10 20:49:42 h4 kernel: [586068.444002] CS: 0010 DS: 0000 ES: 0000
CR0: 0000000080050033
May 10 20:49:42 h4 kernel: [586068.444002] CR2: 0000000000434020 CR3:
0000007a892ab000 CR4: 00000000000006f0
May 10 20:49:42 h4 kernel: [586068.444002] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
May 10 20:49:42 h4 kernel: [586068.444002] DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
May 10 20:49:42 h4 kernel: [586068.444002] Process python (pid: 173511,
veid: 430, threadinfo ffff8818fecaa000, task ffff88190ee5a600)
May 10 20:49:42 h4 kernel: [586068.444002] Stack:
May 10 20:49:42 h4 kernel: [586068.444002] 0000000000000000
ffff8818fecaba38 ffff885037bab180 00000064fecaba68
May 10 20:49:42 h4 kernel: [586068.444002] <0> 00ff881800000000
0000000000000020 ffff8850400192a8 000000008109fd79
May 10 20:49:42 h4 kernel: [586068.444002] <0> 0000000000000000
ffff885040010e40 0000000000000000 0000000000000000
May 10 20:49:42 h4 kernel: [586068.444002] Call Trace:
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff8109fd79>] ?
ktime_get_ts+0xa9/0xe0
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff81140d54>] ?
do_try_to_free_pages+0x294/0x7f0
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff811413f7>] ?
try_to_free_gang_pages+0x77/0xf0
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff8113e040>] ?
isolate_pages_global+0x0/0x520
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff810a9845>] ?
ub_try_to_free_pages+0x45/0x130
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff810a999b>] ?
__ub_check_ram_limits+0x6b/0x90
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff81153c25>] ?
__do_fault+0x565/0x600
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff8119c35e>] ?
__link_path_walk+0x88e/0x1060
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff81153db9>] ?
handle_pte_fault+0xf9/0xd00
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff811ad180>] ?
mntput_no_expire+0x30/0x110
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff811ad180>] ?
mntput_no_expire+0x30/0x110
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff81154ba4>] ?
handle_mm_fault+0x1e4/0x2b0
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff8119a2e5>] ?
putname+0x35/0x50
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff81042aa9>] ?
__do_page_fault+0x139/0x480
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff811922b4>] ?
cp_new_stat+0xe4/0x100
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff814e4a7e>] ?
do_page_fault+0x3e/0xa0
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff814e1e25>] ?
page_fault+0x25/0x30
May 10 20:49:42 h4 kernel: [586068.444002] Code: 00 89 b5 60 ff ff ff 89
85 5c ff ff ff eb 56 66 0f 1f 44 00 00 31 d2 48 89 11 48 8b 85 70 ff ff ff
66 ff 00 66 66 90 fb 66 66 90 <66> 66 90 48 83 39 00 75 21 80 bd 67 ff ff
ff 00 74 18 4d 63 f6
May 10 20:49:42 h4 kernel: [586068.444002] Call Trace:
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff811403fe>] ?
shrink_zone+0x2de/0x9a0
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff8109fd79>] ?
ktime_get_ts+0xa9/0xe0
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff81140d54>] ?
do_try_to_free_pages+0x294/0x7f0
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff811413f7>] ?
try_to_free_gang_pages+0x77/0xf0
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff8113e040>] ?
isolate_pages_global+0x0/0x520
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff810a9845>] ?
ub_try_to_free_pages+0x45/0x130
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff810a999b>] ?
__ub_check_ram_limits+0x6b/0x90
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff81153c25>] ?
__do_fault+0x565/0x600
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff8119c35e>] ?
__link_path_walk+0x88e/0x1060
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff81153db9>] ?
handle_pte_fault+0xf9/0xd00
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff811ad180>] ?
mntput_no_expire+0x30/0x110
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff811ad180>] ?
mntput_no_expire+0x30/0x110
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff81154ba4>] ?
handle_mm_fault+0x1e4/0x2b0
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff8119a2e5>] ?
putname+0x35/0x50
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff81042aa9>] ?
__do_page_fault+0x139/0x480
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff811922b4>] ?
cp_new_stat+0xe4/0x100
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff814e4a7e>] ?
do_page_fault+0x3e/0xa0
May 10 20:49:42 h4 kernel: [586068.444002] [<ffffffff814e1e25>] ?
page_fault+0x25/0x30
======
* Re: file corruption issue
From: Ben Myers @ 2012-05-11 16:50 UTC (permalink / raw)
To: Patrick Shirkey; +Cc: xfs
On Fri, May 11, 2012 at 03:27:02AM +0200, Patrick Shirkey wrote:
> I have some HP machines running CentOS:
>
> kernel 2.6.32-042stab049.6
> AMD Opteron(tm) Processor 6180 SE
> RAM: 528 GB
> RAID bus controller: Hewlett-Packard Company Smart Array G6 controllers
>
> We have experienced some kernel crashes, caused by a kernel bug with
> RAM interleaving on this hardware, which require a hard reset of the
> machines.
>
> After reboot we are finding severe file corruption on the XFS filesystem:
> TBs of read-only databases are getting partially or fully truncated.
>
> Has anyone come across this or anything similar?
This rings a bell for me but I can't be certain. Could you provide a metadump?
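Something along these lines would do it -- a rough sketch, assuming the
filesystem is on /dev/sdb1 (substitute your real device) and can be
unmounted while the dump is taken:
  # dump metadata only (no file contents); -g reports progress
  xfs_metadump -g /dev/sdb1 /tmp/h4-metadump.img
  # metadumps compress very well, so gzip before posting
  gzip /tmp/h4-metadump.img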
-Ben
* Re: file corruption issue
From: Patrick Shirkey @ 2012-05-14 1:45 UTC (permalink / raw)
To: Ben Myers; +Cc: xfs
On Fri, May 11, 2012 6:50 pm, Ben Myers wrote:
> On Fri, May 11, 2012 at 03:27:02AM +0200, Patrick Shirkey wrote:
>> I have some HP machines running CentOS:
>>
>> kernel 2.6.32-042stab049.6
>> AMD Opteron(tm) Processor 6180 SE
>> RAM: 528 GB
>> RAID bus controller: Hewlett-Packard Company Smart Array G6 controllers
>>
>> We have experienced some kernel crashes, caused by a kernel bug with
>> RAM interleaving on this hardware, which require a hard reset of the
>> machines.
>>
>> After reboot we are finding severe file corruption on the XFS filesystem:
>> TBs of read-only databases are getting partially or fully truncated.
>>
>> Has anyone come across this or anything similar?
>
> This rings a bell for me but I can't be certain. Could you provide a
> metadump?
>
The machines are live so we have already restored the data several times.
Will a metadump from the existing filesystem be useful, or do you need one
post-crash?
--
Patrick Shirkey
Boost Hardware Ltd
* Re: file corruption issue
From: Ben Myers @ 2012-05-14 14:29 UTC (permalink / raw)
To: Patrick Shirkey; +Cc: xfs
Hey Patrick,
On Mon, May 14, 2012 at 03:45:06AM +0200, Patrick Shirkey wrote:
>
> On Fri, May 11, 2012 6:50 pm, Ben Myers wrote:
> > On Fri, May 11, 2012 at 03:27:02AM +0200, Patrick Shirkey wrote:
> >> I have some HP machines running CentOS:
> >>
> >> kernel 2.6.32-042stab049.6
> >> AMD Opteron(tm) Processor 6180 SE
> >> RAM: 528 GB
> >> RAID bus controller: Hewlett-Packard Company Smart Array G6 controllers
> >>
> >> We have experienced some kernel crashes, caused by a kernel bug with
> >> RAM interleaving on this hardware, which require a hard reset of the
> >> machines.
> >>
> >> After reboot we are finding severe file corruption on the XFS
> >> filesystem: TBs of read-only databases are getting partially or fully
> >> truncated.
> >>
> >> Has anyone come across this or anything similar?
> >
> > This rings a bell for me but I can't be certain. Could you provide a
> > metadump?
> >
>
> The machines are live so we have already restored the data several times.
> Will a metadump from the existing filesystem be useful, or do you need one
> post-crash?
Well... one of each would be best. It might be helpful to compare the block
map from before the crash with the block map after the crash for one of the
read-only corrupted databases.
Regards,
Ben
* Re: file corruption issue
From: Patrick Shirkey @ 2012-05-15 0:58 UTC (permalink / raw)
To: xfs
On Mon, May 14, 2012 4:29 pm, Ben Myers wrote:
> Hey Patrick,
>
> On Mon, May 14, 2012 at 03:45:06AM +0200, Patrick Shirkey wrote:
>>
>> On Fri, May 11, 2012 6:50 pm, Ben Myers wrote:
>> > On Fri, May 11, 2012 at 03:27:02AM +0200, Patrick Shirkey wrote:
>> >> I have some HP machines running CentOS:
>> >>
>> >> kernel 2.6.32-042stab049.6
>> >> AMD Opteron(tm) Processor 6180 SE
>> >> RAM: 528 GB
>> >> RAID bus controller: Hewlett-Packard Company Smart Array G6 controllers
>> >>
>> >> We have experienced some kernel crashes, caused by a kernel bug with
>> >> RAM interleaving on this hardware, which require a hard reset of the
>> >> machines.
>> >>
>> >> After reboot we are finding severe file corruption on the XFS
>> >> filesystem: TBs of read-only databases are getting partially or fully
>> >> truncated.
>> >>
>> >> Has anyone come across this or anything similar?
>> >
>> > This rings a bell for me but I can't be certain. Could you provide a
>> > metadump?
>> >
>>
>> The machines are live so we have already restored the data several times.
>> Will a metadump from the existing filesystem be useful, or do you need one
>> post-crash?
>
> Well... one of each would be best. It might be helpful to compare the
> block map from before the crash with the block map after the crash for
> one of the read-only corrupted databases.
>
Unfortunately I cannot unmount the partition(s) to run xfs_metadump because
they are in use.
I have found some files that were truncated in a recent crash. Is there
any tool I can run on those files to get info that might be useful?
--
Patrick Shirkey
Boost Hardware Ltd
* Re: file corruption issue
From: Ben Myers @ 2012-05-15 15:13 UTC (permalink / raw)
To: Patrick Shirkey; +Cc: xfs
On Tue, May 15, 2012 at 02:58:42AM +0200, Patrick Shirkey wrote:
> Unfortunately I cannot unmount the partition(s) to run xfs_metadump
> because they are in use.
>
> I have found some files that were truncated in a recent crash. Is there
> any tool I can run on those files to get info that might be useful?
Hrm.. xfs_bmap output could be helpful so we can see the block map. Do you
know how big they are supposed to be? How much was truncated?
Unfortunately, since you don't know which database will have the
corruption, you'll need to get xfs_bmap output for all of them, and then
capture the 'after' following a crash. Is that a possibility?
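Roughly like this -- a sketch assuming the databases live under /data/db
(adjust the path to your layout):
  # capture a 'before' block map for every file
  find /data/db -type f -print0 | xargs -0 xfs_bmap -v > bmap-before.txt
  # ...after a crash and reboot, capture the 'after' and compare
  find /data/db -type f -print0 | xargs -0 xfs_bmap -v > bmap-after.txt
  diff bmap-before.txt bmap-after.txt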
-Ben
* Re: file corruption issue
From: Patrick Shirkey @ 2012-05-16 2:30 UTC (permalink / raw)
To: Ben Myers; +Cc: xfs
On Tue, May 15, 2012 5:13 pm, Ben Myers wrote:
>
> On Tue, May 15, 2012 at 02:58:42AM +0200, Patrick Shirkey wrote:
>> Unfortunately I cannot unmount the partition(s) to run xfs_metadump
>> because they are in use.
>>
>> I have found some files that were truncated in a recent crash. Is there
>> any tool I can run on those files to get info that might be useful?
>
> Hrm.. xfs_bmap output could be helpful so we can see the block map. Do
> you know how big they are supposed to be? How much was truncated?
>
The files that we have as examples were originally 28 bytes but are now
0 bytes.
Running xfs_bmap on the 0 byte file returns "no extents".
ex.
These files are located next to each other in the same folder.
- 28 byte file:
 EXT: FILE-OFFSET  BLOCK-RANGE               AG AG-OFFSET               TOTAL
   0: [0..7]:      28230136440..28230136447  13 (312849120..312849127)      8
- 0 byte file: no extents
> Unfortunately, since you don't know which database will have the
> corruption, you'll need to get xfs_bmap output for all of them, and then
> capture the 'after' following a crash. Is that a possibility?
>
I'll try to get some more data.
- Separately, I was able to run xfs_metadump against one of our partitions.
The resulting file is 1.4 GB, and it also has some potentially sensitive
information in it, so I am not sure about posting it to a public location.
Is there anything I can look for that might be useful?
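(Regarding the sensitive information: the xfs_metadump man page appears to
say that file and attribute names are obfuscated by default unless -o is
passed, so a dump taken like this may already be safe to share:
  # no -o, so names should come out obfuscated; /dev/sdX1 is a placeholder
  xfs_metadump /dev/sdX1 /tmp/metadump.img
I have not verified the obfuscation myself.)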
I have some data from xfs_bmap on specific files located in the same
partition the metadump was generated from. I'm not sure that will actually
give us any details that can help, though, as this data is all post-crash
at the moment.
- A few more details that may be relevant.
1: We are running OpenVZ and LVM on these machines. Are there any known
issues with file corruption after a hard reset when OpenVZ/LVM are running?
2: We have observed that, while there is no obvious pattern in the data
corruption, it does happen in chunks. It appears to be random chunks of
files that are corrupted after a crash->reset sequence.
--
Patrick Shirkey
Boost Hardware Ltd
* Re: file corruption issue
From: Ben Myers @ 2012-05-24 15:33 UTC (permalink / raw)
To: Patrick Shirkey; +Cc: xfs
Hey Patrick,
On Wed, May 16, 2012 at 04:30:47AM +0200, Patrick Shirkey wrote:
> On Tue, May 15, 2012 5:13 pm, Ben Myers wrote:
> > On Tue, May 15, 2012 at 02:58:42AM +0200, Patrick Shirkey wrote:
> >> Unfortunately I cannot unmount the partition(s) to run xfs_metadump
> >> because they are in use.
> >>
> >> I have found some files that were truncated in a recent crash. Is there
> >> any tool I can run on those files to get info that might be useful?
> >
> > Hrm.. xfs_bmap output could be helpful so we can see the block map. Do
> > you know how big they are supposed to be? How much was truncated?
> >
>
> The files that we have as examples were originally 28 bytes but are now
> 0 bytes.
>
> Running xfs_bmap on the 0 byte file returns "no extents".
>
> ex.
>
> These files are located next to each other in the same folder.
>
> - 28 byte file:
>  EXT: FILE-OFFSET  BLOCK-RANGE               AG AG-OFFSET               TOTAL
>    0: [0..7]:      28230136440..28230136447  13 (312849120..312849127)      8
>
> - 0 byte file: no extents
So how old are the files that get truncated? Were they created very recently?
> - A few more details that may be relevant.
>
> 1: We are running OpenVZ and LVM on these machines. Are there any known
> issues with file corruption after a hard reset when OpenVZ/LVM are running?
I don't know about OpenVZ/LVM...
> 2: We have observed that, while there is no obvious pattern in the data
> corruption, it does happen in chunks. It appears to be random chunks of
> files that are corrupted after a crash->reset sequence.
...and the data corruption happened in files that are read-only? Again:
when were they created?
Thanks,
Ben
* Re: file corruption issue
From: Patrick Shirkey @ 2012-05-24 21:46 UTC (permalink / raw)
To: Ben Myers; +Cc: xfs
On Thu, May 24, 2012 5:33 pm, Ben Myers wrote:
> Hey Patrick,
>
> On Wed, May 16, 2012 at 04:30:47AM +0200, Patrick Shirkey wrote:
>> On Tue, May 15, 2012 5:13 pm, Ben Myers wrote:
>> > On Tue, May 15, 2012 at 02:58:42AM +0200, Patrick Shirkey wrote:
>> >> Unfortunately I cannot unmount the partition(s) to run xfs_metadump
>> >> because they are in use.
>> >>
>> >> I have found some files that were truncated in a recent crash. Is
>> >> there any tool I can run on those files to get info that might be
>> >> useful?
>> >
>> > Hrm.. xfs_bmap output could be helpful so we can see the block map.
>> > Do you know how big they are supposed to be? How much was truncated?
>> >
>>
>> The files that we have as examples were originally 28 bytes but are now
>> 0 bytes.
>>
>> Running xfs_bmap on the 0 byte file returns "no extents".
>>
>> ex.
>>
>> These files are located next to each other in the same folder.
>>
>> - 28 byte file:
>>  EXT: FILE-OFFSET  BLOCK-RANGE               AG AG-OFFSET               TOTAL
>>    0: [0..7]:      28230136440..28230136447  13 (312849120..312849127)      8
>>
>> - 0 byte file: no extents
>
> So how old are the files that get truncated? Were they created very
> recently?
>
Most of the corrupted files were created weeks earlier, and many of them
have been through several reboots before getting truncated. We have seen it
on SSDs and HDDs in multiple different partitions, all of them XFS.
We thought it might be something to do with file descriptors, but the
amount of data corruption and the random blockiness suggests it is a
different issue.
We do not see the problem on any other hardware; it seems to be directly
related to these HP machines. To replicate it we just need to pull the
plug, and we see data loss/corruption across the drives. It may be a
hardware issue with the motherboard and RAID controller spiking the disks?
HP have installed a firmware upgrade, which didn't make any difference.
At this stage we are trying to understand how this could happen, and would
like to rule out or confirm whether it is possible in XFS. Some of our
research suggests there are known issues with XFS under some high-load
conditions, but we don't want to point the finger unless we are sure.
>> - A few more details that may be relevant.
>>
>> 1: We are running OpenVZ and LVM on these machines. Are there any known
>> issues with file corruption after a hard reset when OpenVZ/LVM are running?
>
> I don't know about OpenVZ/LVM...
>
>> 2: We have observed that, while there is no obvious pattern in the data
>> corruption, it does happen in chunks. It appears to be random chunks of
>> files that are corrupted after a crash->reset sequence.
>
> ...and the data corruption happened in files that are read-only? Again:
> when were they created?
>
> Thanks,
> Ben
>
--
Patrick Shirkey
Boost Hardware Ltd
* Re: file corruption issue
From: Igor M Podlesny @ 2012-06-01 12:38 UTC (permalink / raw)
To: xfs
On Tue, 15 May 2012 02:58:42 +0200 (CEST)
pshirkey at boosthardware.com (Patrick Shirkey) wrote:
> On Mon, May 14, 2012 4:29 pm, Ben Myers wrote:
> > Hey Patrick,
[...]
> > Well... one of each would be best. It might be helpful to compare the
> > block map from before the crash with the block map after the crash for
> > one of the read-only corrupted databases.
> >
>
> Unfortunately I cannot unmount the partition(s) to run xfs_metadump
> because they are in use.
[...]
Why not, if it's on LVM? You can just snapshot it and then mount/unmount
the snapshot instead (see man mount for the XFS option that gets past its
precautionary duplicate-UUID check on the snapshot).
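For example -- a rough sketch, assuming the filesystem sits on a logical
volume /dev/vg0/data and the volume group has free space for the snapshot:
  # take a point-in-time snapshot of the live volume
  lvcreate -s -n datasnap -L 10G /dev/vg0/data
  # nouuid lets XFS mount a second filesystem with a duplicate UUID;
  # mounting replays the log so the snapshot is clean afterwards
  mount -o nouuid /dev/vg0/datasnap /mnt/snap
  umount /mnt/snap
  # dump metadata from the snapshot while the original stays in use
  xfs_metadump -g /dev/vg0/datasnap /tmp/metadump.img
  lvremove /dev/vg0/datasnap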
--