* Ext4 file corruption using cp
From: Roger Niva @ 2012-11-11 11:37 UTC
To: linux-ext4
Hi.
We are trying to pin down a file corruption issue we have on 5
production servers and would like some suggestions about how to proceed
to find the culprit. It may or may not be ext4-related, but as that is
the only clue we have so far, we're trying here first.
The production servers are running Slackware 13.37 with a self-compiled
kernel (no patches or external modules).
We have a script running daily that copies files from one folder to
another using cp. On occasion (once or twice a week) the
destination file will not match the original file. The first bytes of
the file will be ok, but the rest of the file will be filled with
null bytes (the file size matches, though). We had to create a loop in
the script that uses cmp to check whether the cp failed and retries if it
did. After 20-25 attempts (sleep 1 between the cps), the cp normally
succeeds.
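The workaround loop described above can be sketched roughly as follows. This is an illustration only, not the script running on the production servers: it is written in Python rather than shell, and the names `files_identical`, `run_cp` and `copy_with_verify`, the retry limit, and the injectable `copy` argument are all assumptions made for the example.

```python
import subprocess
import time


def files_identical(path_a, path_b, chunk_size=1 << 20):
    """Byte-for-byte comparison, roughly what cmp does."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk_size)
            block_b = b.read(chunk_size)
            if block_a != block_b:
                return False
            if not block_a:  # both files ended at the same point
                return True


def run_cp(src, dst):
    """Invoke the system cp (assumed to be on PATH)."""
    subprocess.run(["cp", "--", src, dst], check=True)


def copy_with_verify(src, dst, max_attempts=30, delay=1.0, copy=run_cp):
    """Copy src to dst and retry until the copy matches the source.

    Returns the number of attempts used; raises RuntimeError if no
    attempt produced a matching copy.
    """
    for attempt in range(1, max_attempts + 1):
        copy(src, dst)
        if files_identical(src, dst):
            return attempt
        time.sleep(delay)  # hypothetical pause, mirroring "sleep 1"
    raise RuntimeError(
        f"{dst} still differs from {src} after {max_attempts} attempts"
    )
```

The `copy` parameter exists only so the loop can be exercised without the real cp; by default it shells out to cp as described in the text.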
If we copy the files from ext3 to ext3, the problem goes away. If we
copy them from ext3 to ext4 or from ext4 to ext4, the files will
sometimes be corrupt.
The servers are not being rebooted and the filesystems are not being
remounted, so it's probably not linked to the recent ext4 corruption.
The kernel is x86_64, but the OS is 32-bit. The filesystems reside on
an aacraid controller (hw RAID-5) with battery backup and an SSD cache
(we tried removing the SSD, but it still failed). ext4 is mounted
with noatime,data=writeback. There are no kernel error messages and
there do not appear to be any hardware issues.
We have verified the corruption on 3.2.9 and 3.5.3. 2.6.35.6 does not
seem to be affected.
Since these are production servers (we haven't been able to reproduce
it in-house), there is only so much testing we can do, but we're
currently trying to figure out what we can do to narrow it down. I am
aware that I'm not necessarily providing much information, but at this
point we're just looking for suggestions about how to proceed to
figure out what may be the issue.
Any help would be much appreciated.
--
Kind regards,
Roger Niva
* Re: Ext4 file corruption using cp
From: Andreas Dilger @ 2012-11-11 17:50 UTC
To: Roger Niva; +Cc: linux-ext4@vger.kernel.org
On 2012-11-11, at 4:37, Roger Niva <rogerniva@gmail.com> wrote:
>
> We are trying to pin down a file corruption issue we have on 5
> production servers and would like some suggestions about how to proceed
> to find the culprit. It may or may not be ext4-related, but as that is
> the only clue we have so far, we're trying here first.
>
> The production servers are running Slackware 13.37 with a self-compiled
> kernel (no patches or external modules).
> We have a script running daily that copies files from one folder to
> another using cp.
There was a bug in the ext4 FIEMAP ioctl code in the past that interacted badly with fileutils when copying files that had just been written and were still in cache. That was around 2.6.26 or so.
You should probably try a newer version of fileutils to see if that solves the problem. Alternatively, if you run "sync" before "cp", this should also avoid the problem.
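As a rough illustration of the "sync" before "cp" workaround, here is a minimal sketch in Python. The function names `run_cp` and `synced_copy` and the injectable `copy` argument are assumptions for the example; `os.sync()` asks the kernel to write out dirty data for all filesystems, much like running sync(1).

```python
import os
import subprocess


def run_cp(src, dst):
    """Invoke the system cp (assumed to be on PATH)."""
    subprocess.run(["cp", "--", src, dst], check=True)


def synced_copy(src, dst, copy=run_cp):
    """Flush dirty pages to disk, then copy src to dst."""
    # Equivalent in spirit to running "sync" before "cp": once the
    # data is on disk, the source file's extents are allocated, so an
    # extent query should no longer see unwritten cached data as holes.
    os.sync()
    copy(src, dst)
```

A global sync can be expensive on a busy server; flushing only the source file (for example with `os.fsync()` on an open descriptor) may be a lighter alternative, though that variant is not something the thread itself tested.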
Cheers, Andreas
* Re: Ext4 file corruption using cp
From: Peng Tao @ 2012-11-13 11:33 UTC
To: Andreas Dilger; +Cc: Roger Niva, linux-ext4@vger.kernel.org
On Mon, Nov 12, 2012 at 1:50 AM, Andreas Dilger <adilger@dilger.ca> wrote:
> On 2012-11-11, at 4:37, Roger Niva <rogerniva@gmail.com> wrote:
>>
>> We are trying to pin down a file corruption issue we have on 5
>> production servers and would like some suggestions about how to proceed
>> to find the culprit. It may or may not be ext4-related, but as that is
>> the only clue we have so far, we're trying here first.
>>
>> The production servers are running Slackware 13.37 with a self-compiled
>> kernel (no patches or external modules).
>> We have a script running daily that copies files from one folder to
>> another using cp.
>
> There was a bug in the ext4 FIEMAP ioctl code in the past that interacted badly with fileutils when copying files that had just been written and were still in cache. That was around 2.6.26 or so.
>
It is commit 6d9c85eb700bd3ac59e63bb9de463dea1aca084c that went in at v2.6.39.
However, looking at ext4_fiemap(), it does seem racy. If pages are
written back between ext4_ext_find_extent() and ext4_ext_fiemap_cb(),
fiemap will report holes. This could happen when cp runs
concurrently with the background flusher, which is common on a
long-running production server. If this is true, the bug also exists in
the latest upstream.
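To illustrate why a wrongly reported hole would produce exactly the symptom in the original report (correct size, good leading bytes, null bytes afterwards), here is a simplified model in Python. It is not the cp or kernel code: the function `extent_based_copy` and its `extents` argument (a list of `(offset, length)` pairs standing in for what fiemap returns) are assumptions made for the sketch.

```python
import os


def extent_based_copy(src, dst, extents):
    """Copy only the byte ranges listed in extents.

    extents is a list of (offset, length) pairs. Ranges that are not
    listed are treated as holes and never read; the destination is
    extended to the source size, so those ranges read back as null
    bytes.
    """
    size = os.path.getsize(src)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for offset, length in extents:
            fin.seek(offset)
            fout.seek(offset)
            fout.write(fin.read(length))
        # Match the source size even if trailing ranges were skipped.
        fout.truncate(size)
```

With an accurate map such as `[(0, size)]` the copy is identical to the source. If the map covers only the first part of the file, the destination keeps the right size but everything after the last reported extent is zero-filled.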
--
Thanks,
Tao
* Re: Ext4 file corruption using cp
From: Roger Niva @ 2012-11-26 8:48 UTC
To: Andreas Dilger; +Cc: linux-ext4@vger.kernel.org
On Sun, Nov 11, 2012 at 6:50 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On 2012-11-11, at 4:37, Roger Niva <rogerniva@gmail.com> wrote:
>>
>> We are trying to pin down a file corruption issue we have on 5
>> production servers and would like some suggestions about how to proceed
>> to find the culprit. It may or may not be ext4-related, but as that is
>> the only clue we have so far, we're trying here first.
>>
>> The production servers are running Slackware 13.37 with a self-compiled
>> kernel (no patches or external modules).
>> We have a script running daily that copies files from one folder to
>> another using cp.
>
> There was a bug in the ext4 FIEMAP ioctl code in the past that interacted badly with fileutils when copying files that had just been written and were still in cache. That was around 2.6.26 or so.
>
> You should probably try a newer version of fileutils to see if that solves the problem. Alternatively, if you run "sync" before "cp", this should also avoid the problem.
>
> Cheers, Andreas
Hi.
We have now been running a newer coreutils (/bin/cp) for about a week
without seeing this pop up again, so we're reasonably sure it fixed
the issue for us.
Thanks a bunch!
--
Kind regards,
Roger Niva
* Re: Ext4 file corruption using cp
From: Theodore Ts'o @ 2012-11-26 14:24 UTC
To: Roger Niva; +Cc: Andreas Dilger, linux-ext4@vger.kernel.org
On Mon, Nov 26, 2012 at 09:48:10AM +0100, Roger Niva wrote:
>
> We have now been running a newer coreutils (/bin/cp) for about a week
> without seeing this pop up again, so we're reasonably sure it fixed
> the issue for us.
Thanks for confirming that this fixed your issue!
- Ted