* [linux-lvm] LVM snapshot merge and corrupted file
From: Guilherme Moro @ 2013-12-02 11:41 UTC
To: linux-lvm
Hi,
I know this is too broad a question, but please be kind ;)
The scenario:
On RHEL 6.2, snapshot a disk mounted over device-mapper multipath.
Upgrade the system to RHEL 6.4.
Merge the snapshot to return the system to its previous state.
The system becomes unstable and reboots cyclically (not reaching user level,
or at least the logs don't show it).
We spotted a file with roughly 1200 bytes corrupted (mostly turned to 0).
Sadly, I got called to the machine too late to recover the console
output of the reboot (it's a blade and no console logging was
configured), so I couldn't figure out whether a hardware failure happened.
As I don't have proper logs to investigate further, my questions are:
- Are there any known issues around snapshotting under these conditions
(RHEL 6.2 -> RHEL 6.4, multipath)?
- Is there any chance of this being a software failure (bug?), and does
the restore procedure warn me in the logs (/var/log/messages?) about
any failure during the restore (even if hardware related)?
My main suspicion for now is a hardware failure somewhere, but I was
kindly asked to be sure that this can't be a bug.
Any thoughts or pointers (docs, pieces of code, testing reports) would
be appreciated, so don't be shy :)
Regards,
Guilherme Moro
PS: Does Red Hat, or somebody else, do any kind of continuous integration
testing on LVM?
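
For readers following the scenario in this message, the snapshot and merge steps can be sketched as below. This is only an illustration, not the poster's actual procedure: the volume group, logical volume, snapshot name and size (`vg0`, `root`, `pre_upgrade`, `5G`) are hypothetical, and the helper functions are made up for this example. The only LVM calls assumed are `lvcreate --snapshot` and `lvconvert --merge`.

```python
import subprocess


def snapshot_cmd(vg, lv, snap, size):
    """Build the lvcreate command that snapshots vg/lv before the upgrade."""
    return ["lvcreate", "--snapshot", "--size", size, "--name", snap,
            f"{vg}/{lv}"]


def merge_cmd(vg, snap):
    """Build the lvconvert command that merges the snapshot back (rollback)."""
    return ["lvconvert", "--merge", f"{vg}/{snap}"]


def run(cmd, dry_run=True):
    """Print the command in dry-run mode; otherwise execute it as root."""
    if dry_run:
        print(" ".join(cmd))
        return 0
    return subprocess.run(cmd, check=True).returncode


if __name__ == "__main__":
    # Hypothetical names; nothing is executed while dry_run is True.
    run(snapshot_cmd("vg0", "root", "pre_upgrade", "5G"))
    # ... upgrade from RHEL 6.2 to 6.4 would happen here ...
    run(merge_cmd("vg0", "pre_upgrade"))
```

Note that, according to the lvconvert man page, merging into an origin that cannot be closed (a root file system, for example) is deferred until the origin is next activated, so in that case the merge actually runs during a later boot.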
* Re: [linux-lvm] LVM snapshot merge and corrupted file
From: Mike Snitzer @ 2013-12-02 14:39 UTC
To: Guilherme Moro; +Cc: linux-lvm
On Mon, Dec 02 2013 at 6:41am -0500,
Guilherme Moro <guilherme.moro@gmail.com> wrote:
> Hi,
>
> I know this is too broad a question, but please be kind ;)
> The scenario:
> On RHEL 6.2, snapshot a disk mounted over device-mapper multipath.
> Upgrade the system to RHEL 6.4.
> Merge the snapshot to return the system to its previous state.
> The system becomes unstable and reboots cyclically (not reaching user level,
> or at least the logs don't show it).
> We spotted a file with roughly 1200 bytes corrupted (mostly turned to 0).
Was the first rollback attempt done in production?
> Sadly, I got called to the machine too late to recover the console
> output of the reboot (it's a blade and no console logging was
> configured), so I couldn't figure out whether a hardware failure happened.
>
> As I don't have proper logs to investigate further, my questions are:
>
> - Are there any known issues around snapshotting under these conditions
> (RHEL 6.2 -> RHEL 6.4, multipath)?
Not aware of any.
> - Is there any chance of this being a software failure (bug?), and does
> the restore procedure warn me in the logs (/var/log/messages?) about
> any failure during the restore (even if hardware related)?
>
> My main suspicion for now is a hardware failure somewhere, but I was
> kindly asked to be sure that this can't be a bug.
>
> Any thoughts or pointers (docs, pieces of code, testing reports) would
> be appreciated, so don't be shy :)
The lvm2 testsuite has support for testing snapshot-merge, but it
doesn't test layering a snapshot on top of multipath.
Without context (e.g. logs) for what happened it is really hard to say
definitively whether or not you hit some software bug or if your problem
was hardware failure like you suspect.
* Re: [linux-lvm] LVM snapshot merge and corrupted file
From: Guilherme Moro @ 2013-12-02 15:46 UTC
To: Mike Snitzer; +Cc: linux-lvm
Hi,
Thanks for the response.
On Mon, Dec 2, 2013 at 2:39 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Mon, Dec 02 2013 at 6:41am -0500,
> Guilherme Moro <guilherme.moro@gmail.com> wrote:
>
>> Hi,
>>
>> I know this is too broad a question, but please be kind ;)
>> The scenario:
>> On RHEL 6.2, snapshot a disk mounted over device-mapper multipath.
>> Upgrade the system to RHEL 6.4.
>> Merge the snapshot to return the system to its previous state.
>> The system becomes unstable and reboots cyclically (not reaching user level,
>> or at least the logs don't show it).
>> We spotted a file with roughly 1200 bytes corrupted (mostly turned to 0).
>
> Was the first rollback attempt done in production?
No, this is a test system, and the actual procedure was tested dozens
of times without any issue (we never checksummed the files, but the
system never ended up in a failed state before), which is why we think
it is probably hardware related.
>
>> Sadly, I got called to the machine too late to recover the console
>> output of the reboot (it's a blade and no console logging was
>> configured), so I couldn't figure out whether a hardware failure happened.
>>
>> As I don't have proper logs to investigate further, my questions are:
>>
>> - Are there any known issues around snapshotting under these conditions
>> (RHEL 6.2 -> RHEL 6.4, multipath)?
>
> Not aware of any.
This is great, the main reason for the e-mail was to confirm that no
known issue exists.
>
>> - Is there any chance of this being a software failure (bug?), and does
>> the restore procedure warn me in the logs (/var/log/messages?) about
>> any failure during the restore (even if hardware related)?
>>
>> My main suspicion for now is a hardware failure somewhere, but I was
>> kindly asked to be sure that this can't be a bug.
>>
>> Any thoughts or pointers (docs, pieces of code, testing reports) would
>> be appreciated, so don't be shy :)
>
> The lvm2 testsuite has support for testing snapshot-merge, but it
> doesn't test layering a snapshot on top of multipath.
I assumed so, just confirming :)
>
> Without context (e.g. logs) for what happened it is really hard to say
> definitively whether or not you hit some software bug or if your problem
> was hardware failure like you suspect.
A snippet of the messages log is here: http://pastebin.com/3k1y358N
I couldn't spot anything odd in it, besides the fact that the logs
don't go past that point until some 4 hours later. (The syslog error
goes away after 2 hours; probably the right file gets delivered by puppet
in the meantime, I don't know how though, but even that is not enough to
get logs beyond that point immediately.) Anyway, I didn't send the logs
before because they seem useless :)
Just on the other question: does LVM produce any output if things
go wrong during the restore?
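
One place such a failure might show up is the device-mapper status of the origin while it is merging. The sketch below classifies a single `dmsetup status` line. It assumes the upstream dm-snapshot behaviour, where the status field reads `Invalid` for an invalidated snapshot, `Merge failed` for a failed merge, and `<used>/<total> <metadata>` sector counts otherwise; whether the RHEL 6 kernels discussed here report exactly these strings has not been checked. The function name and the sample lines in the usage note are hypothetical.

```python
def classify_snapshot_status(status_line):
    """Classify one `dmsetup status` line for a snapshot or snapshot-merge target.

    Assumed line shape: "[name:] <start> <length> <target> <status...>".
    """
    fields = status_line.split()
    # `dmsetup status` without a device argument prefixes each line with "name:".
    if fields and fields[0].endswith(":"):
        fields = fields[1:]
    if len(fields) < 4 or fields[2] not in ("snapshot", "snapshot-merge"):
        return "not-a-snapshot"
    status = " ".join(fields[3:])
    if status == "Invalid":
        return "invalid"
    if status == "Merge failed":
        return "merge-failed"
    # Healthy targets report "<used>/<total> <metadata>" in sectors.
    used, sep, total = fields[3].partition("/")
    if sep and used.isdigit() and total.isdigit():
        return "ok"
    return "unknown"
```

For example, `classify_snapshot_status("0 8388608 snapshot-merge Merge failed")` returns `"merge-failed"`, while a line such as `"0 8388608 snapshot-merge 16/2097152 16"` returns `"ok"`. A CI job could poll the origin device this way during the restore and fail the run on anything other than `"ok"`.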
We are hooking a snapshot -> upgrade -> restore test into our CI,
with proper file checksums in place, so let's see if we can ever
reproduce it in normal operation.
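
A minimal sketch of what the checksum part of such a test could look like, assuming SHA-256 over every regular file under a mount point. The function names are hypothetical and this is not the poster's actual CI code.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only file contents are compared.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def diff_manifests(before, after):
    """Return sorted lists of (changed, missing, extra) relative paths."""
    changed = sorted(p for p in before if p in after and before[p] != after[p])
    missing = sorted(p for p in before if p not in after)
    extra = sorted(p for p in after if p not in before)
    return changed, missing, extra
```

The idea would be to call `build_manifest` right before taking the snapshot, keep the result somewhere outside the snapshotted volume, call it again once the merge has finished, and fail the test if `diff_manifests` reports anything. A file with about 1200 bytes zeroed, as in the report above, would show up in the `changed` list.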
Regards,
Guilherme Moro