[Bug 118511] New: Corruption of VM qcow2 image file on EXT4 with crypto enabled

linux-ext4.vger.kernel.org archive mirror

* [Bug 118511] New: Corruption of VM qcow2 image file on EXT4 with crypto enabled
@ 2016-05-19 14:50 bugzilla-daemon
  2016-05-24 14:50 ` [Bug 118511] " bugzilla-daemon
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: bugzilla-daemon @ 2016-05-19 14:50 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=118511

            Bug ID: 118511
           Summary: Corruption of VM qcow2 image file on EXT4 with crypto
                    enabled
           Product: File System
           Version: 2.5
    Kernel Version: 4.5.3
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: ext4
          Assignee: fs_ext4@kernel-bugs.osdl.org
          Reporter: ass3mbler@gmail.com
        Regression: No

Created attachment 216801
  --> https://bugzilla.kernel.org/attachment.cgi?id=216801&action=edit
Hypervisor kernel config file

Hello,

Twice in 48 hours I have experienced file system corruption in a QCOW2 image
file running a Linux guest.

My configuration is the following:

Hypervisor:  
  - Gentoo Linux with a pure kernel 4.5.3, compiled manually.
  - QEMU 2.8.0 + KVM
  - /dev/md4, raid 1 with two identical partitions (/dev/sda4 and /dev/sdb4),
ext4
  - /dev/md4 is mounted under /mnt/md4 and it contains a single dir
/mnt/md4/kvm, encrypted  
  - after decrypting the /mnt/md4/kvm dir, it is bind-mounted on /kvm (mount
--bind /mnt/md4/ /kvm)
  - nothing else is actually running on the hypervisor, only an openssh server

Guest:
  - Gentoo Linux with a pure kernel 4.5.4, compiled manually
  - virtio drivers for disk, networking etc.
  - the whole image of the guest is a 250GB QCOW2 file, stored under
/kvm/xxx.qcow2 in the hypervisor's filesystem
  - the root partition is /dev/sda2 (about 230GB), EXT3

I'm running this configuration successfully on many other (even very busy)
deployments without any problem; the only difference in this installation is
the encrypted /mnt/md4/kvm directory on the hypervisor.

Twice in the last 48 hours I've found the root filesystem of the guest
(/dev/sda2) remounted read-only after a write error was detected. Here is
the log from dmesg:

[[Guest]]
[208323.124266] blk_update_request: critical target error, dev sda, sector
231060144
[208323.124540] Aborting journal on device sda2-8.
[208323.729847] EXT4-fs error (device sda2): ext4_journal_check_start:56:
Detected aborted journal
[208323.729855] EXT4-fs (sda2): Remounting filesystem read-only
[208323.740861] EXT4-fs error (device sda2): ext4_journal_check_start:56:
Detected aborted journal
[208323.772340] EXT4-fs error (device sda2): ext4_journal_check_start:56:
Detected aborted journal
[208323.772346] EXT4-fs error (device sda2): ext4_journal_check_start:56:
Detected aborted journal
[208323.773233] EXT4-fs error (device sda2): ext4_journal_check_start:56:
Detected aborted journal

At the same time, the hypervisor's dmesg shows only this line:

[[Hypervisor]]
[596477.535490] ext4_bio_write_page: ret = -12

After that, I have to reboot the guest. I booted the guest from a Gentoo ISO
and ran fsck on the root partition (/dev/sda2). This is the output:

e2fsck 1.42.13 (17-May-2015)
/dev/sda2: recovering journal
/dev/sda2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 5709828 has zero dtime. Fix<y>? yes
Inodes that were part of a corrupted orphan linked list found. Fix<y>? yes
Inode 5709829 was part of the orphaned inode list. FIXED.
Inode 5709830 was part of the orphaned inode list. FIXED.
Inode 5709831 was part of the orphaned inode list. FIXED.
Inode 5709832 was part of the orphaned inode list. FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (25234981, counted=22539957).
Fix<y>? yes
Inode bitmap differences: -(5709828--5709832)
Fix<y>? yes
Free inodes count wrong for group #697 (8175, counted=8180).
Fix<y>? yes
Free inodes count wrong (11008395, counted=10993791).
Fix<y>? yes
/dev/sda2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda2: 3424129/14417920 files (3.9% non-contiguous), 35131723/57671680
block

I attach the .config of the Hypervisor kernel.

Thank you in advance and best regards,

Andrew

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


* [Bug 118511] Corruption of VM qcow2 image file on EXT4 with crypto enabled
  2016-05-19 14:50 [Bug 118511] New: Corruption of VM qcow2 image file on EXT4 with crypto enabled bugzilla-daemon
@ 2016-05-24 14:50 ` bugzilla-daemon
  2016-05-25 17:45 ` bugzilla-daemon
  2016-05-30 16:46 ` bugzilla-daemon
  2 siblings, 0 replies; 4+ messages in thread
From: bugzilla-daemon @ 2016-05-24 14:50 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=118511

Navin <navinp1912@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |navinp1912@gmail.com

--- Comment #1 from Navin <navinp1912@gmail.com> ---
Can you check whether memory on the host/hypervisor was full around the
dmesg timestamp range [596476, 596478], i.e. at 596477.535490, when
ext4_bio_write_page was not able to get memory?
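
As a rough illustration of the check described above, the following Python
sketch filters dmesg-style lines to a timestamp window. The function name
lines_in_window and the regular expression are assumptions made for this
example; the two sample lines are copied from the logs in this report.

```python
import re

# Matches the "[seconds.microseconds] message" prefix that dmesg prints.
DMESG_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def lines_in_window(lines, start, end):
    """Return (timestamp, message) pairs whose timestamp lies in [start, end]."""
    hits = []
    for line in lines:
        match = DMESG_LINE.match(line)
        if match is None:
            continue  # wrapped continuation line or other non-timestamped text
        stamp = float(match.group(1))
        if start <= stamp <= end:
            hits.append((stamp, match.group(2)))
    return hits


# Sample input: one guest line and the hypervisor line quoted in this report.
sample = [
    "[208323.124540] Aborting journal on device sda2-8.",
    "[596477.535490] ext4_bio_write_page: ret = -12",
]

for stamp, message in lines_in_window(sample, 596476.0, 596478.0):
    print(stamp, message)
```

On a live system the input lines would come from the hypervisor's own dmesg
output rather than from a hard-coded list.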

If your hypervisor is using encryption then this patch may help (already
present in 4.6 mainline)

https://patchwork.ozlabs.org/patch/602204/

If that doesn't work, you need to log your system stats and check when
ENOMEM is returned. The system could genuinely be out of memory, or there
could be something wrong with the code.
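
One possible way to log such stats is sketched below in Python. It assumes a
Linux host where /proc/meminfo and /proc/uptime are readable; the helper names
parse_meminfo and log_memory, the sampling interval and the choice of fields
are illustrative assumptions, not something prescribed in this thread. Each
sample is tagged with seconds since boot so that it can be lined up with
dmesg timestamps.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def log_memory(interval=1.0, samples=3):
    """Print a few memory samples, each tagged with seconds since boot."""
    for _ in range(samples):
        with open("/proc/uptime") as handle:
            uptime = float(handle.read().split()[0])
        with open("/proc/meminfo") as handle:
            info = parse_meminfo(handle.read())
        # MemAvailable may be missing on very old kernels, hence .get().
        print("[%.2f] MemFree=%s kB MemAvailable=%s kB"
              % (uptime, info.get("MemFree"), info.get("MemAvailable")))
        time.sleep(interval)
```

Calling log_memory() with a larger samples value and redirecting the output to
a file gives a record that can be compared with the time at which
ext4_bio_write_page reports its error.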

The hypervisor/host cannot write/commit/allocate buffers because it is out of
memory.

Hence your guest is in a transient state where the changes are not committed
and most probably the journal is aborted.


[[Hypervisor]]
[596477.535490] ext4_bio_write_page: ret = -12

http://lxr.free-electrons.com/source/include/uapi/asm-generic/errno-base.h#L15

 15 #define ENOMEM          12      /* Out of memory */
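
The same lookup can be done without reading the header. The Python sketch
below maps a negative kernel return value to its errno name using the
standard errno and os modules; the helper name describe_kernel_error is made
up for this example, and the descriptive text returned by os.strerror depends
on the platform's C library.

```python
import errno
import os


def describe_kernel_error(ret):
    """Map a negative kernel return value such as -12 to (errno name, description)."""
    code = -ret  # kernel functions return errors as negated errno values
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, text = describe_kernel_error(-12)
print(name, "-", text)
```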


* [Bug 118511] Corruption of VM qcow2 image file on EXT4 with crypto enabled
  2016-05-19 14:50 [Bug 118511] New: Corruption of VM qcow2 image file on EXT4 with crypto enabled bugzilla-daemon
  2016-05-24 14:50 ` [Bug 118511] " bugzilla-daemon
@ 2016-05-25 17:45 ` bugzilla-daemon
  2016-05-30 16:46 ` bugzilla-daemon
  2 siblings, 0 replies; 4+ messages in thread
From: bugzilla-daemon @ 2016-05-25 17:45 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=118511

--- Comment #2 from ass3mbler@gmail.com ---
Hi Navin,

thank you very much for your help. I'll upgrade the kernel to v4.6 tonight to
see if it gets better!

I have some doubts about a real out-of-memory condition, since the hypervisor
has 8GB of RAM, the guest is hard-limited to 4GB and the only other "big"
process running on the hypervisor is a simple (mostly idle) OpenSSH server
instance, so I really hope that the patch will solve the issue.

I'll let you know very shortly. Thank you again for your valuable help and best
regards,

Andrew



(In reply to Navin from comment #1)
> Can you check the memory state of Host/Hypervisor if is full around the
> range [596476,596478.]  so that 596477.535490 when ext4_bio_write_page is
> not able to get memory ? 
> 
> If your hypervisor is using encryption then this patch may help (already
> present in 4.6 mainline)
> 
> https://patchwork.ozlabs.org/patch/602204/
> 
> If that doesn't work then ,You need to your system stats logged and check
> and check when ENOMEM is returned. It could genuinely out of memory or there
> could be something wrong with code.
> 
> Hypervisor/Host cannot write/commit/allocate buffers because it is out of
> memory. 
> 
> Hence your guest is in a transient state where the change are not committed
> and most probably journal is aborted.
> 
> 
> [[Hypervisor]]
> [596477.535490] ext4_bio_write_page: ret = -12
> 
> http://lxr.free-electrons.com/source/include/uapi/asm-generic/errno-base.h#L15
> 
>  15 #define ENOMEM          12      /* Out of memory */


* [Bug 118511] Corruption of VM qcow2 image file on EXT4 with crypto enabled
  2016-05-19 14:50 [Bug 118511] New: Corruption of VM qcow2 image file on EXT4 with crypto enabled bugzilla-daemon
  2016-05-24 14:50 ` [Bug 118511] " bugzilla-daemon
  2016-05-25 17:45 ` bugzilla-daemon
@ 2016-05-30 16:46 ` bugzilla-daemon
  2 siblings, 0 replies; 4+ messages in thread
From: bugzilla-daemon @ 2016-05-30 16:46 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=118511

ass3mbler@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |PATCH_ALREADY_AVAILABLE

--- Comment #3 from ass3mbler@gmail.com ---
Hi Navin,

I can confirm that moving to kernel 4.6 following your suggestion fully solved
the issue.

Thank you very much for pointing me in the right direction. I am marking this
issue as resolved.

Best regards,

Andrew
