* [Bug 118511] New: Corruption of VM qcow2 image file on EXT4 with crypto enabled
From: bugzilla-daemon @ 2016-05-19 14:50 UTC
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=118511
Bug ID: 118511
Summary: Corruption of VM qcow2 image file on EXT4 with crypto enabled
Product: File System
Version: 2.5
Kernel Version: 4.5.3
Hardware: All
OS: Linux
Tree: Mainline
Status: NEW
Severity: high
Priority: P1
Component: ext4
Assignee: fs_ext4@kernel-bugs.osdl.org
Reporter: ass3mbler@gmail.com
Regression: No
Created attachment 216801
--> https://bugzilla.kernel.org/attachment.cgi?id=216801&action=edit
Hypervisor kernel config file
Hello,
Twice in 48 hours I have experienced file system corruption on a QCOW2
image file running a Linux guest.
My configuration is the following:
Hypervisor:
- Gentoo Linux with a pure kernel 4.5.3, compiled manually.
- QEMU 2.8.0 + KVM
- /dev/md4, RAID 1 with two identical partitions (/dev/sda4 and /dev/sdb4),
ext4
- /dev/md4 is mounted under /mnt/md4 and it contains a single dir
/mnt/md4/kvm, encrypted
- after decrypting the /mnt/md4/kvm dir, it's bind-mounted on /kvm (mount
--bind /mnt/md4/kvm /kvm)
- nothing else is actually running on the hypervisor, only an openssh server
Guest:
- Gentoo Linux with a pure kernel 4.5.4, compiled manually
- virtio drivers for disk, networking etc.
- the whole image of the guest is a 250GB QCOW2 file, stored under
/kvm/xxx.qcow2 in the hypervisor's filesystem
- the root partition is /dev/sda2 (about 230GB), EXT3
I'm running this configuration successfully on many other (even very busy)
deployments without any problem; the only difference in this installation is
the encrypted /mnt/md4/kvm directory on the hypervisor.
Twice in the last 48 hours I've found the root filesystem of the guest
(/dev/sda2) remounted read-only after a detected write error. Here is
the log from dmesg:
[[Guest]]
[208323.124266] blk_update_request: critical target error, dev sda, sector 231060144
[208323.124540] Aborting journal on device sda2-8.
[208323.729847] EXT4-fs error (device sda2): ext4_journal_check_start:56: Detected aborted journal
[208323.729855] EXT4-fs (sda2): Remounting filesystem read-only
[208323.740861] EXT4-fs error (device sda2): ext4_journal_check_start:56: Detected aborted journal
[208323.772340] EXT4-fs error (device sda2): ext4_journal_check_start:56: Detected aborted journal
[208323.772346] EXT4-fs error (device sda2): ext4_journal_check_start:56: Detected aborted journal
[208323.773233] EXT4-fs error (device sda2): ext4_journal_check_start:56: Detected aborted journal
At the same time, the hypervisor's dmesg shows only this line:
[[Hypervisor]]
[596477.535490] ext4_bio_write_page: ret = -12
After that, I have to reboot the guest. I booted the guest from a Gentoo ISO
and ran fsck on the root (/dev/sda2) partition. This is the output:
e2fsck 1.42.13 (17-May-2015)
/dev/sda2: recovering journal
/dev/sda2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 5709828 has zero dtime. Fix<y>? yes
Inodes that were part of a corrupted orphan linked list found. Fix<y>? yes
Inode 5709829 was part of the orphaned inode list. FIXED.
Inode 5709830 was part of the orphaned inode list. FIXED.
Inode 5709831 was part of the orphaned inode list. FIXED.
Inode 5709832 was part of the orphaned inode list. FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (25234981, counted=22539957).
Fix<y>? yes
Inode bitmap differences: -(5709828--5709832)
Fix<y>? yes
Free inodes count wrong for group #697 (8175, counted=8180).
Fix<y>? yes
Free inodes count wrong (11008395, counted=10993791).
Fix<y>? yes
/dev/sda2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda2: 3424129/14417920 files (3.9% non-contiguous), 35131723/57671680 blocks
I attach the .config of the Hypervisor kernel.
Thank you in advance and best regards,
Andrew
--
You are receiving this mail because:
You are watching the assignee of the bug.
* [Bug 118511] Corruption of VM qcow2 image file on EXT4 with crypto enabled
From: bugzilla-daemon @ 2016-05-24 14:50 UTC
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=118511
Navin <navinp1912@gmail.com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |navinp1912@gmail.com
--- Comment #1 from Navin <navinp1912@gmail.com> ---
Can you check whether the Host/Hypervisor memory was full around the range
[596476, 596478], i.e. around 596477.535490, when ext4_bio_write_page was not
able to get memory?
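As a rough illustration of narrowing a log to that window, the sketch below filters dmesg-style text to the lines whose timestamp falls inside a given range. The function name `lines_in_window` and the two extra sample lines are hypothetical; only the `ext4_bio_write_page` line comes from this report.

```python
import re

# Matches a dmesg line that starts with a "[seconds.micros]" timestamp.
DMESG_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def lines_in_window(dmesg_text, start, end):
    """Return (timestamp, message) pairs with start <= timestamp <= end."""
    hits = []
    for line in dmesg_text.splitlines():
        match = DMESG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not begin with a timestamp
        stamp = float(match.group(1))
        if start <= stamp <= end:
            hits.append((stamp, match.group(2)))
    return hits


# Sample input: the first and last lines are invented for illustration.
sample = """\
[596470.000001] hypothetical unrelated message
[596477.535490] ext4_bio_write_page: ret = -12
[596490.123456] another hypothetical message
"""

for stamp, message in lines_in_window(sample, 596476.0, 596478.0):
    print(f"{stamp:.6f} {message}")
```

On a real system the text would come from the saved output of `dmesg` on the hypervisor; any allocation-failure or OOM messages inside the window would point towards genuine memory pressure.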
If your hypervisor is using encryption then this patch may help (already
present in 4.6 mainline)
https://patchwork.ozlabs.org/patch/602204/
If that doesn't work, then you need to get your system stats logged and check
when ENOMEM is returned. It could genuinely be out of memory, or there could be
something wrong with the code.
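One possible way to log such stats is to sample /proc/meminfo periodically on the hypervisor. The following is a minimal sketch, not a recommended monitoring setup: the helper names `parse_meminfo` and `available_fraction` are hypothetical, and the sample figures are made up rather than taken from the affected machine.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict.

    Values are the leading integers on each line (kB for most fields).
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_fraction(values):
    """Return MemAvailable / MemTotal, or None if either field is missing."""
    total = values.get("MemTotal")
    available = values.get("MemAvailable")
    if not total or available is None:
        return None
    return available / total


# Made-up sample in the /proc/meminfo layout, for illustration only.
sample = (
    "MemTotal:        8000000 kB\n"
    "MemFree:          200000 kB\n"
    "MemAvailable:     400000 kB\n"
)

stats = parse_meminfo(sample)
print(stats["MemTotal"], stats["MemAvailable"], available_fraction(stats))
```

To use it on a live host, the sample string would be replaced by the contents of /proc/meminfo, read in a loop and written out with a timestamp so that the readings can be lined up against the dmesg entries. A low MemAvailable reading near the failure would suggest genuine memory pressure; a healthy reading would not by itself rule out a failed allocation in the write path.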
The Hypervisor/Host cannot write/commit/allocate buffers because it is out of
memory.
Hence your guest is in a transient state where the changes are not committed,
and most probably the journal is aborted.
[[Hypervisor]]
[596477.535490] ext4_bio_write_page: ret = -12
http://lxr.free-electrons.com/source/include/uapi/asm-generic/errno-base.h#L15
15 #define ENOMEM 12 /* Out of memory */
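
The mapping from the logged return value to ENOMEM can be double-checked from user space. This is a small sketch using Python's standard `errno` module; the helper name `describe_kernel_ret` is hypothetical, and the table it consults is that of the platform running the script, which on Linux agrees with errno-base.h.

```python
import errno
import os


def describe_kernel_ret(ret):
    """Map a negative kernel-style return value to (code, name, description)."""
    code = -ret
    name = errno.errorcode.get(code, "UNKNOWN")
    return code, name, os.strerror(code)


# The value logged by ext4_bio_write_page in the hypervisor's dmesg.
code, name, text = describe_kernel_ret(-12)
print(code, name, text)
```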
* [Bug 118511] Corruption of VM qcow2 image file on EXT4 with crypto enabled
From: bugzilla-daemon @ 2016-05-25 17:45 UTC
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=118511
--- Comment #2 from ass3mbler@gmail.com ---
Hi Navin,
thank you very much for your help. I'll upgrade the kernel to v4.6 tonight
to see if it gets better!
I have some doubts about a real out-of-memory condition, since the Hypervisor
has 8GB of RAM, the Guest is hard-limited to 4GB and the only other "big"
process running on the Hypervisor is a simple (mostly idle) OpenSSH server
instance... so I really hope that the patch will solve the issue.
I'll let you know very shortly. Thank you again for your valuable help, and
best regards,
Andrew
* [Bug 118511] Corruption of VM qcow2 image file on EXT4 with crypto enabled
From: bugzilla-daemon @ 2016-05-30 16:46 UTC
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=118511
ass3mbler@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |PATCH_ALREADY_AVAILABLE
--- Comment #3 from ass3mbler@gmail.com ---
Hi Navin,
I can confirm that moving to kernel 4.6, following your suggestion, fully
solved the issue.
Thank you very much for pointing me in the right direction. I am marking this
issue as resolved.
Best regards,
Andrew