* status of bugzilla #99171 - mdraid broken for O_DIRECT [not found] <A4168F21-4CDF-4BAD-8754-30BAA1315C6F@web.de> @ 2025-10-14 20:14 ` Roland 2025-10-15 6:56 ` Hannes Reinecke 0 siblings, 1 reply; 15+ messages in thread From: Roland @ 2025-10-14 20:14 UTC (permalink / raw) To: Hannes Reinecke, Reindl Harald, linux-raid sorry, resend in text format as mail contained html and bounced from ML. Am 14.10.25 um 08:31 schrieb Hannes Reinecke: > Hmm. I still would argue that the testcase quoted is invalid. > > What you do is to issue writes of a given buffer, while at the > same time modifying the contents of that buffer. > > As we're doing zerocopy with O_DIRECT the buffer passed to pwrite > is _the same_ buffer used when issuing the write to disk. The > block layer now assumes that the buffer will _not_ be modified > when writing to disk (ie between issuing 'pwrite' and the resulting > request being send to disk). > But that's not the case here; it will be modified, and consequently > all sorts of issues will pop up. > We have had all sorts of fun some years back with this issue until > we fixed up all filesystems to do this correctly; if interested > dig up the threads regarding 'stable pages' on linux-fsdevel. > > I would think you will end up with a corrupted filesystem if you > run this without mdraid by just using btrfs with data checksumming. > yes, it's correct. you also end up with corrupted btrfs with this tool, see https://bugzilla.kernel.org/show_bug.cgi?id=99171#c16 > So really I'm not sure how to go from here; I would declare this > as invalid, but what do I know ... > > Cheers, > > Hannes anyhow, i don't see why this testcase is invalid, especially when zfs seems not to be affected. please look at this issue from a security perspective. 
if you can break or corrupt your raid mirror from userspace, even from an isolated layer/environment, i would rather consider this "testcase" to be "malicious code" which is able to subvert the virtualization/block/fs layer stack. how could we prevent non-trusted users in a vm or container environment from executing this "invalid" code ? how can we prevent them from doing harm to the underlying mirror in a hosting environment, for example ? not using it in a hosting environment is a rather weird strategy for a basic linux technology which has existed for years. and leaving it up to the hoster to remember that he needs to disable direct-io for the hypervisor is dissatisfying and error-prone, too. roland ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2025-10-14 20:14 ` status of bugzilla #99171 - mdraid broken for O_DIRECT Roland @ 2025-10-15 6:56 ` Hannes Reinecke 2025-10-15 23:09 ` Roland 0 siblings, 1 reply; 15+ messages in thread From: Hannes Reinecke @ 2025-10-15 6:56 UTC (permalink / raw) To: Roland, Reindl Harald, linux-raid On 10/14/25 22:14, Roland wrote: > sorry, resend in text format as mail contained html and bounced from ML. > > > Am 14.10.25 um 08:31 schrieb Hannes Reinecke: >> Hmm. I still would argue that the testcase quoted is invalid. >> >> What you do is to issue writes of a given buffer, while at the >> same time modifying the contents of that buffer. >> >> As we're doing zerocopy with O_DIRECT the buffer passed to pwrite >> is _the same_ buffer used when issuing the write to disk. The >> block layer now assumes that the buffer will _not_ be modified >> when writing to disk (ie between issuing 'pwrite' and the resulting >> request being send to disk). >> But that's not the case here; it will be modified, and consequently >> all sorts of issues will pop up. >> We have had all sorts of fun some years back with this issue until >> we fixed up all filesystems to do this correctly; if interested >> dig up the threads regarding 'stable pages' on linux-fsdevel. >> >> I would think you will end up with a corrupted filesystem if you >> run this without mdraid by just using btrfs with data checksumming. >> > yes, it's correct. you also end up with corrupted btrfs with this tool, > see https://bugzilla.kernel.org/show_bug.cgi?id=99171#c16 > >> So really I'm not sure how to go from here; I would declare this >> as invalid, but what do I know ... >> >> Cheers, >> >> Hannes > > anyhow, i don't see why this testcase is invalid, especially when zfs > seems not to be affected. > Welll ... I am sure you are aware of the somewhat dubious state of zfs and linux, right? 
And anyway: 'break userspace' is a matter of debate here; the use of O_DIRECT effectively moves the burden of checking I/O from the kernel to userspace; with O_DIRECT you can submit _any_ I/O without the kernel interfering, but at the same time you _must_ ensure that the I/O submitted conforms to the expectations the block layer has. And one of the expectations is that data is not modified between assembling the request and submitting the request to the drive. But that is precisely what the test program does. > please look at this issue from a security perspective. > > if you can break or corrupt your raid mirror from userspace even from an > insulated layer/environment, i would better consider this "testcase" to > be "malicious code" , which is able to subvert the virtualization/block/ > fs layer stack. > > how could we prevent, that non-trused users in a vm or container > environment can execute this "invalid" code ? > Well, yes, but then this is O_DIRECT. > > how can we prevent, that they do harm on the underlying mirror in a > hosting environment for example ? > Well, this has been an ongoing debate for years, and we from the linux side have had long discussions about that, too. But eventually we settled on the notion of 'stable pages', ie that the data buffer for a command _must not_ be modified between assembling the command and submitting the command to the drivers. Precisely such that we _can_ do things like data checksumming. > not using it in a hosting environment is a little bit weird strategy for > a linux basic technoligy which exists for years. > Oh, agreed. We do want to make linux better. But there is a perfectly viable workaround (namely: do not disable caching on the VM ...). So the question really is: where's the advantage? Security and O_DIRECT is always a very tricky subject, as O_DIRECT is precisely there to circumvent checks in the kernel. And yes, some of these checks are there to prevent security issues. 
So of course there will be security implications, but that was kinda the idea. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2025-10-15 6:56 ` Hannes Reinecke @ 2025-10-15 23:09 ` Roland 2025-10-16 6:02 ` Hannes Reinecke 0 siblings, 1 reply; 15+ messages in thread From: Roland @ 2025-10-15 23:09 UTC (permalink / raw) To: Hannes Reinecke, Reindl Harald, linux-raid > Welll ... I am sure you are aware of the somewhat dubious state of zfs > and linux, right? yes, i know about this "dubious" state due to licensing issues, but it has been in this state for years now and is nevertheless a pretty solid, easily installable and usable filesystem, used in many enterprise setups. i have been running dozens of zfs installations for years and did not have a single major issue, data loss or data corruption with them. but that's a different story which doesn't belong here... > And anyway: 'break userspace' is a matter of debate here; the use of > O_DIRECT effectively moves the burden of checking I/O from the kernel > to userspace; with O_DIRECT you can submit _any_ I/O without the kernel > interfering, but at the same time you _must_ ensure that the I/O > submitted conforms to the expectations the block layer has. > And one of the expectation is that data is not modified between > assembling the request and submitting the request to the drive. > > But that is precisely what the test program does. > >> please look at this issue from a security perspective. >> >> if you can break or corrupt your raid mirror from userspace even from >> an insulated layer/environment, i would better consider this >> "testcase" to be "malicious code" , which is able to subvert the >> virtualization/block/ fs layer stack. >> >> how could we prevent, that non-trused users in a vm or container >> environment can execute this "invalid" code ? >> > Well, yes, but then this is O_DIRECT. > >> >> how can we prevent, that they do harm on the underlying mirror in a >> hosting environment for example ? 
>> > > Well, this has been an ongoing debate for years, and we from the linux > side have had long discussions about that, too. > But eventually we settled on the notion of 'stable pages', ie that the > data buffer for a command _must not_ be modified between assembling the > command and submitting the command to the drivers. > Precisely such that we _can_ do things like data checksumming. > >> not using it in a hosting environment is a little bit weird strategy >> for a linux basic technoligy which exists for years. >> > Oh, agreed. We do want to make linux better. > But there is a perfectly viable workaround (namely: do not disable > caching on the VM ...). So the question really is: where's the > advantage? > Security and O_DIRECT is always a very tricky subject, as O_DIRECT > is precisely there to circumvent checks in the kernel. And yes, > some of these checks are there to prevent security issues. > So of course the will be security implications, but that was > kinda the idea. > > Cheers, > > Hannes thank you for your feedback. i see, things are complicated and O_DIRECT is a very special beast.... meanwhile, i gave bcachefs a try today, because it looks interesting. like zfs, it does not seem to be affected by this problem, at least from my first tests reported at https://bugzilla.kernel.org/show_bug.cgi?id=99171#c26 (i hope this is a valid test for consistency) so we have at least a second "software raid" technology besides zfs, which does NOT suffer from the "by design" O_DIRECT breakage. that at least surprises me, as bcachefs is far from production-ready, and i wonder why it just seems to work at this early stage of development. roland ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2025-10-15 23:09 ` Roland @ 2025-10-16 6:02 ` Hannes Reinecke 2025-10-17 20:18 ` Roland 0 siblings, 1 reply; 15+ messages in thread From: Hannes Reinecke @ 2025-10-16 6:02 UTC (permalink / raw) To: Roland, Reindl Harald, linux-raid On 10/16/25 01:09, Roland wrote: [ .. ]> > thank you for your feedback. > > i see, things are complicated and O_DIRECT is a very special beast.... > > meanwhile, i gave bcachefs a try today , because it looks interesting . > > like zfs, it does not seem to be affected by this problem, at least from > my first tests reported at https://bugzilla.kernel.org/show_bug.cgi? > id=99171#c26 (i hope this is a valid test for consistency) > > so we have at least a second "software raid" technology besides zfs, > which does NOT suffer from the "by design" O_DIRECT breakage. > > that's at least surprising me, as bcachefs is far from production > ready, and i wonder why it just seems to work at this early stage of > development. > Hmm. True. I would suggest bringing up this topic on linux-fsdevel; there is always a chance that there is a bug somewhere. At least some explanation would be warranted why bcachefs does not suffer from this issue. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2025-10-16 6:02 ` Hannes Reinecke @ 2025-10-17 20:18 ` Roland 2025-10-20 6:44 ` Hannes Reinecke 0 siblings, 1 reply; 15+ messages in thread From: Roland @ 2025-10-17 20:18 UTC (permalink / raw) To: Hannes Reinecke, Reindl Harald, linux-raid good evening, are you really sure linux-fsdevel is the right place to report this? i was just able to reproduce the mdraid breakage without any filesystem involved: i just put lvm on top of mdraid, passed lvm logical volumes from that to the debian vm, and then ran "break-raid-odirect /dev/sdb" inside the vm. meanwhile, the btrfs issue with O_DIRECT seems to be fixed, at least from my quick tests, reported at https://bugzilla.kernel.org/show_bug.cgi?id=99171#c35 . the btrfs fix is also linked there. regards Roland Am 16.10.25 um 08:02 schrieb Hannes Reinecke: > On 10/16/25 01:09, Roland wrote: > [ .. ]> >> thank you for your feedback. >> >> i see, things are complicated and O_DIRECT is a very special beast.... >> >> meanwhile, i gave bcachefs a try today , because it looks interesting . >> >> like zfs, it does not seem to be affected by this problem, at least >> from my first tests reported at >> https://bugzilla.kernel.org/show_bug.cgi? id=99171#c26 (i hope this >> is a valid test for consistency) >> >> so we have at least a second "software raid" technology besides zfs, >> which does NOT suffer from the "by design" O_DIRECT breakage. >> >> that's at least surprising me, as bcachefs is far from production >> ready, and i wonder why it just seems to work at this early stage of >> development. >> > Hmm. True. > > I would suggest bringing up this topic on linux-fsdevel; there is > always a chance that there is a bug somewhere. > At least some explanation would be warranted why bcachefs does not > suffer from this issue. > > Cheers, > > Hannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2025-10-17 20:18 ` Roland @ 2025-10-20 6:44 ` Hannes Reinecke 0 siblings, 0 replies; 15+ messages in thread From: Hannes Reinecke @ 2025-10-20 6:44 UTC (permalink / raw) To: Roland, Reindl Harald, linux-raid On 10/17/25 22:18, Roland wrote: > good evening, > > are you really sure fs-fsdevel is the right place to report this? > > i just was able to reproduce the mdraid breakage without any filesystem > involved, just put lvm on top of mdraid and passed lvm logical volumes > from that to the debian vm, and then ran "break-raid-odirect /dev/sdb" > inside vm. > > meanwhile, btrfs issue with O_DIRECT seems to be fixed, at least from > my quick tests, reported at https://bugzilla.kernel.org/show_bug.cgi? > id=99171#c35 . btrfs fix is also linked there. > Ah, so btrfs is fixed, so indeed a report on fsdevel is pointless. So on the good side my analysis was correct (phew :-); on the flip side my attempt to offload that problem to someone else has failed :-( So guess we need to fix it after all. Curious, though; for RAID5 we do set the 'STABLE_WRITES' flag. Does the issue occur with RAID5, too? Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* status of bugzilla #99171 - mdraid broken for O_DIRECT @ 2024-10-09 20:08 Roland 2024-10-09 21:38 ` Reindl Harald 0 siblings, 1 reply; 15+ messages in thread From: Roland @ 2024-10-09 20:08 UTC (permalink / raw) To: linux-raid Hello, as the proxmox hypervisor does not offer mdadm software raid at installation time because of this bugticket "MD RAID or DRBD can be broken from userspace when using O_DIRECT" https://bugzilla.kernel.org/show_bug.cgi?id=99171 i tried to find some more references besides the kernel bugzilla entry - and - besides some discussion in the proxmox community - i did not succeed. why is this apparent fundamental design flaw (should we call it that?) so damn unknown ? and what about O_DIRECT with other software raid solutions like btrfs or zfs ? how/why do they get it right, but not mdraid ? the latter has existed for a much longer time and afaik is offered as an install-time option for RHEL (enterprise linux), for example, where people use oracle on top - which IS using O_DIRECT very often/likely. not accusing anyone - just being curious why the totally unknown/unpopular bugticket #99171 has bitrotted for nearly a decade now... regards Roland Sysadmin ps: also see "qemu cache=none should not be used with mdadm" https://bugzilla.proxmox.com/show_bug.cgi?id=5235 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2024-10-09 20:08 Roland @ 2024-10-09 21:38 ` Reindl Harald 2024-10-10 6:53 ` Hannes Reinecke 0 siblings, 1 reply; 15+ messages in thread From: Reindl Harald @ 2024-10-09 21:38 UTC (permalink / raw) To: Roland, linux-raid Am 09.10.24 um 22:08 schrieb Roland: > as proxmox hypervisor does not offer mdadm software raid at installation > time because of this bugticket > > "MD RAID or DRBD can be broken from userspace when using O_DIRECT" > https://bugzilla.kernel.org/show_bug.cgi?id=99171 > > ps: > also see "qemu cache=none should not be used with mdadm" > https://bugzilla.proxmox.com/show_bug.cgi?id=5235 that all sounds like terrible nonsense if "Yes. O_DIRECT is really fundamentally broken. There's just no way to fix it sanely. Except by teaching people not to use it, and making the normal paths fast enough" is true, then it has to go away it's not acceptable that userspace can break the integrity of the underlying RAID - period ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2024-10-09 21:38 ` Reindl Harald @ 2024-10-10 6:53 ` Hannes Reinecke 2024-10-10 7:29 ` Roland 0 siblings, 1 reply; 15+ messages in thread From: Hannes Reinecke @ 2024-10-10 6:53 UTC (permalink / raw) To: Reindl Harald, Roland, linux-raid On 10/9/24 23:38, Reindl Harald wrote: > > Am 09.10.24 um 22:08 schrieb Roland: >> as proxmox hypervisor does not offer mdadm software raid at installation >> time because of this bugticket >> >> "MD RAID or DRBD can be broken from userspace when using O_DIRECT" >> https://bugzilla.kernel.org/show_bug.cgi?id=99171 >> >> ps: >> also see "qemu cache=none should not be used with mdadm" >> https://bugzilla.proxmox.com/show_bug.cgi?id=5235 > that all sounds like terrible nosense > > if "Yes. O_DIRECT is really fundamentally broken. There's just no way to > fix it sanely. Except by teaching people not to use it, and making the > normal paths fast enough" it has to go away > > it's not acceptable that userspace can break the integrity of the > underlying RAID - period > Take a deep breath, everyone. Nothing has happened, nothing has been broken. All systems continue to operate as normal. If you look closely at the mentioned bug, you'll find that it does modify the buffer at random times, in particular while it's being written to disk. Now, the boilerplate text for O_DIRECT says: the application is in control of the data, and the data will be written without any caching. Applying that to our testcase, it means that the application _can_ modify the data, even if it's in the process of being written to disk (zero copy and all that). We do guarantee that data is consistent once I/O is completed (here: once 'write' returns), but we do not (and, in fact, cannot) guarantee that data is consistent while write() is running. Which means that the test case is actually invalid; you would either need to drop O_DIRECT or modify the buffer only after write() has returned to arrive at a valid example. 
That doesn't mean that I don't agree with the comments about O_DIRECT. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2024-10-10 6:53 ` Hannes Reinecke @ 2024-10-10 7:29 ` Roland 2024-10-10 8:34 ` Hannes Reinecke 0 siblings, 1 reply; 15+ messages in thread From: Roland @ 2024-10-10 7:29 UTC (permalink / raw) To: Hannes Reinecke, Reindl Harald, linux-raid thank you for clearing things up. >Which means that the test case is actually invalid; you either would need drop O_DIRECT or modify the buffer >after write() to arrive with a valid example. ok, but what about running virtual machines in O_DIRECT mode on top of mdraid then ? https://forum.proxmox.com/threads/zfs-on-debian-or-mdadm-softraid-stability-and-reliability-of-zfs.116871/post-505697 i have not seen any report of broken/inconsistent mdraid caused by virtual machines, so is this just a "theoretical" issue ? i'm curious why we can use zfs software raid with virtual machines but not md software raid. shouldn't that have the same problem ( https://www.phoronix.com/news/OpenZFS-Direct-IO ) , at least from now on ? regards Roland Am 10.10.24 um 08:53 schrieb Hannes Reinecke: > On 10/9/24 23:38, Reindl Harald wrote: >> >> Am 09.10.24 um 22:08 schrieb Roland: >>> as proxmox hypervisor does not offer mdadm software raid at >>> installation >>> time because of this bugticket >>> >>> "MD RAID or DRBD can be broken from userspace when using O_DIRECT" >>> https://bugzilla.kernel.org/show_bug.cgi?id=99171 >>> >>> ps: >>> also see "qemu cache=none should not be used with mdadm" >>> https://bugzilla.proxmox.com/show_bug.cgi?id=5235 >> that all sounds like terrible nosense >> >> if "Yes. O_DIRECT is really fundamentally broken. There's just no way >> to fix it sanely. Except by teaching people not to use it, and making >> the normal paths fast enough" it has to go away >> >> it's not acceptable that userspace can break the integrity of the >> underlying RAID - period >> > Take deep breath everyone. > Nothing has happened, nothing has been broken. 
> All systems continue to operate as normal. > > If you look closely at the mentioned bug, you'll find that it does > modify the buffer at random times, in particular while it's being > written to disk. > Now, the boilerplate text for O_DIRECT says: the application is in > control of the data, and the data will be written without any caching. > Applying that to our testcase it means that the application _can_ modify > the data, even if it's in the process of being written to disk (zero > copy and all that). > We do guarantee that data is consistent once I/O is completed (here: > once 'write' returns), but we do not (and, in fact, cannot) guarantee > that data is consistent while write() is running. > > Which means that the test case is actually invalid; you either would > need drop O_DIRECT or modify the buffer after write() to arrive with > a valid example. > > That doesn't mean that I don't agree with the comments about O_DIRECT. > > Cheers, > > Hannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2024-10-10 7:29 ` Roland @ 2024-10-10 8:34 ` Hannes Reinecke 2025-10-11 19:25 ` Roland 0 siblings, 1 reply; 15+ messages in thread From: Hannes Reinecke @ 2024-10-10 8:34 UTC (permalink / raw) To: Roland, Reindl Harald, linux-raid On 10/10/24 09:29, Roland wrote: > thank you for clearing things up. > > >Which means that the test case is actually invalid; you either would > need drop O_DIRECT or modify the buffer > >after write() to arrive with a valid example. > > ok, but what about running virtual machines in O_DIRECT mode on top of > mdraid then ? > > https://forum.proxmox.com/threads/zfs-on-debian-or-mdadm-softraid- > stability-and-reliability-of-zfs.116871/post-505697 > The example quoted is this: > Take a virtual machine, give it a disk - put the image on a software > raid and tell qemu to disable caching (iow. use O_DIRECT, because the > guest already does caching anyway). > Run linux in the VM, add part of the/a disk on the raid as swap, and > cause the guest to start swapping a lot. And then ending up with data corruption on MD. Which I really would love to see reproduced, especially with recent kernels, as there is a lot of vagueness around it (add part of the disk on the raid as swap? How? In the host? On the guest?). Hint: we (SUSE) have a bugzilla.suse.com. And if someone would be reproducing that with, say, OpenSUSE Tumbleweed and open a bugzilla someone on this list would be more than happy to have a look and do a proper debugging here. There are a lot of things which have changed since 2017 (Stable pages? Anyone?), so it might be that the cited issue simply is not reproducible anymore. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2024-10-10 8:34 ` Hannes Reinecke @ 2025-10-11 19:25 ` Roland 2025-10-13 6:48 ` Hannes Reinecke 0 siblings, 1 reply; 15+ messages in thread From: Roland @ 2025-10-11 19:25 UTC (permalink / raw) To: Hannes Reinecke, Reindl Harald, linux-raid hello, some late reply for this... > Am 10.10.24 um 10:34 schrieb Hannes Reinecke: > Which I really would love > to see reproduced, especially with recent kernels, as there is a lot of > vagueness around it (add part of the disk on the raid as swap? How? > In the host? On the guest?). here is a reproducer everybody should be able to follow/reproduce. 1. install proxmox pve9 on a system with two empty disks for mdraid. build the mdraid and format it with ext4. 2. add that ext4 mountpoint as a datastore type "dir" for file/vm storage in proxmox. 3. install debian 13 in a normal/default (cache=none, i.e. O_DIRECT = on) linux VM. the virtual disk should be backed by the mdraid/ext4 datastore created above. 4. inside the vm as an ordinary user get break-raid-odirect.c from https://forum.proxmox.com/threads/mdraid-o_direct.156036/post-713543 , compile it and let it run for a while, then terminate it with ctrl-c. 5. on the pve host, check that your raid has not thrown any error and that mismatch_cnt is not already >0 ( cat /sys/block/md127/md/mismatch_cnt ) in the meantime. 6. on the pve host start a raid check with "echo check > /sys/block/md127/md/sync_action" 7. let that check run and wait until it finishes (/proc/mdstat) 8. check for inconsistencies via "cat /sys/block/md127/md/mismatch_cnt" again i am getting: cat /sys/block/md127/md/mismatch_cnt 1048832 so, we see that even with a recent kernel (the pve9 kernel is 6.14, based on the ubuntu kernel), we can break mdraid as a non-root user inside a qemu VM on top of ext4 on top of mdraid. roland Am 10.10.24 um 10:34 schrieb Hannes Reinecke: > On 10/10/24 09:29, Roland wrote: >> thank you for clearing things up. 
>> >> >Which means that the test case is actually invalid; you either would >> need drop O_DIRECT or modify the buffer >> >after write() to arrive with a valid example. >> >> ok, but what about running virtual machines in O_DIRECT mode on top of >> mdraid then ? >> >> https://forum.proxmox.com/threads/zfs-on-debian-or-mdadm-softraid- >> stability-and-reliability-of-zfs.116871/post-505697 >> > > The example quoted is this: > > Take a virtual machine, give it a disk - put the image on a software > > raid and tell qemu to disable caching (iow. use O_DIRECT, because the > > guest already does caching anyway). > > Run linux in the VM, add part of the/a disk on the raid as swap, and > > cause the guest to start swapping a lot. > > And then ending up with data corruption on MD. Which I really would love > to see reproduced, especially with recent kernels, as there is a lot of > vagueness around it (add part of the disk on the raid as swap? How? > In the host? On the guest?). > > Hint: we (SUSE) have a bugzilla.suse.com. And if someone would be > reproducing that with, say, OpenSUSE Tumbleweed and open a bugzilla > someone on this list would be more than happy to have a look and do > a proper debugging here. There are a lot of things which have changed > since 2017 (Stable pages? Anyone?), so it might be that the cited issue > simply is not reproducible anymore. > > Cheers, > > Hannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2025-10-11 19:25 ` Roland @ 2025-10-13 6:48 ` Hannes Reinecke 2025-10-13 19:06 ` Roland [not found] ` <6fb3e2cb-8eeb-4e76-9364-16348d807784@web.de> 0 siblings, 2 replies; 15+ messages in thread From: Hannes Reinecke @ 2025-10-13 6:48 UTC (permalink / raw) To: Roland, Reindl Harald, linux-raid On 10/11/25 21:25, Roland wrote: > hello, > > some late reply for this... > > > Am 10.10.24 um 10:34 schrieb Hannes Reinecke: > > Which I really would love > > to see reproduced, especially with recent kernels, as there is a lot of > > vagueness around it (add part of the disk on the raid as swap? How? > > In the host? On the guest?). > > here is a reproducer everybody should be able to follow/reproduce. > > 1. install proxmox pve9 on a system with two empty disks for mdraid. > build mdraid and format with ext4 . > > 2. add that ext4 mountpoint as a datastore type "dir" for file/vm > storage in proxmox. > > 3. install a debian13 in a normal/default (cache=none, i.e. O_DIRECT = > on) linux VM. the virtual disk should be backed by that mdraid/ext4 > datastore created above. > > 4. inside the vm as an ordinary user get break-raid-odirect.c from > https://forum.proxmox.com/threads/mdraid-o_direct.156036/post-713543 , > compile that and let that run for a while. then terminate with ctrl-c. > > 5. on the pve host, check if your raid did not throw any error or has > mismatch_count >0 ( cat /sys/block/md127/md/mismatch_cnt ) in the meantime. > > 6. on the pve host start raid check with "echo check > /sys/block/ > md127/md/sync_action" > > 6. let that check run and wait until it finishes (/proc/mdstat) > > 7. 
check for inconsistencies via "cat /sys/block/md127/md/mismatch_cnt" > again > > i am getting: > > cat /sys/block/md127/md/mismatch_cnt > 1048832 > > so , we see that even with recent kernel (pve9 kernel is 6.14 based on > ubuntu kernel), we can break mdraid from non-root user inside a qemu VM > on top ext4 on top of mdraid. > And what would happen if you use 'xfs' instead of 'ext4'? ext4 has some nasty requirements regarding 'flush', and that might well explain the issue here. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT
  2025-10-13  6:48     ` Hannes Reinecke
@ 2025-10-13 19:06       ` Roland
  [not found]             ` <6fb3e2cb-8eeb-4e76-9364-16348d807784@web.de>
  1 sibling, 0 replies; 15+ messages in thread
From: Roland @ 2025-10-13 19:06 UTC (permalink / raw)
  To: Hannes Reinecke, Reindl Harald, linux-raid

hello,

>> 7. check for inconsistencies via "cat
>> /sys/block/md127/md/mismatch_cnt" again
>>
>> i am getting:
>>
>> cat /sys/block/md127/md/mismatch_cnt
>> 1048832
>>
>> so, we see that even with a recent kernel (the pve9 kernel is 6.14,
>> based on the ubuntu kernel), we can break mdraid from a non-root user
>> inside a qemu VM on top of ext4 on top of mdraid.
>>
>
> And what would happen if you use 'xfs' instead of 'ext4'?
> ext4 has some nasty requirements regarding 'flush', and that might well
> explain the issue here.
>
> Cheers,
>
> Hannes

thanks for the feedback and for this hint.

i tested with xfs today, and it seems to make no difference.

i can inject inconsistencies into the raid with the mentioned tool, via
a debian VM with xfs inside debian, hosted on an xfs-formatted md raid1
on proxmox.

root@pve-hpmini-gen8:~# cat /sys/block/md126/md/mismatch_cnt
59648

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid4] [raid5] [raid6] [raid10] [linear]
md126 : active raid1 sdb3[1] sda3[0]
      48794624 blocks super 1.2 [2/2] [UU]
      [==================>..]  check = 90.5% (44183744/48794624) finish=0.7min speed=107987K/sec
      bitmap: 1/1 pages [4KB], 65536KB chunk

md127 : active raid1 sdb1[2] sda1[3]
      48794624 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>

roland

On 13.10.25 at 08:48, Hannes Reinecke wrote:
> On 10/11/25 21:25, Roland wrote:
>> hello,
>>
>> some late reply for this...
>>
>> > On 10.10.24 at 10:34, Hannes Reinecke wrote:
>> > Which I really would love
>> > to see reproduced, especially with recent kernels, as there is a lot of
>> > vagueness around it (add part of the disk on the raid as swap? How?
>> > In the host? On the guest?).
>>
>> here is a reproducer everybody should be able to follow/reproduce.
>>
>> 1. install proxmox pve9 on a system with two empty disks for mdraid.
>> build mdraid and format with ext4.
>>
>> 2. add that ext4 mountpoint as a datastore of type "dir" for file/vm
>> storage in proxmox.
>>
>> 3. install a debian13 in a normal/default (cache=none, i.e. O_DIRECT
>> = on) linux VM. the virtual disk should be backed by the mdraid/ext4
>> datastore created above.
>>
>> 4. inside the vm, as an ordinary user, get break-raid-odirect.c from
>> https://forum.proxmox.com/threads/mdraid-o_direct.156036/post-713543 ,
>> compile it and let it run for a while. then terminate with ctrl-c.
>>
>> 5. on the pve host, check whether your raid threw any error or shows
>> mismatch_cnt > 0 (cat /sys/block/md127/md/mismatch_cnt) in the
>> meantime.
>>
>> 6. on the pve host, start a raid check with
>> "echo check > /sys/block/md127/md/sync_action"
>>
>> 6. let that check run and wait until it finishes (/proc/mdstat)
>>
>> 7. check for inconsistencies via
>> "cat /sys/block/md127/md/mismatch_cnt" again
>>
>> i am getting:
>>
>> cat /sys/block/md127/md/mismatch_cnt
>> 1048832
>>
>> so, we see that even with a recent kernel (the pve9 kernel is 6.14,
>> based on the ubuntu kernel), we can break mdraid from a non-root user
>> inside a qemu VM on top of ext4 on top of mdraid.
>>
>
> And what would happen if you use 'xfs' instead of 'ext4'?
> ext4 has some nasty requirements regarding 'flush', and that might well
> explain the issue here.
>
> Cheers,
>
> Hannes
[parent not found: <6fb3e2cb-8eeb-4e76-9364-16348d807784@web.de>]
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT
  [not found]             ` <6fb3e2cb-8eeb-4e76-9364-16348d807784@web.de>
@ 2025-10-14  6:31               ` Hannes Reinecke
  0 siblings, 0 replies; 15+ messages in thread
From: Hannes Reinecke @ 2025-10-14 6:31 UTC (permalink / raw)
  To: Roland, Reindl Harald, linux-raid

On 10/13/25 21:04, Roland wrote:
> hello,
>
>>> 7. check for inconsistencies via
>>> "cat /sys/block/md127/md/mismatch_cnt" again
>>>
>>> i am getting:
>>>
>>> cat /sys/block/md127/md/mismatch_cnt
>>> 1048832
>>>
>>> so, we see that even with a recent kernel (the pve9 kernel is 6.14,
>>> based on the ubuntu kernel), we can break mdraid from a non-root user
>>> inside a qemu VM on top of ext4 on top of mdraid.
>>>
>>
>> And what would happen if you use 'xfs' instead of 'ext4'?
>> ext4 has some nasty requirements regarding 'flush', and that might well
>> explain the issue here.
>>
>> Cheers,
>>
>> Hannes
>
> thanks for the feedback and for this hint.
>
> i tested with xfs today, and it seems to make no difference.
>
> i can inject inconsistencies at block level with the mentioned tool,
> via a debian VM with xfs inside debian, hosted on an xfs-formatted
> md raid1 on proxmox.
>
> root@pve-hpmini-gen8:~# cat /sys/block/md126/md/mismatch_cnt
> 59648
>
> # cat /proc/mdstat
> Personalities : [raid0] [raid1] [raid4] [raid5] [raid6] [raid10] [linear]
> md126 : active raid1 sdb3[1] sda3[0]
>       48794624 blocks super 1.2 [2/2] [UU]
>       [==================>..]  check = 90.5% (44183744/48794624) finish=0.7min speed=107987K/sec
>       bitmap: 1/1 pages [4KB], 65536KB chunk
>
> md127 : active raid1 sdb1[2] sda1[3]
>       48794624 blocks super 1.2 [2/2] [UU]
>       bitmap: 0/1 pages [0KB], 65536KB chunk
>
> unused devices: <none>
>
Hmm. I still would argue that the testcase quoted is invalid.

What you do is to issue writes of a given buffer, while at the
same time modifying the contents of that buffer.

As we're doing zero-copy with O_DIRECT, the buffer passed to pwrite
is _the same_ buffer used when issuing the write to disk. The
block layer assumes that the buffer will _not_ be modified
while writing to disk (i.e. between issuing 'pwrite' and the resulting
request being sent to disk).
But that's not the case here; it will be modified, and consequently
all sorts of issues will pop up.
We had all sorts of fun with this issue some years back, until
we fixed up all filesystems to handle it correctly; if interested,
dig up the threads regarding 'stable pages' on linux-fsdevel.

I would think you will end up with a corrupted filesystem if you
run this without mdraid, just by using btrfs with data checksumming.

So really I'm not sure how to go from here; I would declare this
as invalid, but what do I know ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   Kernel Storage Architect
hare@suse.de                                 +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
end of thread, other threads:[~2025-10-20 6:44 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <A4168F21-4CDF-4BAD-8754-30BAA1315C6F@web.de>
2025-10-14 20:14 ` status of bugzilla #99171 - mdraid broken for O_DIRECT Roland
2025-10-15 6:56 ` Hannes Reinecke
2025-10-15 23:09 ` Roland
2025-10-16 6:02 ` Hannes Reinecke
2025-10-17 20:18 ` Roland
2025-10-20 6:44 ` Hannes Reinecke
2024-10-09 20:08 Roland
2024-10-09 21:38 ` Reindl Harald
2024-10-10 6:53 ` Hannes Reinecke
2024-10-10 7:29 ` Roland
2024-10-10 8:34 ` Hannes Reinecke
2025-10-11 19:25 ` Roland
2025-10-13 6:48 ` Hannes Reinecke
2025-10-13 19:06 ` Roland
[not found] ` <6fb3e2cb-8eeb-4e76-9364-16348d807784@web.de>
2025-10-14 6:31 ` Hannes Reinecke
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox