* [PATCH 1/9] block: fix data loss and stale date exposure problems during append write
2025-11-21 8:17 Fix potential data loss and corruption due to Incorrect BIO Chain Handling zhangshida
@ 2025-11-21 8:17 ` zhangshida
2025-11-21 9:34 ` Johannes Thumshirn
` (2 more replies)
2025-11-21 8:17 ` [PATCH 2/9] block: export bio_chain_and_submit zhangshida
` (9 subsequent siblings)
10 siblings, 3 replies; 40+ messages in thread
From: zhangshida @ 2025-11-21 8:17 UTC (permalink / raw)
To: linux-kernel
Cc: linux-block, nvdimm, virtualization, linux-nvme, gfs2, ntfs3,
linux-xfs, zhangshida, starzhangzsd
From: Shida Zhang <zhangshida@kylinos.cn>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
---
block/bio.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/bio.c b/block/bio.c
index b3a79285c27..55c2c1a0020 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -322,7 +322,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
static void bio_chain_endio(struct bio *bio)
{
- bio_endio(__bio_chain_endio(bio));
+ bio_endio(bio);
}
/**
--
2.34.1
* Re: [PATCH 1/9] block: fix data loss and stale date exposure problems during append write
2025-11-21 8:17 ` [PATCH 1/9] block: fix data loss and stale date exposure problems during append write zhangshida
@ 2025-11-21 9:34 ` Johannes Thumshirn
2025-11-22 7:08 ` Stephen Zhang
2025-11-21 10:31 ` Christoph Hellwig
2025-11-22 12:15 ` Ming Lei
2 siblings, 1 reply; 40+ messages in thread
From: Johannes Thumshirn @ 2025-11-21 9:34 UTC (permalink / raw)
To: zhangshida, linux-kernel@vger.kernel.org
Cc: linux-block@vger.kernel.org, nvdimm@lists.linux.dev,
virtualization@lists.linux.dev, linux-nvme@lists.infradead.org,
gfs2@lists.linux.dev, ntfs3@lists.linux.dev,
linux-xfs@vger.kernel.org, zhangshida@kylinos.cn
On 11/21/25 9:19 AM, zhangshida wrote:
> From: Shida Zhang <zhangshida@kylinos.cn>
>
> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Regardless of the code change, this needs documentation of what you are
doing and why it is correct.
> ---
> block/bio.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index b3a79285c27..55c2c1a0020 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -322,7 +322,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
>
> static void bio_chain_endio(struct bio *bio)
> {
> - bio_endio(__bio_chain_endio(bio));
> + bio_endio(bio);
> }
>
> /**
* Re: [PATCH 1/9] block: fix data loss and stale date exposure problems during append write
2025-11-21 9:34 ` Johannes Thumshirn
@ 2025-11-22 7:08 ` Stephen Zhang
0 siblings, 0 replies; 40+ messages in thread
From: Stephen Zhang @ 2025-11-22 7:08 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
nvdimm@lists.linux.dev, virtualization@lists.linux.dev,
linux-nvme@lists.infradead.org, gfs2@lists.linux.dev,
ntfs3@lists.linux.dev, linux-xfs@vger.kernel.org,
zhangshida@kylinos.cn
Johannes Thumshirn <Johannes.Thumshirn@wdc.com> wrote on Fri, Nov 21, 2025 at 17:35:
>
> On 11/21/25 9:19 AM, zhangshida wrote:
> > From: Shida Zhang <zhangshida@kylinos.cn>
> >
> > Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
>
>
> Regardless of the code change, this needs documentation of what you are
> doing and why it is correct.
>
Okay, will do that if I get the chance to send a v2.
Thanks,
Shida
> > ---
> > block/bio.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index b3a79285c27..55c2c1a0020 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -322,7 +322,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
> >
> > static void bio_chain_endio(struct bio *bio)
> > {
> > - bio_endio(__bio_chain_endio(bio));
> > + bio_endio(bio);
> > }
> >
> > /**
>
>
* Re: [PATCH 1/9] block: fix data loss and stale date exposure problems during append write
2025-11-21 8:17 ` [PATCH 1/9] block: fix data loss and stale date exposure problems during append write zhangshida
2025-11-21 9:34 ` Johannes Thumshirn
@ 2025-11-21 10:31 ` Christoph Hellwig
2025-11-21 16:13 ` Andreas Gruenbacher
2025-11-22 12:15 ` Ming Lei
2 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-11-21 10:31 UTC (permalink / raw)
To: zhangshida
Cc: linux-kernel, linux-block, nvdimm, virtualization, linux-nvme,
gfs2, ntfs3, linux-xfs, zhangshida
On Fri, Nov 21, 2025 at 04:17:40PM +0800, zhangshida wrote:
> From: Shida Zhang <zhangshida@kylinos.cn>
>
> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
> ---
> block/bio.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index b3a79285c27..55c2c1a0020 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -322,7 +322,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
>
> static void bio_chain_endio(struct bio *bio)
> {
> - bio_endio(__bio_chain_endio(bio));
> + bio_endio(bio);
I don't see how this can work. bio_chain_endio is called literally
as the result of calling bio_endio, so you recurse into that.
Also please put your analysis of the problem into this patch as
said by Johannes. And please wrap it at 73 characters as the wall of
text in the cover letter is quite unreadable.
* Re: [PATCH 1/9] block: fix data loss and stale date exposure problems during append write
2025-11-21 10:31 ` Christoph Hellwig
@ 2025-11-21 16:13 ` Andreas Gruenbacher
2025-11-22 7:25 ` Stephen Zhang
2025-11-28 3:22 ` Stephen Zhang
0 siblings, 2 replies; 40+ messages in thread
From: Andreas Gruenbacher @ 2025-11-21 16:13 UTC (permalink / raw)
To: Christoph Hellwig
Cc: zhangshida, linux-kernel, linux-block, nvdimm, virtualization,
linux-nvme, gfs2, ntfs3, linux-xfs, zhangshida
On Fri, Nov 21, 2025 at 11:38 AM Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Nov 21, 2025 at 04:17:40PM +0800, zhangshida wrote:
> > From: Shida Zhang <zhangshida@kylinos.cn>
> >
> > Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
> > ---
> > block/bio.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index b3a79285c27..55c2c1a0020 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -322,7 +322,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
> >
> > static void bio_chain_endio(struct bio *bio)
> > {
> > - bio_endio(__bio_chain_endio(bio));
> > + bio_endio(bio);
>
> I don't see how this can work. bio_chain_endio is called literally
> as the result of calling bio_endio, so you recurse into that.
Hmm, I don't actually see where: bio_endio() only calls
__bio_chain_endio(), which is fine.
Once bio_chain_endio() only calls bio_endio(), it can probably be
removed in a follow-up patch.
Also, loosely related, what I find slightly odd is this code in
__bio_chain_endio():
if (bio->bi_status && !parent->bi_status)
parent->bi_status = bio->bi_status;
I don't think it really matters whether or not parent->bi_status is
already set here?
Also, multiple completions can race setting bi_status, so shouldn't we
at least have a WRITE_ONCE() here and in the other places that set
bi_status?
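(To illustrate, a minimal and untested sketch of what that could look like
in __bio_chain_endio(), keeping the existing "first error wins" check:)

static struct bio *__bio_chain_endio(struct bio *bio)
{
	struct bio *parent = bio->bi_private;

	/* propagate the first error seen, without compiler store tearing */
	if (bio->bi_status && !READ_ONCE(parent->bi_status))
		WRITE_ONCE(parent->bi_status, bio->bi_status);
	bio_put(bio);
	return parent;
}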
Thanks,
Andreas
* Re: [PATCH 1/9] block: fix data loss and stale date exposure problems during append write
2025-11-21 16:13 ` Andreas Gruenbacher
@ 2025-11-22 7:25 ` Stephen Zhang
2025-11-28 3:22 ` Stephen Zhang
1 sibling, 0 replies; 40+ messages in thread
From: Stephen Zhang @ 2025-11-22 7:25 UTC (permalink / raw)
To: Andreas Gruenbacher
Cc: Christoph Hellwig, linux-kernel, linux-block, nvdimm,
virtualization, linux-nvme, gfs2, ntfs3, linux-xfs, zhangshida
Andreas Gruenbacher <agruenba@redhat.com> wrote on Sat, Nov 22, 2025 at 00:13:
>
> On Fri, Nov 21, 2025 at 11:38 AM Christoph Hellwig <hch@infradead.org> wrote:
> > On Fri, Nov 21, 2025 at 04:17:40PM +0800, zhangshida wrote:
> > > From: Shida Zhang <zhangshida@kylinos.cn>
> > >
> > > Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
> > > ---
> > > block/bio.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/block/bio.c b/block/bio.c
> > > index b3a79285c27..55c2c1a0020 100644
> > > --- a/block/bio.c
> > > +++ b/block/bio.c
> > > @@ -322,7 +322,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
> > >
> > > static void bio_chain_endio(struct bio *bio)
> > > {
> > > - bio_endio(__bio_chain_endio(bio));
> > > + bio_endio(bio);
> >
> > I don't see how this can work. bio_chain_endio is called literally
> > as the result of calling bio_endio, so you recurse into that.
>
> Hmm, I don't actually see where: bio_endio() only calls
> __bio_chain_endio(), which is fine.
>
> Once bio_chain_endio() only calls bio_endio(), it can probably be
> removed in a follow-up patch.
>
> Also, loosely related, what I find slightly odd is this code in
> __bio_chain_endio():
>
> if (bio->bi_status && !parent->bi_status)
> parent->bi_status = bio->bi_status;
>
> I don't think it really matters whether or not parent->bi_status is
> already set here?
>
From what I understand, it wants to pass the bi_status to the very last bio
because end_io is only called for that final one. This allows the end_io
function to know the overall status of the entire I/O chain.
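(For example, with a hypothetical submitter that chains several bios and
puts its real completion handler only on the final one, that handler sees
the combined result -- my_ioend and my_ioend_finish() are made-up names,
just for illustration:)

static void my_chain_done(struct bio *bio)
{
	struct my_ioend *ioend = bio->bi_private;

	/* bi_status here reflects the first error of the whole chain */
	my_ioend_finish(ioend, blk_status_to_errno(bio->bi_status));
	bio_put(bio);
}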
> Also, multiple completions can race setting bi_status, so shouldn't we
> at least have a WRITE_ONCE() here and in the other places that set
> bi_status?
>
Great, that means I could add two more patches to this series: one to
remove __bio_chain_endio() and another to use WRITE_ONCE()?
This gives me the feeling of becoming rich overnight, since I'm making so
many patch contributions this time! :)
Thanks,
shida
> Thanks,
> Andreas
>
* Re: [PATCH 1/9] block: fix data loss and stale date exposure problems during append write
2025-11-21 16:13 ` Andreas Gruenbacher
2025-11-22 7:25 ` Stephen Zhang
@ 2025-11-28 3:22 ` Stephen Zhang
2025-11-28 5:55 ` Christoph Hellwig
1 sibling, 1 reply; 40+ messages in thread
From: Stephen Zhang @ 2025-11-28 3:22 UTC (permalink / raw)
To: Andreas Gruenbacher
Cc: Christoph Hellwig, linux-kernel, linux-block, nvdimm,
virtualization, linux-nvme, gfs2, ntfs3, linux-xfs, zhangshida
Andreas Gruenbacher <agruenba@redhat.com> wrote on Sat, Nov 22, 2025 at 00:13:
>
> On Fri, Nov 21, 2025 at 11:38 AM Christoph Hellwig <hch@infradead.org> wrote:
> > On Fri, Nov 21, 2025 at 04:17:40PM +0800, zhangshida wrote:
> > > From: Shida Zhang <zhangshida@kylinos.cn>
> > >
> > > Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
> > > ---
> > > block/bio.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/block/bio.c b/block/bio.c
> > > index b3a79285c27..55c2c1a0020 100644
> > > --- a/block/bio.c
> > > +++ b/block/bio.c
> > > @@ -322,7 +322,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
> > >
> > > static void bio_chain_endio(struct bio *bio)
> > > {
> > > - bio_endio(__bio_chain_endio(bio));
> > > + bio_endio(bio);
> >
> > I don't see how this can work. bio_chain_endio is called literally
> > as the result of calling bio_endio, so you recurse into that.
>
> Hmm, I don't actually see where: bio_endio() only calls
> __bio_chain_endio(), which is fine.
>
> Once bio_chain_endio() only calls bio_endio(), it can probably be
> removed in a follow-up patch.
>
> Also, loosely related, what I find slightly odd is this code in
> __bio_chain_endio():
>
> if (bio->bi_status && !parent->bi_status)
> parent->bi_status = bio->bi_status;
>
> I don't think it really matters whether or not parent->bi_status is
> already set here?
>
> Also, multiple completions can race setting bi_status, so shouldn't we
> at least have a WRITE_ONCE() here and in the other places that set
> bi_status?
>
I'm considering whether we need to add a WRITE_ONCE() in version 2
of this series.
From my understanding, WRITE_ONCE() prevents write merging and
tearing by ensuring the write operation is performed as a single, atomic
access. For instance, it stops the compiler from splitting a 32-bit write
into multiple 8-bit writes that could be interleaved with reads from other
CPUs.
However, since we're dealing with a single-byte (u8/blk_status_t) write,
it's naturally atomic at the hardware level. The CPU won't tear a byte-sized
write into separate bit-level operations.
Therefore, we could potentially change it to:
if (bio->bi_status && !READ_ONCE(parent->bi_status))
parent->bi_status = bio->bi_status;
But as you mentioned, the check might not be critical here. So ultimately,
we can simplify it to:
if (bio->bi_status)
parent->bi_status = bio->bi_status;
Thanks,
shida
> Thanks,
> Andreas
>
* Re: [PATCH 1/9] block: fix data loss and stale date exposure problems during append write
2025-11-28 3:22 ` Stephen Zhang
@ 2025-11-28 5:55 ` Christoph Hellwig
2025-11-28 6:26 ` Stephen Zhang
0 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-11-28 5:55 UTC (permalink / raw)
To: Stephen Zhang
Cc: Andreas Gruenbacher, Christoph Hellwig, linux-kernel, linux-block,
nvdimm, virtualization, linux-nvme, gfs2, ntfs3, linux-xfs,
zhangshida
On Fri, Nov 28, 2025 at 11:22:49AM +0800, Stephen Zhang wrote:
> Therefore, we could potentially change it to::
>
> if (bio->bi_status && !READ_ONCE(parent->bi_status))
> parent->bi_status = bio->bi_status;
>
> But as you mentioned, the check might not be critical here. So ultimately,
> we can simplify it to:
>
> if (bio->bi_status)
> parent->bi_status = bio->bi_status;
It might make sense to just use cmpxchg. See btrfs_bio_end_io as an
example (although it is operating on the btrfs_bio structure)
* Re: [PATCH 1/9] block: fix data loss and stale date exposure problems during append write
2025-11-28 5:55 ` Christoph Hellwig
@ 2025-11-28 6:26 ` Stephen Zhang
0 siblings, 0 replies; 40+ messages in thread
From: Stephen Zhang @ 2025-11-28 6:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andreas Gruenbacher, linux-kernel, linux-block, nvdimm,
virtualization, linux-nvme, gfs2, ntfs3, linux-xfs, zhangshida
Christoph Hellwig <hch@infradead.org> wrote on Fri, Nov 28, 2025 at 13:55:
>
> On Fri, Nov 28, 2025 at 11:22:49AM +0800, Stephen Zhang wrote:
> > Therefore, we could potentially change it to::
> >
> > if (bio->bi_status && !READ_ONCE(parent->bi_status))
> > parent->bi_status = bio->bi_status;
> >
> > But as you mentioned, the check might not be critical here. So ultimately,
> > we can simplify it to:
> >
> > if (bio->bi_status)
> > parent->bi_status = bio->bi_status;
>
> It might make sense to just use cmpxchg. See btrfs_bio_end_io as an
> example (although it is operating on the btrfs_bio structure)
>
Thanks for the suggestion! I learned something new today.
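For v2, I am thinking of something along these lines (an untested sketch
that mirrors the btrfs approach, assuming a byte-sized cmpxchg on
blk_status_t is fine on all supported architectures):

static struct bio *__bio_chain_endio(struct bio *bio)
{
	struct bio *parent = bio->bi_private;

	/* let only the first error win, even if completions race */
	if (bio->bi_status)
		cmpxchg(&parent->bi_status, BLK_STS_OK, bio->bi_status);
	bio_put(bio);
	return parent;
}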
Thanks,
Shida
* Re: [PATCH 1/9] block: fix data loss and stale date exposure problems during append write
2025-11-21 8:17 ` [PATCH 1/9] block: fix data loss and stale date exposure problems during append write zhangshida
2025-11-21 9:34 ` Johannes Thumshirn
2025-11-21 10:31 ` Christoph Hellwig
@ 2025-11-22 12:15 ` Ming Lei
2 siblings, 0 replies; 40+ messages in thread
From: Ming Lei @ 2025-11-22 12:15 UTC (permalink / raw)
To: zhangshida
Cc: linux-kernel, linux-block, nvdimm, virtualization, linux-nvme,
gfs2, ntfs3, linux-xfs, zhangshida
On Fri, Nov 21, 2025 at 04:17:40PM +0800, zhangshida wrote:
> From: Shida Zhang <zhangshida@kylinos.cn>
>
> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
> ---
> block/bio.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index b3a79285c27..55c2c1a0020 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -322,7 +322,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
>
> static void bio_chain_endio(struct bio *bio)
> {
> - bio_endio(__bio_chain_endio(bio));
> + bio_endio(bio);
> }
bio_chain_endio() should never get called, so how can this change make any
difference?
Thanks,
Ming
* [PATCH 2/9] block: export bio_chain_and_submit
2025-11-21 8:17 Fix potential data loss and corruption due to Incorrect BIO Chain Handling zhangshida
2025-11-21 8:17 ` [PATCH 1/9] block: fix data loss and stale date exposure problems during append write zhangshida
@ 2025-11-21 8:17 ` zhangshida
2025-11-21 10:32 ` Christoph Hellwig
2025-11-21 17:12 ` Andreas Gruenbacher
2025-11-21 8:17 ` [PATCH 3/9] gfs2: use bio_chain_and_submit for simplification zhangshida
` (8 subsequent siblings)
10 siblings, 2 replies; 40+ messages in thread
From: zhangshida @ 2025-11-21 8:17 UTC (permalink / raw)
To: linux-kernel
Cc: linux-block, nvdimm, virtualization, linux-nvme, gfs2, ntfs3,
linux-xfs, zhangshida, starzhangzsd
From: Shida Zhang <zhangshida@kylinos.cn>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
---
block/bio.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/block/bio.c b/block/bio.c
index 55c2c1a0020..a6912aa8d69 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -363,6 +363,7 @@ struct bio *bio_chain_and_submit(struct bio *prev, struct bio *new)
}
return new;
}
+EXPORT_SYMBOL_GPL(bio_chain_and_submit);
struct bio *blk_next_bio(struct bio *bio, struct block_device *bdev,
unsigned int nr_pages, blk_opf_t opf, gfp_t gfp)
--
2.34.1
* Re: [PATCH 2/9] block: export bio_chain_and_submit
2025-11-21 8:17 ` [PATCH 2/9] block: export bio_chain_and_submit zhangshida
@ 2025-11-21 10:32 ` Christoph Hellwig
2025-11-21 17:12 ` Andreas Gruenbacher
1 sibling, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-11-21 10:32 UTC (permalink / raw)
To: zhangshida
Cc: linux-kernel, linux-block, nvdimm, virtualization, linux-nvme,
gfs2, ntfs3, linux-xfs, zhangshida
This is almost missing a commit message. Also I think this
and the following patches are cleanups / refactoring, so maybe
let's concentrate on the bug fix first?
* Re: [PATCH 2/9] block: export bio_chain_and_submit
2025-11-21 8:17 ` [PATCH 2/9] block: export bio_chain_and_submit zhangshida
2025-11-21 10:32 ` Christoph Hellwig
@ 2025-11-21 17:12 ` Andreas Gruenbacher
2025-11-22 7:02 ` Stephen Zhang
1 sibling, 1 reply; 40+ messages in thread
From: Andreas Gruenbacher @ 2025-11-21 17:12 UTC (permalink / raw)
To: zhangshida
Cc: linux-kernel, linux-block, nvdimm, virtualization, linux-nvme,
gfs2, ntfs3, linux-xfs, zhangshida
On Fri, Nov 21, 2025 at 9:27 AM zhangshida <starzhangzsd@gmail.com> wrote:
> From: Shida Zhang <zhangshida@kylinos.cn>
>
> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
> ---
> block/bio.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/block/bio.c b/block/bio.c
> index 55c2c1a0020..a6912aa8d69 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -363,6 +363,7 @@ struct bio *bio_chain_and_submit(struct bio *prev, struct bio *new)
> }
> return new;
> }
> +EXPORT_SYMBOL_GPL(bio_chain_and_submit);
>
> struct bio *blk_next_bio(struct bio *bio, struct block_device *bdev,
> unsigned int nr_pages, blk_opf_t opf, gfp_t gfp)
> --
> 2.34.1
Can this and the following patches please go in a separate patch
queue? It's got nothing to do with the bug.
Thanks,
Andreas
* Re: [PATCH 2/9] block: export bio_chain_and_submit
2025-11-21 17:12 ` Andreas Gruenbacher
@ 2025-11-22 7:02 ` Stephen Zhang
0 siblings, 0 replies; 40+ messages in thread
From: Stephen Zhang @ 2025-11-22 7:02 UTC (permalink / raw)
To: Andreas Gruenbacher
Cc: linux-kernel, linux-block, nvdimm, virtualization, linux-nvme,
gfs2, ntfs3, linux-xfs, zhangshida
Andreas Gruenbacher <agruenba@redhat.com> 于2025年11月22日周六 01:12写道:
>
> On Fri, Nov 21, 2025 at 9:27 AM zhangshida <starzhangzsd@gmail.com> wrote:
> > From: Shida Zhang <zhangshida@kylinos.cn>
> >
> > Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
> > ---
> > block/bio.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index 55c2c1a0020..a6912aa8d69 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -363,6 +363,7 @@ struct bio *bio_chain_and_submit(struct bio *prev, struct bio *new)
> > }
> > return new;
> > }
> > +EXPORT_SYMBOL_GPL(bio_chain_and_submit);
> >
> > struct bio *blk_next_bio(struct bio *bio, struct block_device *bdev,
> > unsigned int nr_pages, blk_opf_t opf, gfp_t gfp)
> > --
> > 2.34.1
>
> Can this and the following patches please go in a separate patch
> queue? It's got nothing to do with the bug.
>
Should we necessarily separate it that way?
Currently, I am including all cleanups with the fix because it provides a
reason to CC all related communities. That way, developers who are
monitoring them can help identify similar problems if someone asks for
help in the future, provided that is the correct analysis and fix.
Thanks,
Shida
> Thanks,
> Andreas
>
* [PATCH 3/9] gfs2: use bio_chain_and_submit for simplification
2025-11-21 8:17 Fix potential data loss and corruption due to Incorrect BIO Chain Handling zhangshida
2025-11-21 8:17 ` [PATCH 1/9] block: fix data loss and stale date exposure problems during append write zhangshida
2025-11-21 8:17 ` [PATCH 2/9] block: export bio_chain_and_submit zhangshida
@ 2025-11-21 8:17 ` zhangshida
2025-11-21 8:17 ` [PATCH 4/9] xfs: " zhangshida
` (7 subsequent siblings)
10 siblings, 0 replies; 40+ messages in thread
From: zhangshida @ 2025-11-21 8:17 UTC (permalink / raw)
To: linux-kernel
Cc: linux-block, nvdimm, virtualization, linux-nvme, gfs2, ntfs3,
linux-xfs, zhangshida, starzhangzsd
From: Shida Zhang <zhangshida@kylinos.cn>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
---
fs/gfs2/lops.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 9c8c305a75c..c8c48593449 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -487,8 +487,7 @@ static struct bio *gfs2_chain_bio(struct bio *prev, unsigned int nr_iovecs)
new = bio_alloc(prev->bi_bdev, nr_iovecs, prev->bi_opf, GFP_NOIO);
bio_clone_blkg_association(new, prev);
new->bi_iter.bi_sector = bio_end_sector(prev);
- bio_chain(new, prev);
- submit_bio(prev);
+ bio_chain_and_submit(prev, new);
return new;
}
--
2.34.1
* [PATCH 4/9] xfs: use bio_chain_and_submit for simplification
2025-11-21 8:17 Fix potential data loss and corruption due to Incorrect BIO Chain Handling zhangshida
` (2 preceding siblings ...)
2025-11-21 8:17 ` [PATCH 3/9] gfs2: use bio_chain_and_submit for simplification zhangshida
@ 2025-11-21 8:17 ` zhangshida
2025-11-21 8:17 ` [PATCH 5/9] block: " zhangshida
` (6 subsequent siblings)
10 siblings, 0 replies; 40+ messages in thread
From: zhangshida @ 2025-11-21 8:17 UTC (permalink / raw)
To: linux-kernel
Cc: linux-block, nvdimm, virtualization, linux-nvme, gfs2, ntfs3,
linux-xfs, zhangshida, starzhangzsd
From: Shida Zhang <zhangshida@kylinos.cn>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
---
fs/xfs/xfs_bio_io.c | 3 +--
fs/xfs/xfs_buf.c | 3 +--
fs/xfs/xfs_log.c | 3 +--
3 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
index 2a736d10eaf..4a6577b0789 100644
--- a/fs/xfs/xfs_bio_io.c
+++ b/fs/xfs/xfs_bio_io.c
@@ -38,8 +38,7 @@ xfs_rw_bdev(
bio_max_vecs(count - done),
prev->bi_opf, GFP_KERNEL);
bio->bi_iter.bi_sector = bio_end_sector(prev);
- bio_chain(prev, bio);
- submit_bio(prev);
+ bio_chain_and_submit(prev, bio);
}
done += added;
} while (done < count);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 773d959965d..c26bd28edb4 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1357,8 +1357,7 @@ xfs_buf_submit_bio(
split = bio_split(bio, bp->b_maps[map].bm_len, GFP_NOFS,
&fs_bio_set);
split->bi_iter.bi_sector = bp->b_maps[map].bm_bn;
- bio_chain(split, bio);
- submit_bio(split);
+ bio_chain_and_submit(split, bio);
}
bio->bi_iter.bi_sector = bp->b_maps[map].bm_bn;
submit_bio(bio);
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 603e85c1ab4..f4c9ad1d148 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1687,8 +1687,7 @@ xlog_write_iclog(
split = bio_split(&iclog->ic_bio, log->l_logBBsize - bno,
GFP_NOIO, &fs_bio_set);
- bio_chain(split, &iclog->ic_bio);
- submit_bio(split);
+ bio_chain_and_submit(split, &iclog->ic_bio);
/* restart at logical offset zero for the remainder */
iclog->ic_bio.bi_iter.bi_sector = log->l_logBBstart;
--
2.34.1
* [PATCH 5/9] block: use bio_chain_and_submit for simplification
2025-11-21 8:17 Fix potential data loss and corruption due to Incorrect BIO Chain Handling zhangshida
` (3 preceding siblings ...)
2025-11-21 8:17 ` [PATCH 4/9] xfs: " zhangshida
@ 2025-11-21 8:17 ` zhangshida
2025-11-21 8:17 ` [PATCH 6/9] fs/ntfs3: " zhangshida
` (5 subsequent siblings)
10 siblings, 0 replies; 40+ messages in thread
From: zhangshida @ 2025-11-21 8:17 UTC (permalink / raw)
To: linux-kernel
Cc: linux-block, nvdimm, virtualization, linux-nvme, gfs2, ntfs3,
linux-xfs, zhangshida, starzhangzsd
From: Shida Zhang <zhangshida@kylinos.cn>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
---
fs/squashfs/block.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/squashfs/block.c b/fs/squashfs/block.c
index a05e3793f93..5818e473255 100644
--- a/fs/squashfs/block.c
+++ b/fs/squashfs/block.c
@@ -126,8 +126,7 @@ static int squashfs_bio_read_cached(struct bio *fullbio,
if (bio) {
bio_trim(bio, start_idx * PAGE_SECTORS,
(end_idx - start_idx) * PAGE_SECTORS);
- bio_chain(bio, new);
- submit_bio(bio);
+ bio_chain_and_submit(bio, new);
}
bio = new;
--
2.34.1
* [PATCH 6/9] fs/ntfs3: use bio_chain_and_submit for simplification
2025-11-21 8:17 Fix potential data loss and corruption due to Incorrect BIO Chain Handling zhangshida
` (4 preceding siblings ...)
2025-11-21 8:17 ` [PATCH 5/9] block: " zhangshida
@ 2025-11-21 8:17 ` zhangshida
2025-11-21 8:17 ` [PATCH 7/9] zram: " zhangshida
` (4 subsequent siblings)
10 siblings, 0 replies; 40+ messages in thread
From: zhangshida @ 2025-11-21 8:17 UTC (permalink / raw)
To: linux-kernel
Cc: linux-block, nvdimm, virtualization, linux-nvme, gfs2, ntfs3,
linux-xfs, zhangshida, starzhangzsd
From: Shida Zhang <zhangshida@kylinos.cn>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
---
fs/ntfs3/fsntfs.c | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)
diff --git a/fs/ntfs3/fsntfs.c b/fs/ntfs3/fsntfs.c
index c7a2f191254..35685ee4ed2 100644
--- a/fs/ntfs3/fsntfs.c
+++ b/fs/ntfs3/fsntfs.c
@@ -1514,11 +1514,7 @@ int ntfs_bio_pages(struct ntfs_sb_info *sbi, const struct runs_tree *run,
len = ((u64)clen << cluster_bits) - off;
new_bio:
new = bio_alloc(bdev, nr_pages - page_idx, op, GFP_NOFS);
- if (bio) {
- bio_chain(bio, new);
- submit_bio(bio);
- }
- bio = new;
+ bio = bio_chain_and_submit(bio, new);
bio->bi_iter.bi_sector = lbo >> 9;
while (len) {
@@ -1611,11 +1607,7 @@ int ntfs_bio_fill_1(struct ntfs_sb_info *sbi, const struct runs_tree *run)
len = (u64)clen << cluster_bits;
new_bio:
new = bio_alloc(bdev, BIO_MAX_VECS, REQ_OP_WRITE, GFP_NOFS);
- if (bio) {
- bio_chain(bio, new);
- submit_bio(bio);
- }
- bio = new;
+ bio = bio_chain_and_submit(bio, new);
bio->bi_iter.bi_sector = lbo >> 9;
for (;;) {
--
2.34.1
* [PATCH 7/9] zram: use bio_chain_and_submit for simplification
2025-11-21 8:17 Fix potential data loss and corruption due to Incorrect BIO Chain Handling zhangshida
` (5 preceding siblings ...)
2025-11-21 8:17 ` [PATCH 6/9] fs/ntfs3: " zhangshida
@ 2025-11-21 8:17 ` zhangshida
2025-11-21 8:17 ` [PATCH 8/9] nvmet: fix the potential bug and " zhangshida
` (3 subsequent siblings)
10 siblings, 0 replies; 40+ messages in thread
From: zhangshida @ 2025-11-21 8:17 UTC (permalink / raw)
To: linux-kernel
Cc: linux-block, nvdimm, virtualization, linux-nvme, gfs2, ntfs3,
linux-xfs, zhangshida, starzhangzsd
From: Shida Zhang <zhangshida@kylinos.cn>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
---
drivers/block/zram/zram_drv.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a4307465753..084de60ebaf 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -730,8 +730,7 @@ static void read_from_bdev_async(struct zram *zram, struct page *page,
bio = bio_alloc(zram->bdev, 1, parent->bi_opf, GFP_NOIO);
bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9);
__bio_add_page(bio, page, PAGE_SIZE, 0);
- bio_chain(bio, parent);
- submit_bio(bio);
+ bio_chain_and_submit(bio, parent);
}
static int zram_writeback_slots(struct zram *zram, struct zram_pp_ctl *ctl)
--
2.34.1
* [PATCH 8/9] nvmet: fix the potential bug and use bio_chain_and_submit for simplification
2025-11-21 8:17 Fix potential data loss and corruption due to Incorrect BIO Chain Handling zhangshida
` (6 preceding siblings ...)
2025-11-21 8:17 ` [PATCH 7/9] zram: " zhangshida
@ 2025-11-21 8:17 ` zhangshida
2025-11-21 8:17 ` [PATCH 9/9] nvdimm: " zhangshida
` (2 subsequent siblings)
10 siblings, 0 replies; 40+ messages in thread
From: zhangshida @ 2025-11-21 8:17 UTC (permalink / raw)
To: linux-kernel
Cc: linux-block, nvdimm, virtualization, linux-nvme, gfs2, ntfs3,
linux-xfs, zhangshida, starzhangzsd
From: Shida Zhang <zhangshida@kylinos.cn>
The prev and new bios may be passed in the wrong positions...
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
---
drivers/nvme/target/io-cmd-bdev.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 8d246b8ca60..4af45659bd2 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -312,8 +312,7 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
opf, GFP_KERNEL);
bio->bi_iter.bi_sector = sector;
- bio_chain(bio, prev);
- submit_bio(prev);
+ bio_chain_and_submit(prev, bio);
}
sector += sg->length >> 9;
--
2.34.1
* [PATCH 9/9] nvdimm: use bio_chain_and_submit for simplification
2025-11-21 8:17 Fix potential data loss and corruption due to Incorrect BIO Chain Handling zhangshida
` (7 preceding siblings ...)
2025-11-21 8:17 ` [PATCH 8/9] nvmet: fix the potential bug and " zhangshida
@ 2025-11-21 8:17 ` zhangshida
2025-11-21 10:37 ` Fix potential data loss and corruption due to Incorrect BIO Chain Handling Christoph Hellwig
2025-11-22 3:35 ` Ming Lei
10 siblings, 0 replies; 40+ messages in thread
From: zhangshida @ 2025-11-21 8:17 UTC (permalink / raw)
To: linux-kernel
Cc: linux-block, nvdimm, virtualization, linux-nvme, gfs2, ntfs3,
linux-xfs, zhangshida, starzhangzsd
From: Shida Zhang <zhangshida@kylinos.cn>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
---
drivers/nvdimm/nd_virtio.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index c3f07be4aa2..e6ec7ceee9b 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -122,8 +122,7 @@ int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
return -ENOMEM;
bio_clone_blkg_association(child, bio);
child->bi_iter.bi_sector = -1;
- bio_chain(child, bio);
- submit_bio(child);
+ bio_chain_and_submit(child, bio);
return 0;
}
if (virtio_pmem_flush(nd_region))
--
2.34.1
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-21 8:17 Fix potential data loss and corruption due to Incorrect BIO Chain Handling zhangshida
` (8 preceding siblings ...)
2025-11-21 8:17 ` [PATCH 9/9] nvdimm: " zhangshida
@ 2025-11-21 10:37 ` Christoph Hellwig
2025-11-22 6:38 ` Stephen Zhang
2025-11-22 3:35 ` Ming Lei
10 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-11-21 10:37 UTC (permalink / raw)
To: zhangshida
Cc: linux-kernel, linux-block, nvdimm, virtualization, linux-nvme,
gfs2, ntfs3, linux-xfs, zhangshida
On Fri, Nov 21, 2025 at 04:17:39PM +0800, zhangshida wrote:
> We have captured four instances of this corruption in our production
> environment.
> In each case, we observed a distinct pattern:
> The corruption starts at an offset that aligns with the beginning of
> an XFS extent.
> The corruption ends at an offset that is aligned to the system's
> `PAGE_SIZE` (64KB in our case).
>
> Corruption Instances:
> 1. Start:`0x73be000`, **End:** `0x73c0000` (Length: 8KB)
> 2. Start:`0x10791a000`, **End:** `0x107920000` (Length: 24KB)
> 3. Start:`0x14535a000`, **End:** `0x145b70000` (Length: 8280KB)
> 4. Start:`0x370d000`, **End:** `0x3710000` (Length: 12KB)
Do you have a somewhat isolated reproducer for this?
> After analysis, we believe the root cause is in the handling of chained
> bios, specifically related to out-of-order io completion.
>
> Consider a bio chain where `bi_remaining` is decremented as each bio in
> the chain completes.
> For example,
> if a chain consists of three bios (bio1 -> bio2 -> bio3) with
> bi_remaining count:
> 1->2->2
> if the bio completes in the reverse order, there will be a problem.
> if bio 3 completes first, it will become:
> 1->2->1
> then bio 2 completes:
> 1->1->0
>
> Because `bi_remaining` has reached zero, the final `end_io` callback
> for the entire chain is triggered, even though not all bios in the
> chain have actually finished processing. This premature completion can
> lead to stale data being exposed, as seen in our case.
It sounds like there is a problem because bi_remaining is only
incremented after already submitting a bio. Which code path do you
see this with? iomap doesn't chain bios, so is this the buffer cache
or log code? Or is there a remapping driver involved?
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-21 10:37 ` Fix potential data loss and corruption due to Incorrect BIO Chain Handling Christoph Hellwig
@ 2025-11-22 6:38 ` Stephen Zhang
2025-11-24 6:22 ` Christoph Hellwig
0 siblings, 1 reply; 40+ messages in thread
From: Stephen Zhang @ 2025-11-22 6:38 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-kernel, linux-block, nvdimm, virtualization, linux-nvme,
gfs2, ntfs3, linux-xfs, zhangshida
Christoph Hellwig <hch@infradead.org> wrote on Fri, Nov 21, 2025 at 18:37:
>
> On Fri, Nov 21, 2025 at 04:17:39PM +0800, zhangshida wrote:
> > We have captured four instances of this corruption in our production
> > environment.
> > In each case, we observed a distinct pattern:
> > The corruption starts at an offset that aligns with the beginning of
> > an XFS extent.
> > The corruption ends at an offset that is aligned to the system's
> > `PAGE_SIZE` (64KB in our case).
> >
> > Corruption Instances:
> > 1. Start:`0x73be000`, **End:** `0x73c0000` (Length: 8KB)
> > 2. Start:`0x10791a000`, **End:** `0x107920000` (Length: 24KB)
> > 3. Start:`0x14535a000`, **End:** `0x145b70000` (Length: 8280KB)
> > 4. Start:`0x370d000`, **End:** `0x3710000` (Length: 12KB)
>
> Do you have a somewhat isolated reproducer for this?
>
=====The background=====
Sorry, we do not have a consistent way to reproduce this issue, as the
data was collected from a production environment run by database
providers. However, the I/O model is straightforward:
t0:A -> B
About 100 minutes later...
t1:A -> C
A, B, and C are three separate physical machines.
A is the primary writer. It runs a MySQL instance where a single thread
appends data to a log file.
B is a reader. At time t0, this server concurrently reads the log file from
A to perform a CRC check for backup purposes. The CRC check confirms
that the data on both A and B is identical and correct.
C is another reader. At time t1 (about 100 minutes after the A -> B backup
completed), this server also reads the log file from A for a CRC check.
However, this check on C indicates that the data is corrupt.
The most unusual part is that upon investigation, the data on the primary
server A was also found to be corrupt at this later time, differing from the
correct state it was in at t0.
Another factor to consider is that memory reclamation is active, as the
environment uses cgroups to restrict resources, including memory.
After inspecting the corrupted data, we believe it did not originate from
any existing file. Instead, it appears to be raw data that was on the disk
before the intended I/O write was successfully completed.
This raises the question: how could a write I/O fail to make it to the disk?
A hardware problem seems unlikely, as different RAID cards and disks
are in use across the systems.
Most importantly, the corruption exhibits a distinct pattern:
It starts at an offset that aligns with the beginning of an XFS extent.
It ends at an offset that aligns with the system's PAGE_SIZE.
The starting address of an extent is an internal concept within the
filesystem, transparent to both user-space applications and lower-level
kernel modules. This makes it highly suspicious that the corruption
always begins at this specific boundary. This suggests a potential bug in
the XFS logic for handling append-writes to a file.
All we can do now is to do some desperate code analysis to see if we
can catch the bug.
======code analysis======
In kernel version 4.19, XFS handles extent I/O using the ioend structure,
which appears to represent a block of I/O to a contiguous region of disk space.
This is broken down into multiple bio structures, as a single bio cannot
handle a very large I/O range:
| page 1| page 2 | ...| page N |
|<-------------ioend-------------->|
| bio 1 | bio 2 | bio 3 |
To manage a large write, a chain of bio structures is used:
bio 1 -> bio 2 -> bio 3
All bios in this chain share a single end_io callback, which should only
be triggered after all I/O operations in the chain have completed.
The kernel uses the bi_remaining atomic counter on the first bio in the
chain to track completion, like:
1 -> 2 -> 2
if bio 1 completes, it will become:
1 -> 1 -> 2
if bio 2 completes:
1 -> 1 -> 1
if bio 3 completes:
1 -> 1 -> 0
So it is time to trigger the end_io callback, since all I/O is done.
But how does it handle a series of out-of-order completions?
if bio 3 completes first, it will become:
1 -> 2 -> 1
if bio 2 completes, since it seems to forget to CHECK IF THE
FIRST BIO REACHED 0 and goes to the next bio directly,
---c code----
static struct bio *__bio_chain_endio(struct bio *bio)
{
struct bio *parent = bio->bi_private;
if (bio->bi_status && !parent->bi_status)
parent->bi_status = bio->bi_status;
bio_put(bio);
return parent;
}
static void bio_chain_endio(struct bio *bio)
{
bio_endio(__bio_chain_endio(bio));
}
----c code----
it will become:
1 -> 2 -> 0
So the code thinks it is time to trigger the end_io callback since all I/O is
done, which is not actually the case: bio 1 is still in an unknown state.
> > After analysis, we believe the root cause is in the handling of chained
> > bios, specifically related to out-of-order io completion.
> >
> > Consider a bio chain where `bi_remaining` is decremented as each bio in
> > the chain completes.
> > For example,
> > if a chain consists of three bios (bio1 -> bio2 -> bio3) with
> > bi_remaining count:
> > 1->2->2
> > if the bio completes in the reverse order, there will be a problem.
> > if bio 3 completes first, it will become:
> > 1->2->1
> > then bio 2 completes:
> > 1->1->0
> >
> > Because `bi_remaining` has reached zero, the final `end_io` callback
> > for the entire chain is triggered, even though not all bios in the
> > chain have actually finished processing. This premature completion can
> > lead to stale data being exposed, as seen in our case.
>
> It sounds like there is a problem because bi_remaining is only
> incremented after already submitting a bio. Which code path do you
> see this with? iomap doesn't chain bios, so is this the buffer cache
> or log code? Or is there a remapping driver involved?
>
Yep. The commit below:
commit ae5535efd8c445ad6033ac0d5da0197897b148ea
Author: Christoph Hellwig <hch@lst.de>
Date: Thu Dec 7 08:27:05 2023 +0100
iomap: don't chain bios
changes the logic. Since there are still many code paths that use
bio_chain, I am including these cleanups with the fix. This provides a reason
to CC all related communities. That way, developers who are monitoring
this can help identify similar problems if someone asks for help in the future,
if that is the right analysis and fix.
Thanks,
Shida
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-22 6:38 ` Stephen Zhang
@ 2025-11-24 6:22 ` Christoph Hellwig
2025-11-27 7:05 ` Stephen Zhang
0 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-11-24 6:22 UTC (permalink / raw)
To: Stephen Zhang
Cc: Christoph Hellwig, linux-kernel, linux-block, nvdimm,
virtualization, linux-nvme, gfs2, ntfs3, linux-xfs, zhangshida
On Sat, Nov 22, 2025 at 02:38:59PM +0800, Stephen Zhang wrote:
> ======code analysis======
> In kernel version 4.19, XFS handles extent I/O using the ioend structure,
Linux 4.19 is more than four years old, and both the block I/O code
and the XFS/iomap code changed a lot since then.
> changes the logic. Since there are still many code paths that use
> bio_chain, I am including these cleanups with the fix. This provides a reason
> to CC all related communities. That way, developers who are monitoring
> this can help identify similar problems if someone asks for help in the future,
> if that is the right analysis and fix.
As many pointed out, something in the analysis doesn't add up. How did
you even manage to call bio_chain_endio, as almost no one should be
calling it? Are you using bcache? Are there other callers in the
obsolete kernel you are using? Are they calling it without calling
bio_endio first (which the bcache case does, and which is buggy)?
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-24 6:22 ` Christoph Hellwig
@ 2025-11-27 7:05 ` Stephen Zhang
2025-11-27 7:14 ` Christoph Hellwig
0 siblings, 1 reply; 40+ messages in thread
From: Stephen Zhang @ 2025-11-27 7:05 UTC (permalink / raw)
To: Christoph Hellwig, Ming Lei, Andreas Gruenbacher
Cc: linux-kernel, linux-block, nvdimm, virtualization, linux-nvme,
gfs2, ntfs3, linux-xfs, zhangshida
Christoph Hellwig <hch@infradead.org> wrote on Mon, Nov 24, 2025 at 14:22:
>
> On Sat, Nov 22, 2025 at 02:38:59PM +0800, Stephen Zhang wrote:
> > ======code analysis======
> > In kernel version 4.19, XFS handles extent I/O using the ioend structure,
>
> Linux 4.19 is more than four years old, and both the block I/O code
> and the XFS/iomap code changed a lot since then.
>
> > changes the logic. Since there are still many code paths that use
> > bio_chain, I am including these cleanups with the fix. This provides a reason
> > to CC all related communities. That way, developers who are monitoring
> > this can help identify similar problems if someone asks for help in the future,
> > if that is the right analysis and fix.
>
> As many pointed out, something in the analysis doesn't add up. How did
> you even manage to call bio_chain_endio, as almost no one should be
> calling it? Are you using bcache? Are there other callers in the
> obsolete kernel you are using? Are they calling it without calling
> bio_endio first (which the bcache case does, and which is buggy)?
>
No, they are not using bcache.
This problem is now believed to be related to the following commit:
-------------
commit 9f9bc034b84958523689347ee2bdd9c660008e5e
Author: Brian Foster <bfoster@redhat.com>
Date: Fri Feb 1 09:14:22 2019 -0800
xfs: update fork seq counter on data fork changes
diff --git a/fs/xfs/libxfs/xfs_iext_tree.c b/fs/xfs/libxfs/xfs_iext_tree.c
index 771dd072015d..bc690f2409fa 100644
--- a/fs/xfs/libxfs/xfs_iext_tree.c
+++ b/fs/xfs/libxfs/xfs_iext_tree.c
@@ -614,16 +614,15 @@ xfs_iext_realloc_root(
}
static inline void xfs_iext_inc_seq(struct xfs_ifork *ifp, int state)
{
- if (state & BMAP_COWFORK)
- WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1);
+ WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1);
}
----------
Link: https://lore.kernel.org/linux-xfs/20190201143256.43232-3-bfoster@redhat.com/
---------
Without this commit, a race condition can occur between the EOF trim
worker, sequential buffer writes, and writeback. This race causes writeback
to use a stale iomap, which leads to I/O being sent to sectors that have
already been trimmed.
If there are no further objections or other insights regarding this issue,
I will proceed with creating a v2 of this series.
Thanks,
shida
^ permalink raw reply related [flat|nested] 40+ messages in thread* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-27 7:05 ` Stephen Zhang
@ 2025-11-27 7:14 ` Christoph Hellwig
2025-11-27 7:40 ` Gao Xiang
2025-11-28 1:29 ` Stephen Zhang
0 siblings, 2 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-11-27 7:14 UTC (permalink / raw)
To: Stephen Zhang
Cc: Christoph Hellwig, Ming Lei, Andreas Gruenbacher, linux-kernel,
linux-block, nvdimm, virtualization, linux-nvme, gfs2, ntfs3,
linux-xfs, zhangshida
On Thu, Nov 27, 2025 at 03:05:29PM +0800, Stephen Zhang wrote:
> No, they are not using bcache.
Then please figure out how bio_chain_endio even gets called in this
setup. I think for mainline the approach should be to fix bcache
and erofs to not call into ->bi_end_io, and add a BUG_ON() to
bio_chain_endio to ensure no new callers appear.
> If there are no further objections or other insights regarding this issue,
> I will proceed with creating a v2 of this series.
Not sure how that is helpful. You have a problem on a kernel from the
stone age, can't explain what actually happens, and propose something that is
mostly a no-op in mainline, with the callers that could even reach the
area being clear API misuse.
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-27 7:14 ` Christoph Hellwig
@ 2025-11-27 7:40 ` Gao Xiang
2025-11-27 14:46 ` Christoph Hellwig
2025-11-28 1:29 ` Stephen Zhang
1 sibling, 1 reply; 40+ messages in thread
From: Gao Xiang @ 2025-11-27 7:40 UTC (permalink / raw)
To: Christoph Hellwig, Stephen Zhang
Cc: Ming Lei, Andreas Gruenbacher, linux-kernel, linux-block, nvdimm,
virtualization, linux-nvme, gfs2, ntfs3, linux-xfs, zhangshida
Hi Christoph,
On 2025/11/27 15:14, Christoph Hellwig wrote:
> On Thu, Nov 27, 2025 at 03:05:29PM +0800, Stephen Zhang wrote:
>> No, they are not using bcache.
>
> Then please figure out how bio_chain_endio even gets called in this
> setup. I think for mainline the approach should be to fix bcache
> and eorfs to not call into ->bi_end_io and add a BUG_ON() to
Just saw this.
For erofs, let me fix this directly to use bio_endio() instead
and go through the erofs (although it doesn't matter in practice
since no chain i/os for erofs and bio interfaces are unique and
friendly to operate bvecs for both block or non-block I/Os
compared to awkward bvec interfaces) and I will Cc you, Ming
and Stephen then.
Thanks,
Gao Xiang
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-27 7:40 ` Gao Xiang
@ 2025-11-27 14:46 ` Christoph Hellwig
2025-11-28 1:32 ` Stephen Zhang
0 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-11-27 14:46 UTC (permalink / raw)
To: Gao Xiang
Cc: Christoph Hellwig, Stephen Zhang, Ming Lei, Andreas Gruenbacher,
linux-kernel, linux-block, nvdimm, virtualization, linux-nvme,
gfs2, ntfs3, linux-xfs, zhangshida
On Thu, Nov 27, 2025 at 03:40:20PM +0800, Gao Xiang wrote:
> For erofs, let me fix this directly to use bio_endio() instead
> and go through the erofs (although it doesn't matter in practice
> since no chain i/os for erofs and bio interfaces are unique and
> friendly to operate bvecs for both block or non-block I/Os
> compared to awkward bvec interfaces) and I will Cc you, Ming
> and Stephen then.
Thanks. I'll ping Coly for bcache.
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-27 14:46 ` Christoph Hellwig
@ 2025-11-28 1:32 ` Stephen Zhang
0 siblings, 0 replies; 40+ messages in thread
From: Stephen Zhang @ 2025-11-28 1:32 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Gao Xiang, Ming Lei, Andreas Gruenbacher, linux-kernel,
linux-block, nvdimm, virtualization, linux-nvme, gfs2, ntfs3,
linux-xfs, zhangshida
Christoph Hellwig <hch@infradead.org> wrote on Thu, Nov 27, 2025 at 22:46:
>
> On Thu, Nov 27, 2025 at 03:40:20PM +0800, Gao Xiang wrote:
> > For erofs, let me fix this directly to use bio_endio() instead
> > and go through the erofs (although it doesn't matter in practice
> > since no chain i/os for erofs and bio interfaces are unique and
> > friendly to operate bvecs for both block or non-block I/Os
> > compared to awkward bvec interfaces) and I will Cc you, Ming
> > and Stephen then.
>
> Thanks. I'll ping Coly for bcache.
>
I would like the opportunity to fix this issue in bcache. From my analysis, it
seems the solution may be as straightforward as replacing the direct
->bi_end_io() invocation with a call to bio_endio().
Thanks,
shida
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-27 7:14 ` Christoph Hellwig
2025-11-27 7:40 ` Gao Xiang
@ 2025-11-28 1:29 ` Stephen Zhang
1 sibling, 0 replies; 40+ messages in thread
From: Stephen Zhang @ 2025-11-28 1:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Ming Lei, Andreas Gruenbacher, linux-kernel, linux-block, nvdimm,
virtualization, linux-nvme, gfs2, ntfs3, linux-xfs, zhangshida
Christoph Hellwig <hch@infradead.org> wrote on Thu, Nov 27, 2025 at 15:17:
>
> On Thu, Nov 27, 2025 at 03:05:29PM +0800, Stephen Zhang wrote:
> > No, they are not using bcache.
>
> Then please figure out how bio_chain_endio even gets called in this
> setup. I think for mainline the approach should be to fix bcache
> and erofs to not call into ->bi_end_io, and add a BUG_ON() to
> bio_chain_endio to ensure no new callers appear.
>
Okay, thanks for the suggestion.
> > If there are no further objections or other insights regarding this issue,
> > I will proceed with creating a v2 of this series.
>
> Not sure how that is helpful. You have a problem on a kernel from the
> stone age, can't explain what actually happens, and propose something that is
> mostly a no-op in mainline, with the callers that could even reach the
> area being clear API misuse.
>
Analysis of the 4.19 kernel bug confirmed it was not caused by the
->bi_end_io call. Instead, this investigation led us to discover a different bug
in the upstream kernel. The v2 patch series is dedicated to fixing this newly
found upstream issue.
Thanks,
shida
>
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-21 8:17 Fix potential data loss and corruption due to Incorrect BIO Chain Handling zhangshida
` (9 preceding siblings ...)
2025-11-21 10:37 ` Fix potential data loss and corruption due to Incorrect BIO Chain Handling Christoph Hellwig
@ 2025-11-22 3:35 ` Ming Lei
2025-11-22 6:42 ` Stephen Zhang
10 siblings, 1 reply; 40+ messages in thread
From: Ming Lei @ 2025-11-22 3:35 UTC (permalink / raw)
To: zhangshida
Cc: linux-kernel, linux-block, nvdimm, virtualization, linux-nvme,
gfs2, ntfs3, linux-xfs, zhangshida
On Fri, Nov 21, 2025 at 04:17:39PM +0800, zhangshida wrote:
> From: Shida Zhang <zhangshida@kylinos.cn>
>
> Hello everyone,
>
> We have recently encountered a severe data loss issue on kernel version 4.19,
> and we suspect the same underlying problem may exist in the latest kernel versions.
>
> Environment:
> * **Architecture:** arm64
> * **Page Size:** 64KB
> * **Filesystem:** XFS with a 4KB block size
>
> Scenario:
> The issue occurs while running a MySQL instance where one thread appends data
> to a log file, and a separate thread concurrently reads that file to perform
> CRC checks on its contents.
>
> Problem Description:
> Occasionally, the reading thread detects data corruption. Specifically, it finds
> that stale data has been exposed in the middle of the file.
>
> We have captured four instances of this corruption in our production environment.
> In each case, we observed a distinct pattern:
> The corruption starts at an offset that aligns with the beginning of an XFS extent.
> The corruption ends at an offset that is aligned to the system's `PAGE_SIZE` (64KB in our case).
>
> Corruption Instances:
> 1. Start:`0x73be000`, **End:** `0x73c0000` (Length: 8KB)
> 2. Start:`0x10791a000`, **End:** `0x107920000` (Length: 24KB)
> 3. Start:`0x14535a000`, **End:** `0x145b70000` (Length: 8280KB)
> 4. Start:`0x370d000`, **End:** `0x3710000` (Length: 12KB)
>
> After analysis, we believe the root cause is in the handling of chained bios, specifically
> related to out-of-order io completion.
>
> Consider a bio chain where `bi_remaining` is decremented as each bio in the chain completes.
> For example,
> if a chain consists of three bios (bio1 -> bio2 -> bio3) with
> bi_remaining count:
> 1->2->2
Right.
> if the bio completes in the reverse order, there will be a problem.
> if bio 3 completes first, it will become:
> 1->2->1
Yes.
> then bio 2 completes:
> 1->1->0
No, it is supposed to be 1->1->1.
When bio 1 completes, it will become 0->0->0
bio3's `__bi_remaining` won't drop to zero until bio2's reaches
zero, and bio2 won't be done until bio1 is ended.
Please look at bio_endio():
void bio_endio(struct bio *bio)
{
again:
if (!bio_remaining_done(bio))
return;
...
if (bio->bi_end_io == bio_chain_endio) {
bio = __bio_chain_endio(bio);
goto again;
}
...
}
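For reference, this works because bio_chain() wires the child to its parent
and bumps the parent's remaining count up front, roughly like this
(paraphrasing block/bio.c):

void bio_chain(struct bio *bio, struct bio *parent)
{
	BUG_ON(bio->bi_private || bio->bi_end_io);

	bio->bi_private = parent;
	bio->bi_end_io = bio_chain_endio;
	bio_inc_remaining(parent);	/* parent cannot complete before this child */
}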
Thanks,
Ming
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-22 3:35 ` Ming Lei
@ 2025-11-22 6:42 ` Stephen Zhang
2025-11-22 7:46 ` Andreas Gruenbacher
2025-11-22 12:01 ` Ming Lei
0 siblings, 2 replies; 40+ messages in thread
From: Stephen Zhang @ 2025-11-22 6:42 UTC (permalink / raw)
To: Ming Lei
Cc: linux-kernel, linux-block, nvdimm, virtualization, linux-nvme,
gfs2, ntfs3, linux-xfs, zhangshida
Ming Lei <ming.lei@redhat.com> wrote on Sat, Nov 22, 2025 at 11:35:
>
> On Fri, Nov 21, 2025 at 04:17:39PM +0800, zhangshida wrote:
> > From: Shida Zhang <zhangshida@kylinos.cn>
> >
> > Hello everyone,
> >
> > We have recently encountered a severe data loss issue on kernel version 4.19,
> > and we suspect the same underlying problem may exist in the latest kernel versions.
> >
> > Environment:
> > * **Architecture:** arm64
> > * **Page Size:** 64KB
> > * **Filesystem:** XFS with a 4KB block size
> >
> > Scenario:
> > The issue occurs while running a MySQL instance where one thread appends data
> > to a log file, and a separate thread concurrently reads that file to perform
> > CRC checks on its contents.
> >
> > Problem Description:
> > Occasionally, the reading thread detects data corruption. Specifically, it finds
> > that stale data has been exposed in the middle of the file.
> >
> > We have captured four instances of this corruption in our production environment.
> > In each case, we observed a distinct pattern:
> > The corruption starts at an offset that aligns with the beginning of an XFS extent.
> > The corruption ends at an offset that is aligned to the system's `PAGE_SIZE` (64KB in our case).
> >
> > Corruption Instances:
> > 1. Start:`0x73be000`, **End:** `0x73c0000` (Length: 8KB)
> > 2. Start:`0x10791a000`, **End:** `0x107920000` (Length: 24KB)
> > 3. Start:`0x14535a000`, **End:** `0x145b70000` (Length: 8280KB)
> > 4. Start:`0x370d000`, **End:** `0x3710000` (Length: 12KB)
> >
> > After analysis, we believe the root cause is in the handling of chained bios, specifically
> > related to out-of-order io completion.
> >
> > Consider a bio chain where `bi_remaining` is decremented as each bio in the chain completes.
> > For example,
> > if a chain consists of three bios (bio1 -> bio2 -> bio3) with
> > bi_remaining count:
> > 1->2->2
>
> Right.
>
> > if the bio completes in the reverse order, there will be a problem.
> > if bio 3 completes first, it will become:
> > 1->2->1
>
> Yes.
>
> > then bio 2 completes:
> > 1->1->0
>
> No, it is supposed to be 1->1->1.
>
> When bio 1 completes, it will become 0->0->0
>
> bio3's `__bi_remaining` won't drop to zero until bio2's reaches
> zero, and bio2 won't be done until bio1 is ended.
>
> Please look at bio_endio():
>
> void bio_endio(struct bio *bio)
> {
> again:
> if (!bio_remaining_done(bio))
> return;
> ...
> if (bio->bi_end_io == bio_chain_endio) {
> bio = __bio_chain_endio(bio);
> goto again;
> }
> ...
> }
>
Exactly, bio_endio() handles the process perfectly, but the path below seems
to forget to check whether the very first `__bi_remaining` has dropped to zero
before proceeding to the next bio:
-----
static struct bio *__bio_chain_endio(struct bio *bio)
{
	struct bio *parent = bio->bi_private;

	if (bio->bi_status && !parent->bi_status)
		parent->bi_status = bio->bi_status;
	bio_put(bio);
	return parent;
}

static void bio_chain_endio(struct bio *bio)
{
	bio_endio(__bio_chain_endio(bio));
}
----
Thanks,
Shida
>
> Thanks,
> Ming
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-22 6:42 ` Stephen Zhang
@ 2025-11-22 7:46 ` Andreas Gruenbacher
2025-11-22 12:01 ` Ming Lei
1 sibling, 0 replies; 40+ messages in thread
From: Andreas Gruenbacher @ 2025-11-22 7:46 UTC (permalink / raw)
To: Stephen Zhang
Cc: Ming Lei, linux-kernel, linux-block, nvdimm, virtualization,
linux-nvme, gfs2, ntfs3, linux-xfs, zhangshida
On Sat, Nov 22, 2025 at 7:52 AM Stephen Zhang <starzhangzsd@gmail.com> wrote:
> On Sat, Nov 22, 2025 at 11:35 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Fri, Nov 21, 2025 at 04:17:39PM +0800, zhangshida wrote:
> > > From: Shida Zhang <zhangshida@kylinos.cn>
> > >
> > > Hello everyone,
> > >
> > > We have recently encountered a severe data loss issue on kernel version 4.19,
> > > and we suspect the same underlying problem may exist in the latest kernel versions.
> > >
> > > Environment:
> > > * **Architecture:** arm64
> > > * **Page Size:** 64KB
> > > * **Filesystem:** XFS with a 4KB block size
> > >
> > > Scenario:
> > > The issue occurs while running a MySQL instance where one thread appends data
> > > to a log file, and a separate thread concurrently reads that file to perform
> > > CRC checks on its contents.
> > >
> > > Problem Description:
> > > Occasionally, the reading thread detects data corruption. Specifically, it finds
> > > that stale data has been exposed in the middle of the file.
> > >
> > > We have captured four instances of this corruption in our production environment.
> > > In each case, we observed a distinct pattern:
> > > The corruption starts at an offset that aligns with the beginning of an XFS extent.
> > > The corruption ends at an offset that is aligned to the system's `PAGE_SIZE` (64KB in our case).
> > >
> > > Corruption Instances:
> > > 1. Start:`0x73be000`, **End:** `0x73c0000` (Length: 8KB)
> > > 2. Start:`0x10791a000`, **End:** `0x107920000` (Length: 24KB)
> > > 3. Start:`0x14535a000`, **End:** `0x145b70000` (Length: 8280KB)
> > > 4. Start:`0x370d000`, **End:** `0x3710000` (Length: 12KB)
> > >
> > > After analysis, we believe the root cause is in the handling of chained bios, specifically
> > > related to out-of-order io completion.
> > >
> > > Consider a bio chain where `bi_remaining` is decremented as each bio in the chain completes.
> > > For example,
> > > if a chain consists of three bios (bio1 -> bio2 -> bio3) with
> > > bi_remaining count:
> > > 1->2->2
> >
> > Right.
> >
> > > if the bio completes in the reverse order, there will be a problem.
> > > if bio 3 completes first, it will become:
> > > 1->2->1
> >
> > Yes.
> >
> > > then bio 2 completes:
> > > 1->1->0
Currently, bio_chain_endio() does not actually decrement bio 2's
__bi_remaining; it calls bio_put(bio 2) and bio_endio(parent), which
leads to 1->2->0. And when bio 1 completes, bio 2 won't exist anymore.
> > No, it is supposed to be 1->1->1.
> >
> > When bio 1 completes, it will become 0->0->0
> >
> > bio3's `__bi_remaining` won't drop to zero until bio2's reaches
> > zero, and bio2 won't be done until bio1 is ended.
> >
> > Please look at bio_endio():
> >
> > void bio_endio(struct bio *bio)
> > {
> > again:
> > if (!bio_remaining_done(bio))
> > return;
> > ...
> > if (bio->bi_end_io == bio_chain_endio) {
> > bio = __bio_chain_endio(bio);
> > goto again;
> > }
> > ...
> > }
> >
>
> Exactly, bio_endio handle the process perfectly, but it seems to forget
> to check if the very first `__bi_remaining` drops to zero and proceeds to
> the next bio:
> -----
> static struct bio *__bio_chain_endio(struct bio *bio)
> {
> struct bio *parent = bio->bi_private;
>
> if (bio->bi_status && !parent->bi_status)
> parent->bi_status = bio->bi_status;
> bio_put(bio);
> return parent;
> }
>
> static void bio_chain_endio(struct bio *bio)
> {
> bio_endio(__bio_chain_endio(bio));
> }
> ----
This bug could be fixed as follows:

static void bio_chain_endio(struct bio *bio)
{
+	if (!bio_remaining_done(bio))
+		return;
	bio_endio(__bio_chain_endio(bio));
}
but bio_endio() already does all that, so bio_chain_endio() might just
as well call bio_endio(bio) instead.
Thanks,
Andreas
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-22 6:42 ` Stephen Zhang
2025-11-22 7:46 ` Andreas Gruenbacher
@ 2025-11-22 12:01 ` Ming Lei
2025-11-22 14:56 ` Andreas Gruenbacher
1 sibling, 1 reply; 40+ messages in thread
From: Ming Lei @ 2025-11-22 12:01 UTC (permalink / raw)
To: Stephen Zhang
Cc: linux-kernel, linux-block, nvdimm, virtualization, linux-nvme,
gfs2, ntfs3, linux-xfs, zhangshida
On Sat, Nov 22, 2025 at 02:42:43PM +0800, Stephen Zhang wrote:
> On Sat, Nov 22, 2025 at 11:35 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Fri, Nov 21, 2025 at 04:17:39PM +0800, zhangshida wrote:
> > > From: Shida Zhang <zhangshida@kylinos.cn>
> > >
> > > Hello everyone,
> > >
> > > We have recently encountered a severe data loss issue on kernel version 4.19,
> > > and we suspect the same underlying problem may exist in the latest kernel versions.
> > >
> > > Environment:
> > > * **Architecture:** arm64
> > > * **Page Size:** 64KB
> > > * **Filesystem:** XFS with a 4KB block size
> > >
> > > Scenario:
> > > The issue occurs while running a MySQL instance where one thread appends data
> > > to a log file, and a separate thread concurrently reads that file to perform
> > > CRC checks on its contents.
> > >
> > > Problem Description:
> > > Occasionally, the reading thread detects data corruption. Specifically, it finds
> > > that stale data has been exposed in the middle of the file.
> > >
> > > We have captured four instances of this corruption in our production environment.
> > > In each case, we observed a distinct pattern:
> > > The corruption starts at an offset that aligns with the beginning of an XFS extent.
> > > The corruption ends at an offset that is aligned to the system's `PAGE_SIZE` (64KB in our case).
> > >
> > > Corruption Instances:
> > > 1. Start:`0x73be000`, **End:** `0x73c0000` (Length: 8KB)
> > > 2. Start:`0x10791a000`, **End:** `0x107920000` (Length: 24KB)
> > > 3. Start:`0x14535a000`, **End:** `0x145b70000` (Length: 8280KB)
> > > 4. Start:`0x370d000`, **End:** `0x3710000` (Length: 12KB)
> > >
> > > After analysis, we believe the root cause is in the handling of chained bios, specifically
> > > related to out-of-order io completion.
> > >
> > > Consider a bio chain where `bi_remaining` is decremented as each bio in the chain completes.
> > > For example,
> > > if a chain consists of three bios (bio1 -> bio2 -> bio3) with
> > > bi_remaining count:
> > > 1->2->2
> >
> > Right.
> >
> > > if the bio completes in the reverse order, there will be a problem.
> > > if bio 3 completes first, it will become:
> > > 1->2->1
> >
> > Yes.
> >
> > > then bio 2 completes:
> > > 1->1->0
> >
> > No, it is supposed to be 1->1->1.
> >
> > When bio 1 completes, it will become 0->0->0
> >
> > bio3's `__bi_remaining` won't drop to zero until bio2's reaches
> > zero, and bio2 won't be done until bio1 is ended.
> >
> > Please look at bio_endio():
> >
> > void bio_endio(struct bio *bio)
> > {
> > again:
> > if (!bio_remaining_done(bio))
> > return;
> > ...
> > if (bio->bi_end_io == bio_chain_endio) {
> > bio = __bio_chain_endio(bio);
> > goto again;
> > }
> > ...
> > }
> >
>
> Exactly, bio_endio handle the process perfectly, but it seems to forget
> to check if the very first `__bi_remaining` drops to zero and proceeds to
> the next bio:
> -----
> static struct bio *__bio_chain_endio(struct bio *bio)
> {
> struct bio *parent = bio->bi_private;
>
> if (bio->bi_status && !parent->bi_status)
> parent->bi_status = bio->bi_status;
> bio_put(bio);
> return parent;
> }
>
> static void bio_chain_endio(struct bio *bio)
> {
> bio_endio(__bio_chain_endio(bio));
> }
bio_chain_endio() never really gets called; it can be thought of as a `flag`,
and it should have been defined as `WARN_ON_ONCE(1);` so as not to confuse people.
So I don't think the upstream kernel has the issue you described.
Thanks,
Ming
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-22 12:01 ` Ming Lei
@ 2025-11-22 14:56 ` Andreas Gruenbacher
2025-11-23 3:14 ` Stephen Zhang
2025-11-23 13:48 ` Ming Lei
0 siblings, 2 replies; 40+ messages in thread
From: Andreas Gruenbacher @ 2025-11-22 14:56 UTC (permalink / raw)
To: Ming Lei
Cc: Stephen Zhang, linux-kernel, linux-block, nvdimm, virtualization,
linux-nvme, gfs2, ntfs3, linux-xfs, zhangshida
On Sat, Nov 22, 2025 at 1:07 PM Ming Lei <ming.lei@redhat.com> wrote:
> > static void bio_chain_endio(struct bio *bio)
> > {
> > bio_endio(__bio_chain_endio(bio));
> > }
>
> bio_chain_endio() never gets called really, which can be thought as `flag`,
That's probably where this stops being relevant for the problem
reported by Stephen Zhang.
> and it should have been defined as `WARN_ON_ONCE(1);` for not confusing people.
But shouldn't bio_chain_endio() still be fixed to do the right thing
if called directly, or alternatively, just BUG()? Warning and still
doing the wrong thing seems a bit bizarre.
I also see direct bi_end_io calls in erofs_fileio_ki_complete(),
erofs_fscache_bio_endio(), and erofs_fscache_submit_bio(), so those
are at least confusing.
Thanks,
Andreas
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-22 14:56 ` Andreas Gruenbacher
@ 2025-11-23 3:14 ` Stephen Zhang
2025-11-23 13:48 ` Ming Lei
1 sibling, 0 replies; 40+ messages in thread
From: Stephen Zhang @ 2025-11-23 3:14 UTC (permalink / raw)
To: Andreas Gruenbacher
Cc: Ming Lei, linux-kernel, linux-block, nvdimm, virtualization,
linux-nvme, gfs2, ntfs3, linux-xfs, zhangshida, linux-bcache
On Sat, Nov 22, 2025 at 10:57 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> On Sat, Nov 22, 2025 at 1:07 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > static void bio_chain_endio(struct bio *bio)
> > > {
> > > bio_endio(__bio_chain_endio(bio));
> > > }
> >
> > bio_chain_endio() never gets called really, which can be thought as `flag`,
>
> That's probably where this stops being relevant for the problem
> reported by Stephen Zhang.
>
> > and it should have been defined as `WARN_ON_ONCE(1);` for not confusing people.
>
> But shouldn't bio_chain_endio() still be fixed to do the right thing
> if called directly, or alternatively, just BUG()? Warning and still
> doing the wrong thing seems a bit bizarre.
>
> I also see direct bi_end_io calls in erofs_fileio_ki_complete(),
> erofs_fscache_bio_endio(), and erofs_fscache_submit_bio(), so those
> are at least confusing.
>
> Thanks,
> Andreas
>
Thank you, Ming and Andreas, for the detailed explanation. I will
remember to add a `WARN()`/`BUG()` in `bio_chain_endio()`.
Following that discussion, I have identified a similarly suspicious
call in the bcache driver.
Location: `drivers/md/bcache/request.c`
```c
static void detached_dev_do_request(struct bcache_device *d, struct bio *bio,
		struct block_device *orig_bdev,
		unsigned long start_time)
{
	struct detached_dev_io_private *ddip;
	struct cached_dev *dc = container_of(d, struct cached_dev, disk);

	/*
	 * no need to call closure_get(&dc->disk.cl),
	 * because upper layer had already opened bcache device,
	 * which would call closure_get(&dc->disk.cl)
	 */
	ddip = kzalloc(sizeof(struct detached_dev_io_private), GFP_NOIO);
	if (!ddip) {
		bio->bi_status = BLK_STS_RESOURCE;
		bio->bi_end_io(bio);	// <-- POTENTIAL ISSUE
		return;
	}
	// ...
}
```
Scenario Description:
1. A chained bio is created in the block layer.
2. This bio is intercepted by the bcache layer to be routed to the appropriate
backing disk.
3. The code path determines that the backing device is in a detached state,
leading to a call to `detached_dev_do_request()` to handle the I/O.
4. The memory allocation for the `ddip` structure fails.
5. In the error path, the function directly invokes `bio->bi_end_io(bio)`.
The Problem:
For a bio that is part of a chain, the `bi_end_io` function is likely set to
`bio_chain_endio`. Directly calling it in this context would corrupt the
`bi_remaining` reference count, exactly as described in our previous
discussion.
Is this a valid theoretical scenario?
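For what it's worth, here is a self-contained user-space sketch of that
scenario, using the same simplified `mbio` counters as in the model earlier in
the thread (redefined here so it compiles on its own); `direct_chain_endio()`
is a made-up stand-in that only models the suspected accounting problem, not
the real bcache or block-layer code:

```c
#include <stdio.h>

struct mbio {
	const char *name;
	int remaining;		/* models __bi_remaining               */
	struct mbio *parent;	/* models bi_private of a chained bio  */
	int chained;		/* models bi_end_io == bio_chain_endio */
};

/* models bio_endio(): walk to the parent only once remaining hits zero */
static void endio(struct mbio *b)
{
again:
	if (--b->remaining != 0)
		return;
	printf("%s reports completion\n", b->name);
	if (b->chained) {
		b = b->parent;
		goto again;
	}
}

/* models a direct ->bi_end_io() call on a chained bio: the remaining
 * check on the bio itself is skipped, it is "put", and the parent is
 * ended immediately */
static void direct_chain_endio(struct mbio *b)
{
	printf("%s ended directly, parent %s dropped one remaining ref\n",
	       b->name, b->parent->name);
	endio(b->parent);
}

int main(void)
{
	struct mbio bio3 = { "bio3", 2, NULL,  0 };
	struct mbio bio2 = { "bio2", 2, &bio3, 1 };
	struct mbio bio1 = { "bio1", 1, &bio2, 1 };

	direct_chain_endio(&bio2);	/* the ddip allocation-failure path */
	endio(&bio3);			/* bio3's own I/O completes: bio3   */
					/* reports completion although bio1 */
					/* is still in flight               */
	(void)bio1;
	return 0;
}
```

In this model the final parent reports completion while bio1's I/O is still
outstanding, which is exactly the kind of premature completion that could
expose stale data; bio1's later completion would then also walk into the
already-freed bio2.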
And there is another call:
```c
static void detached_dev_end_io(struct bio *bio)
{
	struct detached_dev_io_private *ddip;

	ddip = bio->bi_private;
	bio->bi_end_io = ddip->bi_end_io;
	bio->bi_private = ddip->bi_private;

	/* Count on the bcache device */
	bio_end_io_acct_remapped(bio, ddip->start_time, ddip->orig_bdev);

	if (bio->bi_status) {
		struct cached_dev *dc = container_of(ddip->d,
				struct cached_dev, disk);

		/* should count I/O error for backing device here */
		bch_count_backing_io_errors(dc, bio);
	}

	kfree(ddip);
	bio->bi_end_io(bio);
}
```
Would you mind if I add the bcache list to the discussion?
[Adding the bcache list to the CC]
Thanks,
Shida
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-22 14:56 ` Andreas Gruenbacher
2025-11-23 3:14 ` Stephen Zhang
@ 2025-11-23 13:48 ` Ming Lei
2025-11-24 1:28 ` Stephen Zhang
1 sibling, 1 reply; 40+ messages in thread
From: Ming Lei @ 2025-11-23 13:48 UTC (permalink / raw)
To: Andreas Gruenbacher
Cc: Stephen Zhang, linux-kernel, linux-block, nvdimm, virtualization,
linux-nvme, gfs2, ntfs3, linux-xfs, zhangshida, Coly Li,
linux-bcache
On Sat, Nov 22, 2025 at 03:56:58PM +0100, Andreas Gruenbacher wrote:
> On Sat, Nov 22, 2025 at 1:07 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > static void bio_chain_endio(struct bio *bio)
> > > {
> > > bio_endio(__bio_chain_endio(bio));
> > > }
> >
> > bio_chain_endio() never gets called really, which can be thought as `flag`,
>
> That's probably where this stops being relevant for the problem
> reported by Stephen Zhang.
>
> > and it should have been defined as `WARN_ON_ONCE(1);` for not confusing people.
>
> But shouldn't bio_chain_endio() still be fixed to do the right thing
> if called directly, or alternatively, just BUG()? Warning and still
> doing the wrong thing seems a bit bizarre.
IMO calling ->bi_end_io() directly shouldn't be encouraged.
The only in-tree direct-call user could be bcache, so was this reported
issue triggered on bcache?
If bcache can't call bio_endio(), I think it is fine to fix
bio_chain_endio().
>
> I also see direct bi_end_io calls in erofs_fileio_ki_complete(),
> erofs_fscache_bio_endio(), and erofs_fscache_submit_bio(), so those
> are at least confusing.
Those all look like FS bios (non-chained), so bio_chain_endio() shouldn't be
involved in the erofs code base.
Thanks,
Ming
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-23 13:48 ` Ming Lei
@ 2025-11-24 1:28 ` Stephen Zhang
2025-11-24 2:00 ` Stephen Zhang
0 siblings, 1 reply; 40+ messages in thread
From: Stephen Zhang @ 2025-11-24 1:28 UTC (permalink / raw)
To: Ming Lei
Cc: Andreas Gruenbacher, linux-kernel, linux-block, nvdimm,
virtualization, linux-nvme, gfs2, ntfs3, linux-xfs, zhangshida,
Coly Li, linux-bcache
On Sun, Nov 23, 2025 at 9:49 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Sat, Nov 22, 2025 at 03:56:58PM +0100, Andreas Gruenbacher wrote:
> > On Sat, Nov 22, 2025 at 1:07 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > static void bio_chain_endio(struct bio *bio)
> > > > {
> > > > bio_endio(__bio_chain_endio(bio));
> > > > }
> > >
> > > bio_chain_endio() never gets called really, which can be thought as `flag`,
> >
> > That's probably where this stops being relevant for the problem
> > reported by Stephen Zhang.
> >
> > > and it should have been defined as `WARN_ON_ONCE(1);` for not confusing people.
> >
> > But shouldn't bio_chain_endio() still be fixed to do the right thing
> > if called directly, or alternatively, just BUG()? Warning and still
> > doing the wrong thing seems a bit bizarre.
>
> IMO calling ->bi_end_io() directly shouldn't be encouraged.
>
> The only in-tree direct call user could be bcache, so is this reported
> issue triggered on bcache?
>
> If bcache can't call bio_endio(), I think it is fine to fix
> bio_chain_endio().
>
> >
> > I also see direct bi_end_io calls in erofs_fileio_ki_complete(),
> > erofs_fscache_bio_endio(), and erofs_fscache_submit_bio(), so those
> > are at least confusing.
>
> All looks FS bio(non-chained), so bio_chain_endio() shouldn't be involved
> in erofs code base.
>
Okay, will add that.
Thanks,
Shida
>
> Thanks,
> Ming
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
2025-11-24 1:28 ` Stephen Zhang
@ 2025-11-24 2:00 ` Stephen Zhang
0 siblings, 0 replies; 40+ messages in thread
From: Stephen Zhang @ 2025-11-24 2:00 UTC (permalink / raw)
To: Ming Lei
Cc: Andreas Gruenbacher, linux-kernel, linux-block, nvdimm,
virtualization, linux-nvme, gfs2, ntfs3, linux-xfs, zhangshida,
Coly Li, linux-bcache
On Mon, Nov 24, 2025 at 9:28 AM Stephen Zhang <starzhangzsd@gmail.com> wrote:
>
> On Sun, Nov 23, 2025 at 9:49 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Sat, Nov 22, 2025 at 03:56:58PM +0100, Andreas Gruenbacher wrote:
> > > On Sat, Nov 22, 2025 at 1:07 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > static void bio_chain_endio(struct bio *bio)
> > > > > {
> > > > > bio_endio(__bio_chain_endio(bio));
> > > > > }
> > > >
> > > > bio_chain_endio() never gets called really, which can be thought as `flag`,
> > >
> > > That's probably where this stops being relevant for the problem
> > > reported by Stephen Zhang.
> > >
> > > > and it should have been defined as `WARN_ON_ONCE(1);` for not confusing people.
> > >
> > > But shouldn't bio_chain_endio() still be fixed to do the right thing
> > > if called directly, or alternatively, just BUG()? Warning and still
> > > doing the wrong thing seems a bit bizarre.
> >
> > IMO calling ->bi_end_io() directly shouldn't be encouraged.
> >
> > The only in-tree direct call user could be bcache, so is this reported
> > issue triggered on bcache?
> >
I need to confirm the details later. However, let's assume our analysis provides
a theoretical model that explains all the observed phenomena without any
inconsistencies, and that we have a real-world problem that exhibits exactly
those same phenomena.
In such a scenario, the chances that our analysis is incorrect are very low.
Even if bcache turns out not to be part of the running configuration, our later
investigation will revolve around that analysis.
Therefore, what I want to explore further is: can this analysis really hold up
and perfectly explain everything without inconsistencies, assuming we introduce
as much complexity into the runtime configuration as possible?
Thanks,
Shida
> > If bcache can't call bio_endio(), I think it is fine to fix
> > bio_chain_endio().
> >
> > >
> > > I also see direct bi_end_io calls in erofs_fileio_ki_complete(),
> > > erofs_fscache_bio_endio(), and erofs_fscache_submit_bio(), so those
> > > are at least confusing.
> >
> > All looks FS bio(non-chained), so bio_chain_endio() shouldn't be involved
> > in erofs code base.
> >
>
> Okay, will add that.
>
> Thanks,
> Shida
>
> >
> > Thanks,
> > Ming
> >
^ permalink raw reply [flat|nested] 40+ messages in thread