ext4 data corruption in 6.1 stable tree (was Re: [PATCH 5.15 000/297] 5.15.140-rc1 review)

public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed

* ext4 data corruption in 6.1 stable tree (was Re: [PATCH 5.15 000/297] 5.15.140-rc1 review)
       [not found]     ` <CAEUSe7_PUdRgJpY36jZxy84CbNX5TTnynqU8derf0ZBSDtUOqw@mail.gmail.com>
@ 2023-12-05 12:21       ` Jan Kara
  2023-12-05 17:55         ` Guenter Roeck
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2023-12-05 12:21 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jan Kara, Daniel Díaz, stable, patches, linux-kernel,
	torvalds, akpm, linux, shuah, patches, lkft-triage, pavel,
	jonathanh, f.fainelli, sudipm.mukherjee, srw, rwarsow, conor,
	chrubis, linux-ext4, Ted Tso

Hello!

On Mon 27-11-23 11:32:12, Daniel Díaz wrote:
> On Mon, 27 Nov 2023 at 09:56, Jan Kara <jack@suse.cz> wrote:
> > On Fri 24-11-23 23:45:09, Daniel Díaz wrote:
> > > On 24/11/23 11:50 a. m., Greg Kroah-Hartman wrote:
> > > > This is the start of the stable review cycle for the 5.15.140 release.
> > > > There are 297 patches in this series, all will be posted as a response
> > > > to this one.  If anyone has any issues with these being applied, please
> > > > let me know.
> > > >
> > > > Responses should be made by Sun, 26 Nov 2023 17:19:17 +0000.
> > > > Anything received after that time might be too late.
> > > >
> > > > The whole patch series can be found in one patch at:
> > > >     https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.140-rc1.gz
> > > > or in the git tree and branch at:
> > > >     git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > > > and the diffstat can be found below.
> > > >
> > > > thanks,
> > > >
> > > > greg k-h
> > >
> > > We are noticing a regression with ltp-syscalls' preadv03:
> >
> > Thanks for report!
> >
> > > -----8<-----
> > >   preadv03 preadv03
> > >   preadv03_64 preadv03_64
> > >   preadv03.c:102: TINFO: Using block size 512
> > >   preadv03.c:87: TPASS: preadv(O_DIRECT) read 512 bytes successfully with content 'a' expectedly
> > >   preadv03.c:87: TPASS: preadv(O_DIRECT) read 512 bytes successfully with content 'a' expectedly
> > >   preadv03.c:87: TPASS: preadv(O_DIRECT) read 512 bytes successfully with content 'b' expectedly
> > >   preadv03.c:102: TINFO: Using block size 512
> > >   preadv03.c:77: TFAIL: Buffer wrong at 0 have 62 expected 61
> > >   preadv03.c:77: TFAIL: Buffer wrong at 0 have 62 expected 61
> > >   preadv03.c:66: TFAIL: preadv(O_DIRECT) read 0 bytes, expected 512
> > >   preadv03.c:102: TINFO: Using block size 512
> > >   preadv03.c:77: TFAIL: Buffer wrong at 0 have 62 expected 61
> > >   preadv03.c:77: TFAIL: Buffer wrong at 0 have 62 expected 61
> > >   preadv03.c:66: TFAIL: preadv(O_DIRECT) read 0 bytes, expected 512
> > >   preadv03.c:102: TINFO: Using block size 512
> > >   preadv03.c:87: TPASS: preadv(O_DIRECT) read 512 bytes successfully with content 'a' expectedly
> > >   preadv03.c:87: TPASS: preadv(O_DIRECT) read 512 bytes successfully with content 'a' expectedly
> > >   preadv03.c:87: TPASS: preadv(O_DIRECT) read 512 bytes successfully with content 'b' expectedly
> > >   preadv03.c:102: TINFO: Using block size 512
> > >   preadv03.c:77: TFAIL: Buffer wrong at 0 have 62 expected 61
> > >   preadv03.c:77: TFAIL: Buffer wrong at 0 have 62 expected 61
> > >   preadv03.c:66: TFAIL: preadv(O_DIRECT) read 0 bytes, expected 512
> > >   preadv03.c:102: TINFO: Using block size 512
> > >   preadv03.c:77: TFAIL: Buffer wrong at 0 have 62 expected 61
> > >   preadv03.c:77: TFAIL: Buffer wrong at 0 have 62 expected 61
> > >   preadv03.c:66: TFAIL: preadv(O_DIRECT) read 0 bytes, expected 512
> > > ----->8-----
> > >
> > > This is seen in the following environments:
> > > * dragonboard-845c
> > > * juno-64k_page_size
> > > * qemu-arm64
> > > * qemu-armv7
> > > * qemu-i386
> > > * qemu-x86_64
> > > * x86_64-clang
> > >
> > > and on the following RC's:
> > > * v5.10.202-rc1
> > > * v5.15.140-rc1
> > > * v6.1.64-rc1
> >
> > Hum, even in 6.1? That's odd. Can you please test whether current upstream
> > vanilla kernel works for you with this test? Thanks!
> 
> Yes, this is working for us on mainline and next:
>   https://qa-reports.linaro.org/lkft/linux-mainline-master/tests/ltp-syscalls/preadv03
>   https://qa-reports.linaro.org/lkft/linux-next-master/tests/ltp-syscalls/preadv03
> c.fr. 6.1:
>   https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/tests/ltp-syscalls/preadv03
> 
> Greetings!

So I've got back to this and the failure is a subtle interaction between
iomap code and ext4 code. In particular that fact that commit 936e114a245b6
("iomap: update ki_pos a little later in iomap_dio_complete") is not in
stable causes that file position is not updated after direct IO write and
thus we direct IO writes are ending in wrong locations effectively
corrupting data. The subtle detail is that before this commit if ->end_io
handler returns non-zero value (which the new ext4 ->end_io handler does),
file pos doesn't get updated, after this commit it doesn't get updated only
if the return value is < 0.

The commit got merged in 6.5-rc1 so all stable kernels that have
91562895f803 ("ext4: properly sync file size update after O_SYNC direct
IO") before 6.5 are corrupting data - I've noticed at least 6.1 is still
carrying the problematic commit. Greg, please take out the commit from all
stable kernels before 6.5 as soon as possible, we'll figure out proper
backport once user data are not being corrupted anymore. Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: ext4 data corruption in 6.1 stable tree (was Re: [PATCH 5.15 000/297] 5.15.140-rc1 review)
  2023-12-05 12:21       ` ext4 data corruption in 6.1 stable tree (was Re: [PATCH 5.15 000/297] 5.15.140-rc1 review) Jan Kara
@ 2023-12-05 17:55         ` Guenter Roeck
  2023-12-05 17:57           ` Greg Kroah-Hartman
  0 siblings, 1 reply; 5+ messages in thread
From: Guenter Roeck @ 2023-12-05 17:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: Greg Kroah-Hartman, Daniel Díaz, stable, patches,
	linux-kernel, torvalds, akpm, shuah, patches, lkft-triage, pavel,
	jonathanh, f.fainelli, sudipm.mukherjee, srw, rwarsow, conor,
	chrubis, linux-ext4, Ted Tso

On Tue, Dec 05, 2023 at 01:21:22PM +0100, Jan Kara wrote:
[ ... ]
> 
> So I've got back to this and the failure is a subtle interaction between
> iomap code and ext4 code. In particular that fact that commit 936e114a245b6
> ("iomap: update ki_pos a little later in iomap_dio_complete") is not in
> stable causes that file position is not updated after direct IO write and
> thus we direct IO writes are ending in wrong locations effectively
> corrupting data. The subtle detail is that before this commit if ->end_io
> handler returns non-zero value (which the new ext4 ->end_io handler does),
> file pos doesn't get updated, after this commit it doesn't get updated only
> if the return value is < 0.
> 
> The commit got merged in 6.5-rc1 so all stable kernels that have
> 91562895f803 ("ext4: properly sync file size update after O_SYNC direct
> IO") before 6.5 are corrupting data - I've noticed at least 6.1 is still
> carrying the problematic commit. Greg, please take out the commit from all
> stable kernels before 6.5 as soon as possible, we'll figure out proper
> backport once user data are not being corrupted anymore. Thanks!
> 

Thanks a lot for the update.

Turns out this is causing a regression in chromeos-6.1, and reverting the
offending patch fixes the problem. I suspect anyone running v6.1.64+ may
have a problem.

Guenter

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: ext4 data corruption in 6.1 stable tree (was Re: [PATCH 5.15 000/297] 5.15.140-rc1 review)
  2023-12-05 17:55         ` Guenter Roeck
@ 2023-12-05 17:57           ` Greg Kroah-Hartman
  2023-12-11  8:28             ` Pavel Machek
  0 siblings, 1 reply; 5+ messages in thread
From: Greg Kroah-Hartman @ 2023-12-05 17:57 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Jan Kara, Daniel Díaz, stable, patches, linux-kernel,
	torvalds, akpm, shuah, patches, lkft-triage, pavel, jonathanh,
	f.fainelli, sudipm.mukherjee, srw, rwarsow, conor, chrubis,
	linux-ext4, Ted Tso

On Tue, Dec 05, 2023 at 09:55:08AM -0800, Guenter Roeck wrote:
> On Tue, Dec 05, 2023 at 01:21:22PM +0100, Jan Kara wrote:
> [ ... ]
> > 
> > So I've got back to this and the failure is a subtle interaction between
> > iomap code and ext4 code. In particular that fact that commit 936e114a245b6
> > ("iomap: update ki_pos a little later in iomap_dio_complete") is not in
> > stable causes that file position is not updated after direct IO write and
> > thus we direct IO writes are ending in wrong locations effectively
> > corrupting data. The subtle detail is that before this commit if ->end_io
> > handler returns non-zero value (which the new ext4 ->end_io handler does),
> > file pos doesn't get updated, after this commit it doesn't get updated only
> > if the return value is < 0.
> > 
> > The commit got merged in 6.5-rc1 so all stable kernels that have
> > 91562895f803 ("ext4: properly sync file size update after O_SYNC direct
> > IO") before 6.5 are corrupting data - I've noticed at least 6.1 is still
> > carrying the problematic commit. Greg, please take out the commit from all
> > stable kernels before 6.5 as soon as possible, we'll figure out proper
> > backport once user data are not being corrupted anymore. Thanks!
> > 
> 
> Thanks a lot for the update.
> 
> Turns out this is causing a regression in chromeos-6.1, and reverting the
> offending patch fixes the problem. I suspect anyone running v6.1.64+ may
> have a problem.

Jan, thanks for the report, and Guenter, thanks for letting me know as
well.  I'll go queue up the fix now and push out new -rc releases.

greg k-h

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: ext4 data corruption in 6.1 stable tree (was Re: [PATCH 5.15 000/297] 5.15.140-rc1 review)
  2023-12-05 17:57           ` Greg Kroah-Hartman
@ 2023-12-11  8:28             ` Pavel Machek
  2023-12-11 11:58               ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Pavel Machek @ 2023-12-11  8:28 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Guenter Roeck, Jan Kara, Daniel Díaz, stable, patches,
	linux-kernel, torvalds, akpm, shuah, patches, lkft-triage, pavel,
	jonathanh, f.fainelli, sudipm.mukherjee, srw, rwarsow, conor,
	chrubis, linux-ext4, Ted Tso

[-- Attachment #1: Type: text/plain, Size: 1980 bytes --]

Hi!

> > > So I've got back to this and the failure is a subtle interaction between
> > > iomap code and ext4 code. In particular that fact that commit 936e114a245b6
> > > ("iomap: update ki_pos a little later in iomap_dio_complete") is not in
> > > stable causes that file position is not updated after direct IO write and
> > > thus we direct IO writes are ending in wrong locations effectively
> > > corrupting data. The subtle detail is that before this commit if ->end_io
> > > handler returns non-zero value (which the new ext4 ->end_io handler does),
> > > file pos doesn't get updated, after this commit it doesn't get updated only
> > > if the return value is < 0.
> > > 
> > > The commit got merged in 6.5-rc1 so all stable kernels that have
> > > 91562895f803 ("ext4: properly sync file size update after O_SYNC direct
> > > IO") before 6.5 are corrupting data - I've noticed at least 6.1 is still
> > > carrying the problematic commit. Greg, please take out the commit from all
> > > stable kernels before 6.5 as soon as possible, we'll figure out proper
> > > backport once user data are not being corrupted anymore. Thanks!
> > > 
> > 
> > Thanks a lot for the update.
> > 
> > Turns out this is causing a regression in chromeos-6.1, and reverting the
> > offending patch fixes the problem. I suspect anyone running v6.1.64+ may
> > have a problem.
> 
> Jan, thanks for the report, and Guenter, thanks for letting me know as
> well.  I'll go queue up the fix now and push out new -rc releases.

Would someone have a brief summary here? I see 6.1.66 is out but I
don't see any "Fixes: 91562895f803" tags.

Plus, what is the severity of this? It is "data being corrupted when
using O_SYNC|O_DIRECT" or does metadata somehow get corrupted, too?

Thanks and best regards,
								Pavel
-- 
DENX Software Engineering GmbH,        Managing Director: Erika Unter
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: ext4 data corruption in 6.1 stable tree (was Re: [PATCH 5.15 000/297] 5.15.140-rc1 review)
  2023-12-11  8:28             ` Pavel Machek
@ 2023-12-11 11:58               ` Jan Kara
  0 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2023-12-11 11:58 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Greg Kroah-Hartman, Guenter Roeck, Jan Kara, Daniel Díaz,
	stable, patches, linux-kernel, torvalds, akpm, shuah, patches,
	lkft-triage, jonathanh, f.fainelli, sudipm.mukherjee, srw,
	rwarsow, conor, chrubis, linux-ext4, Ted Tso

On Mon 11-12-23 09:28:40, Pavel Machek wrote:
> > > > So I've got back to this and the failure is a subtle interaction between
> > > > iomap code and ext4 code. In particular that fact that commit 936e114a245b6
> > > > ("iomap: update ki_pos a little later in iomap_dio_complete") is not in
> > > > stable causes that file position is not updated after direct IO write and
> > > > thus we direct IO writes are ending in wrong locations effectively
> > > > corrupting data. The subtle detail is that before this commit if ->end_io
> > > > handler returns non-zero value (which the new ext4 ->end_io handler does),
> > > > file pos doesn't get updated, after this commit it doesn't get updated only
> > > > if the return value is < 0.
> > > > 
> > > > The commit got merged in 6.5-rc1 so all stable kernels that have
> > > > 91562895f803 ("ext4: properly sync file size update after O_SYNC direct
> > > > IO") before 6.5 are corrupting data - I've noticed at least 6.1 is still
> > > > carrying the problematic commit. Greg, please take out the commit from all
> > > > stable kernels before 6.5 as soon as possible, we'll figure out proper
> > > > backport once user data are not being corrupted anymore. Thanks!
> > > > 
> > > 
> > > Thanks a lot for the update.
> > > 
> > > Turns out this is causing a regression in chromeos-6.1, and reverting the
> > > offending patch fixes the problem. I suspect anyone running v6.1.64+ may
> > > have a problem.
> > 
> > Jan, thanks for the report, and Guenter, thanks for letting me know as
> > well.  I'll go queue up the fix now and push out new -rc releases.
> 
> Would someone have a brief summary here? I see 6.1.66 is out but I
> don't see any "Fixes: 91562895f803" tags.
> 
> Plus, what is the severity of this? It is "data being corrupted when
> using O_SYNC|O_DIRECT" or does metadata somehow get corrupted, too?

It is pure data corruption happening for ext4 direct IO writes because they
do not properly update current file position after the write.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-12-11 11:58 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20231124172000.087816911@linuxfoundation.org>
     [not found] ` <81a11ebe-ea47-4e21-b5eb-536b1a723168@linaro.org>
     [not found]   ` <20231127155557.xv5ljrdxcfcigjfa@quack3>
     [not found]     ` <CAEUSe7_PUdRgJpY36jZxy84CbNX5TTnynqU8derf0ZBSDtUOqw@mail.gmail.com>
2023-12-05 12:21       ` ext4 data corruption in 6.1 stable tree (was Re: [PATCH 5.15 000/297] 5.15.140-rc1 review) Jan Kara
2023-12-05 17:55         ` Guenter Roeck
2023-12-05 17:57           ` Greg Kroah-Hartman
2023-12-11  8:28             ` Pavel Machek
2023-12-11 11:58               ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox