public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed
* Bug in commit aa511ff8218b ("badblocks: switch to the improved badblock handling
@ 2023-12-22 18:31 Ira Weiny
  2023-12-22 18:57 ` Ira Weiny
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Ira Weiny @ 2023-12-22 18:31 UTC (permalink / raw)
  To: Coly Li
  Cc: Dan Williams, Jens Axboe, Xiao Ni, Geliang Tang, Hannes Reinecke,
	NeilBrown, Vishal L Verma, linux-block, nvdimm, linux-kernel

Coly,

Yesterday I noticed that a few of our nvdimm tests were failing.  I bisected
the problem to the following commit.

aa511ff8218b ("badblocks: switch to the improved badblock handling code") 

Reverting this patch fixed our tests.

I've also dug into the code a bit and I believe the algorithm for
badblocks_check() is broken (not yet sure about the other calls).  At the
very least I see the bb->p pointer being indexed with '-1'.  :-(

I did notice that this work was due to a bug report in badblock_set().
Therefore, I'm not sure of that severity of that fix is vs a revert.  But
at this point I'm not seeing an easy fix so I'm in favor of a revert.

Thanks,
Ira

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Bug in commit aa511ff8218b ("badblocks: switch to the improved badblock handling
  2023-12-22 18:31 Bug in commit aa511ff8218b ("badblocks: switch to the improved badblock handling Ira Weiny
@ 2023-12-22 18:57 ` Ira Weiny
  2023-12-23  0:24   ` Ira Weiny
  2023-12-23  6:52 ` Coly Li
  2023-12-23  8:35 ` Linux regression tracking #adding (Thorsten Leemhuis)
  2 siblings, 1 reply; 9+ messages in thread
From: Ira Weiny @ 2023-12-22 18:57 UTC (permalink / raw)
  To: Ira Weiny, Coly Li
  Cc: Dan Williams, Jens Axboe, Xiao Ni, Geliang Tang, Hannes Reinecke,
	NeilBrown, Vishal L Verma, linux-block, nvdimm, linux-kernel

Ira Weiny wrote:
> Coly,
> 
> Yesterday I noticed that a few of our nvdimm tests were failing.  I bisected
> the problem to the following commit.
> 
> aa511ff8218b ("badblocks: switch to the improved badblock handling code") 
> 
> Reverting this patch fixed our tests.
> 
> I've also dug into the code a bit and I believe the algorithm for
> badblocks_check() is broken (not yet sure about the other calls).  At the
> very least I see the bb->p pointer being indexed with '-1'.  :-(
> 
> I did notice that this work was due to a bug report in badblock_set().
> Therefore, I'm not sure of that severity of that fix is vs a revert.  But
> at this point I'm not seeing an easy fix so I'm in favor of a revert.
> 

Dan and I were discussing this and it occurs to us that it may be easy for
you to stand up the test environment I'm using.

For CXL we have a run_qemu.sh project[1] which stands up a qemu
environment with the ndctl[2] tests in them.  Clone ndctl to ~/git/ndctl
so run_qemu.sh can find it.  Then start run_qemu.sh in a kernel tree like
this:

$ <path_to_run_qemu>/run_qemu.sh --cxl --nfit-test --nfit-debug [-r img]

[-r img] is optional but useful if you have changed the ndctl tests.

Once booted you can run the test suite with meson:

$ cd ndctl && meson test -C build

I've been running just our clear.sh test which shows the error.

$ cd ndctl/build && meson test clear.sh

Hope this helps,
Ira

[1] https://github.com/pmem/run_qemu
[2] https://github.com/pmem/ndctl

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Bug in commit aa511ff8218b ("badblocks: switch to the improved badblock handling
  2023-12-22 18:57 ` Ira Weiny
@ 2023-12-23  0:24   ` Ira Weiny
  2023-12-23  9:39     ` Coly Li
  0 siblings, 1 reply; 9+ messages in thread
From: Ira Weiny @ 2023-12-23  0:24 UTC (permalink / raw)
  To: Ira Weiny, Coly Li
  Cc: Dan Williams, Jens Axboe, Xiao Ni, Geliang Tang, Hannes Reinecke,
	NeilBrown, Vishal L Verma, linux-block, nvdimm, linux-kernel

Ira Weiny wrote:
> Ira Weiny wrote:
> > Coly,
> > 
> > Yesterday I noticed that a few of our nvdimm tests were failing.  I bisected
> > the problem to the following commit.
> > 
> > aa511ff8218b ("badblocks: switch to the improved badblock handling code") 
> > 
> > Reverting this patch fixed our tests.
> > 

[snip]

I added some prints[1] to try and see what is happening.  Perhaps this will
help you.

...
[   99.919237] IKW set_badblock 00000000aa44c55d 8000 1
[   99.921448] IKW set_badblock 00000000aa44c55d 8001 1
[   99.924051] IKW set_badblock 00000000aa44c55d 8002 1
[   99.926135] IKW set_badblock 00000000aa44c55d 8003 1
[   99.928516] IKW set_badblock 00000000aa44c55d 8004 1
[   99.930491] IKW set_badblock 00000000aa44c55d 8005 1
[   99.932894] IKW set_badblock 00000000aa44c55d 8006 1
[   99.936638] IKW set_badblock 00000000aa44c55d 8007 1
[  100.999297] IKW _badblocks_check() 00000000aa44c55d s 8000 num 1
[  101.000027]    IKW table count 1 shift 0
[  101.000644]    IKW 0: off 8000 end 8008
[  101.001271]    IKW prev 0, cnt 1
[  101.002481]    IKW start 8000, len 1
[  101.003464]    IKW front overlap 0
[  101.004256] IKW rv 1

...            ^^^^^^^^^
<This is a valid failure as part of the test>
...

[  101.148783] IKW set_badblock 00000000721b4f3d 8000 1
[  101.150629] IKW set_badblock 00000000721b4f3d 8001 1
[  101.152315] IKW set_badblock 00000000721b4f3d 8002 1
[  101.154544] IKW set_badblock 00000000721b4f3d 8003 1
[  101.156238] IKW set_badblock 00000000721b4f3d 8004 1
[  101.158310] IKW set_badblock 00000000721b4f3d 8005 1
[  101.160196] IKW set_badblock 00000000721b4f3d 8006 1
[  101.162158] IKW set_badblock 00000000721b4f3d 8007 1
[  101.163543] IKW _badblocks_check() 00000000721b4f3d s 0 num 8
[  101.164427]    IKW table count 1 shift 0
[  101.165310]    IKW 0: off 8000 end 8008
[  101.166398]    IKW prev -1, cnt 1
[  101.167178]    IKW start 0, len 8
[  101.168107] IKW rv 0
[  101.168858] IKW _badblocks_check() 00000000721b4f3d s 8 num 8
[  101.169814]    IKW table count 1 shift 0
[  101.170547]    IKW 0: off 8000 end 8008
[  101.171238]    IKW prev -1, cnt 1
[  101.171985]    IKW start 8, len 8
[  101.173007]    IKW front overlap -1  <== this is prev which is used to index bb->pages
[  101.174157]    IKW prev -1, cnt 1
[  101.175268]    IKW start 9, len 7
[  101.176557] IKW rv -1

...            ^^^^^^^^^
This is where the failure occurs.
...

I think overlap_front() is not working correctly in this case.  And from
my reading of the code I don't know how it would.  But overlap_front() is
used elsewhere and I'm not confident in making the change.

Hope this helps,
Ira


[1]
diff --git a/block/badblocks.c b/block/badblocks.c
index fc92d4e18aa3..21e22ee576e5 100644
--- a/block/badblocks.c
+++ b/block/badblocks.c
@@ -1280,6 +1280,16 @@ static int _badblocks_check(struct badblocks *bb, sector_t s, int sectors,
 	unsigned int seq;
 	int len, rv;
 	u64 *p;
+	int i;
+
+	printk(KERN_CRIT "IKW %s() %p s %llx num %d\n", __func__,
+		bb, s, sectors);
+
+	printk(KERN_CRIT "   IKW table count %d shift %d\n", bb->count, bb->shift);
+	for (i = 0; i < bb->count; i++) {
+		printk(KERN_CRIT "   IKW %d: off %llx end %llx\n", i,
+			BB_OFFSET(bb->page[i]), BB_END(bb->page[i]));
+	}
 
 	WARN_ON(bb->shift < 0 || sectors == 0);
 
@@ -1311,6 +1321,9 @@ static int _badblocks_check(struct badblocks *bb, sector_t s, int sectors,
 
 	prev = prev_badblocks(bb, &bad, hint);
 
+	printk(KERN_CRIT "   IKW prev %d, cnt %d\n", prev, bb->count);
+	printk(KERN_CRIT "   IKW start %llx, len %llx\n", bad.start, bad.len);
+
 	/* start after all badblocks */
 	if ((prev + 1) >= bb->count && !overlap_front(bb, prev, &bad)) {
 		len = sectors;
@@ -1318,6 +1331,7 @@ static int _badblocks_check(struct badblocks *bb, sector_t s, int sectors,
 	}
 
 	if (overlap_front(bb, prev, &bad)) {
+		printk(KERN_CRIT "   IKW front overlap %d\n", prev);
 		if (BB_ACK(p[prev]))
 			acked_badblocks++;
 		else
@@ -1365,6 +1379,7 @@ static int _badblocks_check(struct badblocks *bb, sector_t s, int sectors,
 	if (read_seqretry(&bb->lock, seq))
 		goto retry;
 
+	printk(KERN_CRIT "IKW rv %d\n", rv);
 	return rv;
 }
 
diff --git a/drivers/nvdimm/badrange.c b/drivers/nvdimm/badrange.c
index a002ea6fdd84..93ffd189bc75 100644
--- a/drivers/nvdimm/badrange.c
+++ b/drivers/nvdimm/badrange.c
@@ -167,6 +167,7 @@ static void set_badblock(struct badblocks *bb, sector_t s, int num)
 	dev_dbg(bb->dev, "Found a bad range (0x%llx, 0x%llx)\n",
 			(u64) s * 512, (u64) num * 512);
 	/* this isn't an error as the hardware will still throw an exception */
+	printk(KERN_CRIT "IKW %s %p %llx %x\n", __func__, bb, s, num);
 	if (badblocks_set(bb, s, num, 1))
 		dev_info_once(bb->dev, "%s: failed for sector %llx\n",
 				__func__, (u64) s);

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Bug in commit aa511ff8218b ("badblocks: switch to the improved badblock handling
  2023-12-22 18:31 Bug in commit aa511ff8218b ("badblocks: switch to the improved badblock handling Ira Weiny
  2023-12-22 18:57 ` Ira Weiny
@ 2023-12-23  6:52 ` Coly Li
  2023-12-23  8:35 ` Linux regression tracking #adding (Thorsten Leemhuis)
  2 siblings, 0 replies; 9+ messages in thread
From: Coly Li @ 2023-12-23  6:52 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Jens Axboe, Xiao Ni, Geliang Tang, Hannes Reinecke,
	NeilBrown, Vishal L Verma, linux-block, nvdimm, linux-kernel



> 2023年12月23日 02:31,Ira Weiny <ira.weiny@intel.com> 写道:
> 
> Coly,
> 
> Yesterday I noticed that a few of our nvdimm tests were failing.  I bisected
> the problem to the following commit.
> 
> aa511ff8218b ("badblocks: switch to the improved badblock handling code") 
> 
> Reverting this patch fixed our tests.
> 
> I've also dug into the code a bit and I believe the algorithm for
> badblocks_check() is broken (not yet sure about the other calls).  At the
> very least I see the bb->p pointer being indexed with '-1'.  :-(
> 
> I did notice that this work was due to a bug report in badblock_set().
> Therefore, I'm not sure of that severity of that fix is vs a revert.  But
> at this point I'm not seeing an easy fix so I'm in favor of a revert.

Hi Ira,

Thanks for the information reported. Let me look into the situation now.

Coly Li


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Bug in commit aa511ff8218b ("badblocks: switch to the improved badblock handling
  2023-12-22 18:31 Bug in commit aa511ff8218b ("badblocks: switch to the improved badblock handling Ira Weiny
  2023-12-22 18:57 ` Ira Weiny
  2023-12-23  6:52 ` Coly Li
@ 2023-12-23  8:35 ` Linux regression tracking #adding (Thorsten Leemhuis)
  2024-01-07  8:48   ` Linux regression tracking #update (Thorsten Leemhuis)
  2 siblings, 1 reply; 9+ messages in thread
From: Linux regression tracking #adding (Thorsten Leemhuis) @ 2023-12-23  8:35 UTC (permalink / raw)
  To: Ira Weiny, Coly Li
  Cc: Dan Williams, Jens Axboe, Xiao Ni, Geliang Tang, Hannes Reinecke,
	NeilBrown, Vishal L Verma, linux-block, nvdimm, linux-kernel

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 22.12.23 19:31, Ira Weiny wrote:
> Coly,
> 
> Yesterday I noticed that a few of our nvdimm tests were failing.  I bisected
> the problem to the following commit.
> 
> aa511ff8218b ("badblocks: switch to the improved badblock handling code") 
> 
> Reverting this patch fixed our tests.
> 
> I've also dug into the code a bit and I believe the algorithm for
> badblocks_check() is broken (not yet sure about the other calls).  At the
> very least I see the bb->p pointer being indexed with '-1'.  :-(
> 
> I did notice that this work was due to a bug report in badblock_set().
> Therefore, I'm not sure of that severity of that fix is vs a revert.  But
> at this point I'm not seeing an easy fix so I'm in favor of a revert.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced aa511ff8218b
#regzbot title badblocks: nvdimm tests were failing after switch to
impoved code
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Bug in commit aa511ff8218b ("badblocks: switch to the improved badblock handling
  2023-12-23  0:24   ` Ira Weiny
@ 2023-12-23  9:39     ` Coly Li
  2023-12-23 17:13       ` Ira Weiny
  0 siblings, 1 reply; 9+ messages in thread
From: Coly Li @ 2023-12-23  9:39 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Jens Axboe, Xiao Ni, Geliang Tang, Hannes Reinecke,
	NeilBrown, Vishal L Verma, linux-block, nvdimm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 5471 bytes --]

On Fri, Dec 22, 2023 at 04:24:16PM -0800, Ira Weiny wrote:
> Ira Weiny wrote:
> > Ira Weiny wrote:
> > > Coly,
> > > 
> > > Yesterday I noticed that a few of our nvdimm tests were failing.  I bisected
> > > the problem to the following commit.
> > > 
> > > aa511ff8218b ("badblocks: switch to the improved badblock handling code") 
> > > 
> > > Reverting this patch fixed our tests.
> > > 
> 
> [snip]
> 
> I added some prints[1] to try and see what is happening.  Perhaps this will
> help you.
> 
> ...
> [   99.919237] IKW set_badblock 00000000aa44c55d 8000 1
> [   99.921448] IKW set_badblock 00000000aa44c55d 8001 1
> [   99.924051] IKW set_badblock 00000000aa44c55d 8002 1
> [   99.926135] IKW set_badblock 00000000aa44c55d 8003 1
> [   99.928516] IKW set_badblock 00000000aa44c55d 8004 1
> [   99.930491] IKW set_badblock 00000000aa44c55d 8005 1
> [   99.932894] IKW set_badblock 00000000aa44c55d 8006 1
> [   99.936638] IKW set_badblock 00000000aa44c55d 8007 1
> [  100.999297] IKW _badblocks_check() 00000000aa44c55d s 8000 num 1
> [  101.000027]    IKW table count 1 shift 0
> [  101.000644]    IKW 0: off 8000 end 8008
> [  101.001271]    IKW prev 0, cnt 1
> [  101.002481]    IKW start 8000, len 1
> [  101.003464]    IKW front overlap 0
> [  101.004256] IKW rv 1
> 
> ...            ^^^^^^^^^
> <This is a valid failure as part of the test>
> ...
> 
> [  101.148783] IKW set_badblock 00000000721b4f3d 8000 1
> [  101.150629] IKW set_badblock 00000000721b4f3d 8001 1
> [  101.152315] IKW set_badblock 00000000721b4f3d 8002 1
> [  101.154544] IKW set_badblock 00000000721b4f3d 8003 1
> [  101.156238] IKW set_badblock 00000000721b4f3d 8004 1
> [  101.158310] IKW set_badblock 00000000721b4f3d 8005 1
> [  101.160196] IKW set_badblock 00000000721b4f3d 8006 1
> [  101.162158] IKW set_badblock 00000000721b4f3d 8007 1
> [  101.163543] IKW _badblocks_check() 00000000721b4f3d s 0 num 8
> [  101.164427]    IKW table count 1 shift 0
> [  101.165310]    IKW 0: off 8000 end 8008
> [  101.166398]    IKW prev -1, cnt 1
> [  101.167178]    IKW start 0, len 8
> [  101.168107] IKW rv 0
> [  101.168858] IKW _badblocks_check() 00000000721b4f3d s 8 num 8
> [  101.169814]    IKW table count 1 shift 0
> [  101.170547]    IKW 0: off 8000 end 8008
> [  101.171238]    IKW prev -1, cnt 1
> [  101.171985]    IKW start 8, len 8
> [  101.173007]    IKW front overlap -1  <== this is prev which is used to index bb->pages
> [  101.174157]    IKW prev -1, cnt 1
> [  101.175268]    IKW start 9, len 7
> [  101.176557] IKW rv -1
> 
> ...            ^^^^^^^^^
> This is where the failure occurs.
> ...
> 
> I think overlap_front() is not working correctly in this case.  And from
> my reading of the code I don't know how it would.  But overlap_front() is
> used elsewhere and I'm not confident in making the change.
> 
> Hope this helps,

Hi Ira,

The above information is accurate and very helpful, thank you!

From __badblocks_check(), the problematic code block is,
1303 re_check:
1304         bad.start = s;
1305         bad.len = sectors;
1306
1307         if (badblocks_empty(bb)) {
1308                 len = sectors;
1309                 goto update_sectors;
1310         }
1311
1312         prev = prev_badblocks(bb, &bad, hint);
1313
1314         /* start after all badblocks */
1315         if ((prev + 1) >= bb->count && !overlap_front(bb, prev, &bad)) {
1316                 len = sectors;
1317                 goto update_sectors;
1318         }
1319
1320         if (overlap_front(bb, prev, &bad)) {
1321                 if (BB_ACK(p[prev]))
1322                         acked_badblocks++;
1323                 else
1324                         unacked_badblocks++;
1325
1326                 if (BB_END(p[prev]) >= (s + sectors))
1327                         len = sectors;
1328                 else
1329                         len = BB_END(p[prev]) - s;
1330
1331                 if (set == 0) {
1332                         *first_bad = BB_OFFSET(p[prev]);
1333                         *bad_sectors = BB_LEN(p[prev]);
1334                         set = 1;
1335                 }
1336                 goto update_sectors;
1337         }
1338
1339         /* Not front overlap, but behind overlap */
1340         if ((prev + 1) < bb->count && overlap_behind(bb, &bad, prev + 1)) {
1341                 len = BB_OFFSET(p[prev + 1]) - bad.start;
1342                 hint = prev + 1;
1343                 goto update_sectors;
1344         }
1345
1346         /* not cover any badblocks range in the table */
1347         len = sectors;
1348
1349 update_sectors:

If the checking range is before all badblocks records in the badblocks table,
value -1 is returned from prev_badblock(). Code blocks between line 1314 and
line 1337 doesn't hanle the implicit '-1' value properly. Then counter
unacked_badblocks is increased at line 1324 mistakenly.

So the value prev should be checked and make sure '>= 0' before comparing
the checking range with a badblock record returned by prev_badblocks(). Other
wise it dones't make sense.

For badblocks_set() and badblocks_clear(), 'prev < 0' is explicitly checked,
value '-1' doesn't go though into following code.

Could you please apply and try the attached patch? Hope it may help a bit.

And now it is weekend time, you may be out of office and not able to access
the testing hardware. I will do more testing from myside and update more info
if necessary.

Thanks for the report and debug!

Coly Li

[debug patch snipped]

[-- Attachment #2: 0001-badblocks-avoid-checking-invalid-range-in-badblocks_.patch --]
[-- Type: text/plain, Size: 1445 bytes --]

From 0f81e197404653a4631f5d5857c63510e0534430 Mon Sep 17 00:00:00 2001
From: Coly Li <colyli@suse.de>
Date: Sat, 23 Dec 2023 17:24:51 +0800
Subject: [PATCH] badblocks: avoid checking invalid range in badblocks_check()

If prev_badblocks() returns '-1', it means no valid badblocks record
before the checking range. It doesn't make sense to check whether
the input checking range is overlapped with the non-existed invalid
front range.

This patch checkes whether 'prev >= 0' is true before calling
overlap_front(), to void such invalid operations.

Reported-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 block/badblocks.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/badblocks.c b/block/badblocks.c
index fc92d4e18aa3..db4ec8b9b2a8 100644
--- a/block/badblocks.c
+++ b/block/badblocks.c
@@ -1312,12 +1312,14 @@ static int _badblocks_check(struct badblocks *bb, sector_t s, int sectors,
 	prev = prev_badblocks(bb, &bad, hint);
 
 	/* start after all badblocks */
-	if ((prev + 1) >= bb->count && !overlap_front(bb, prev, &bad)) {
+	if ((prev >= 0) &&
+	    ((prev + 1) >= bb->count) && !overlap_front(bb, prev, &bad)) {
 		len = sectors;
 		goto update_sectors;
 	}
 
-	if (overlap_front(bb, prev, &bad)) {
+	/* Overlapped with front badblocks record */
+	if ((prev >= 0) && overlap_front(bb, prev, &bad)) {
 		if (BB_ACK(p[prev]))
 			acked_badblocks++;
 		else
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Bug in commit aa511ff8218b ("badblocks: switch to the improved badblock handling
  2023-12-23  9:39     ` Coly Li
@ 2023-12-23 17:13       ` Ira Weiny
  2023-12-24  0:18         ` Coly Li
  0 siblings, 1 reply; 9+ messages in thread
From: Ira Weiny @ 2023-12-23 17:13 UTC (permalink / raw)
  To: Coly Li, Ira Weiny, linan666
  Cc: Dan Williams, Jens Axboe, Xiao Ni, Geliang Tang, Hannes Reinecke,
	NeilBrown, Vishal L Verma, linux-block, nvdimm, linux-kernel

Coly Li wrote:

[snip]

> 
> Hi Ira,
> 
> The above information is accurate and very helpful, thank you!
> 
> From __badblocks_check(), the problematic code block is,
> 1303 re_check:
> 1304         bad.start = s;
> 1305         bad.len = sectors;
> 1306
> 1307         if (badblocks_empty(bb)) {
> 1308                 len = sectors;
> 1309                 goto update_sectors;
> 1310         }
> 1311
> 1312         prev = prev_badblocks(bb, &bad, hint);
> 1313
> 1314         /* start after all badblocks */
> 1315         if ((prev + 1) >= bb->count && !overlap_front(bb, prev, &bad)) {
> 1316                 len = sectors;
> 1317                 goto update_sectors;
> 1318         }
> 1319
> 1320         if (overlap_front(bb, prev, &bad)) {
> 1321                 if (BB_ACK(p[prev]))
> 1322                         acked_badblocks++;
> 1323                 else
> 1324                         unacked_badblocks++;
> 1325
> 1326                 if (BB_END(p[prev]) >= (s + sectors))
> 1327                         len = sectors;
> 1328                 else
> 1329                         len = BB_END(p[prev]) - s;
> 1330
> 1331                 if (set == 0) {
> 1332                         *first_bad = BB_OFFSET(p[prev]);
> 1333                         *bad_sectors = BB_LEN(p[prev]);
> 1334                         set = 1;
> 1335                 }
> 1336                 goto update_sectors;
> 1337         }
> 1338
> 1339         /* Not front overlap, but behind overlap */
> 1340         if ((prev + 1) < bb->count && overlap_behind(bb, &bad, prev + 1)) {
> 1341                 len = BB_OFFSET(p[prev + 1]) - bad.start;
> 1342                 hint = prev + 1;
> 1343                 goto update_sectors;
> 1344         }
> 1345
> 1346         /* not cover any badblocks range in the table */
> 1347         len = sectors;
> 1348
> 1349 update_sectors:
> 
> If the checking range is before all badblocks records in the badblocks table,
> value -1 is returned from prev_badblock(). Code blocks between line 1314 and
> line 1337 doesn't hanle the implicit '-1' value properly. Then counter
> unacked_badblocks is increased at line 1324 mistakenly.
> 
> So the value prev should be checked and make sure '>= 0' before comparing
> the checking range with a badblock record returned by prev_badblocks(). Other
> wise it dones't make sense.
> 
> For badblocks_set() and badblocks_clear(), 'prev < 0' is explicitly checked,
> value '-1' doesn't go though into following code.
> 
> Could you please apply and try the attached patch? Hope it may help a bit.
> 
> And now it is weekend time, you may be out of office and not able to access
> the testing hardware. I will do more testing from myside and update more info
> if necessary.
> 
> Thanks for the report and debug!
> 
> Coly Li
> 
> [debug patch snipped]

This debug patch does fix our tests.  Thanks!

But Nan has submitted a series to fix this as well.[1]

I'm going to test his series as well.

Thanks!
Ira

[1] https://lore.kernel.org/linux-block/20231223063728.3229446-1-linan666@huaweicloud.com/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Bug in commit aa511ff8218b ("badblocks: switch to the improved badblock handling
  2023-12-23 17:13       ` Ira Weiny
@ 2023-12-24  0:18         ` Coly Li
  0 siblings, 0 replies; 9+ messages in thread
From: Coly Li @ 2023-12-24  0:18 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Li Nan, Dan Williams, Jens Axboe, Xiao Ni, Geliang Tang,
	Hannes Reinecke, NeilBrown, Vishal L Verma, Linux Block Devices,
	nvdimm, Linux Kernel Mailing List



> 2023年12月24日 01:13,Ira Weiny <ira.weiny@intel.com> 写道:
> 
> Coly Li wrote:
> 
> [snip]
> 
>> 
>> Hi Ira,
>> 
>> The above information is accurate and very helpful, thank you!
>> 
>> From __badblocks_check(), the problematic code block is,
>> 1303 re_check:
>> 1304         bad.start = s;
>> 1305         bad.len = sectors;
>> 1306
>> 1307         if (badblocks_empty(bb)) {
>> 1308                 len = sectors;
>> 1309                 goto update_sectors;
>> 1310         }
>> 1311
>> 1312         prev = prev_badblocks(bb, &bad, hint);
>> 1313
>> 1314         /* start after all badblocks */
>> 1315         if ((prev + 1) >= bb->count && !overlap_front(bb, prev, &bad)) {
>> 1316                 len = sectors;
>> 1317                 goto update_sectors;
>> 1318         }
>> 1319
>> 1320         if (overlap_front(bb, prev, &bad)) {
>> 1321                 if (BB_ACK(p[prev]))
>> 1322                         acked_badblocks++;
>> 1323                 else
>> 1324                         unacked_badblocks++;
>> 1325
>> 1326                 if (BB_END(p[prev]) >= (s + sectors))
>> 1327                         len = sectors;
>> 1328                 else
>> 1329                         len = BB_END(p[prev]) - s;
>> 1330
>> 1331                 if (set == 0) {
>> 1332                         *first_bad = BB_OFFSET(p[prev]);
>> 1333                         *bad_sectors = BB_LEN(p[prev]);
>> 1334                         set = 1;
>> 1335                 }
>> 1336                 goto update_sectors;
>> 1337         }
>> 1338
>> 1339         /* Not front overlap, but behind overlap */
>> 1340         if ((prev + 1) < bb->count && overlap_behind(bb, &bad, prev + 1)) {
>> 1341                 len = BB_OFFSET(p[prev + 1]) - bad.start;
>> 1342                 hint = prev + 1;
>> 1343                 goto update_sectors;
>> 1344         }
>> 1345
>> 1346         /* not cover any badblocks range in the table */
>> 1347         len = sectors;
>> 1348
>> 1349 update_sectors:
>> 
>> If the checking range is before all badblocks records in the badblocks table,
>> value -1 is returned from prev_badblock(). Code blocks between line 1314 and
>> line 1337 doesn't hanle the implicit '-1' value properly. Then counter
>> unacked_badblocks is increased at line 1324 mistakenly.
>> 
>> So the value prev should be checked and make sure '>= 0' before comparing
>> the checking range with a badblock record returned by prev_badblocks(). Other
>> wise it dones't make sense.
>> 
>> For badblocks_set() and badblocks_clear(), 'prev < 0' is explicitly checked,
>> value '-1' doesn't go though into following code.
>> 
>> Could you please apply and try the attached patch? Hope it may help a bit.
>> 
>> And now it is weekend time, you may be out of office and not able to access
>> the testing hardware. I will do more testing from myside and update more info
>> if necessary.
>> 
>> Thanks for the report and debug!
>> 
>> Coly Li
>> 
>> [debug patch snipped]
> 
> This debug patch does fix our tests.  Thanks!
> 
> But Nan has submitted a series to fix this as well.[1]
> 
> I'm going to test his series as well.

Hi Ira,

Thanks for the very quick response, and the positive result. Now I compose a official patch and submit to Jens.

Coly Li

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Bug in commit aa511ff8218b ("badblocks: switch to the improved badblock handling
  2023-12-23  8:35 ` Linux regression tracking #adding (Thorsten Leemhuis)
@ 2024-01-07  8:48   ` Linux regression tracking #update (Thorsten Leemhuis)
  0 siblings, 0 replies; 9+ messages in thread
From: Linux regression tracking #update (Thorsten Leemhuis) @ 2024-01-07  8:48 UTC (permalink / raw)
  To: Linux kernel regressions list; +Cc: linux-block, nvdimm, linux-kernel

On 23.12.23 09:35, Linux regression tracking #adding (Thorsten Leemhuis)
wrote:
> On 22.12.23 19:31, Ira Weiny wrote:
>> Coly,
>>
>> Yesterday I noticed that a few of our nvdimm tests were failing.  I bisected
>> the problem to the following commit.
>>
>> aa511ff8218b ("badblocks: switch to the improved badblock handling code") 
>>
>> Reverting this patch fixed our tests.
>
> #regzbot ^introduced aa511ff8218b

#regzbot introduced 3ea3354cb9f0
#regzbot fix: 146e843f6b0927
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-01-07  8:48 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-22 18:31 Bug in commit aa511ff8218b ("badblocks: switch to the improved badblock handling Ira Weiny
2023-12-22 18:57 ` Ira Weiny
2023-12-23  0:24   ` Ira Weiny
2023-12-23  9:39     ` Coly Li
2023-12-23 17:13       ` Ira Weiny
2023-12-24  0:18         ` Coly Li
2023-12-23  6:52 ` Coly Li
2023-12-23  8:35 ` Linux regression tracking #adding (Thorsten Leemhuis)
2024-01-07  8:48   ` Linux regression tracking #update (Thorsten Leemhuis)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox