[LSF/MM TOPIC] Badblocks checking/representation in filesystems

Linux-NVDIMM Archive on lore.kernel.org

* [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-13 21:40 Verma, Vishal L
  0 siblings, 0 replies; 19+ messages in thread
From: Verma, Vishal L @ 2017-01-13 21:40 UTC (permalink / raw)
  To: lsf-pc@lists.linux-foundation.org
  Cc: linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-nvdimm@lists.01.org

The current implementation of badblocks, where we consult the badblocks
list for every IO in the block driver, works and is a last-resort
failsafe, but from a user perspective it isn't the easiest interface to
work with.

A while back, Dave Chinner had suggested a move towards smarter
handling, and I posted initial RFC patches [1], but since then the topic
hasn't really moved forward.

I'd like to propose and have a discussion about the following new
functionality:

1. Filesystems develop a native representation of badblocks. For
example, in xfs, this would (presumably) be linked to the reverse
mapping btree. The filesystem representation has the potential to be 
more efficient than the block driver doing the check, as the fs can
check the IO happening on a file against just that file's range. In
contrast, today, the block driver checks against the whole block device
range for every IO. On encountering badblocks, the filesystem can
generate a better notification/error message that points the user to 
(file, offset) as opposed to the block driver, which can only provide
(block-device, sector).

2. The block layer adds a notifier to badblock addition/removal
operations, which the filesystem subscribes to, and uses to maintain its
badblocks accounting. (This part is implemented as a proof of concept in
the RFC mentioned above [1]).

3. The filesystem has a way of telling the block driver (a flag? a
different/new interface?) that it is responsible for badblock checking
so that the driver doesn't have to do its check. The driver checking
will have to remain in place as a catch-all for filesystems/interfaces
that don't or aren't capable of doing the checks at a higher layer.
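
To make items 1 and 2 concrete, here is a minimal user-space sketch, not
kernel code. All names in it (Extent, BlockLayer, Filesystem, the
callback signature, 512-byte sectors) are assumptions made up for
illustration; a real filesystem would use its own mapping structures,
such as the xfs reverse mapping btree mentioned above. Item 3 (telling
the driver to skip its own check) is not modelled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple

SECTOR_SIZE = 512  # assumption: 512-byte sectors


@dataclass
class Extent:
    """Hypothetical record: a run of device sectors owned by one file."""
    file: str
    file_offset: int   # byte offset within the file
    start_sector: int
    nr_sectors: int


class BlockLayer:
    """Toy block layer that publishes badblock add/remove events (item 2)."""

    def __init__(self) -> None:
        self.badblocks: Set[int] = set()
        self.subscribers: List[Callable[[str, int], None]] = []

    def subscribe(self, callback: Callable[[str, int], None]) -> None:
        self.subscribers.append(callback)

    def add_bad(self, sector: int) -> None:
        self.badblocks.add(sector)
        self._notify("add", sector)

    def clear_bad(self, sector: int) -> None:
        self.badblocks.discard(sector)
        self._notify("remove", sector)

    def _notify(self, op: str, sector: int) -> None:
        for callback in self.subscribers:
            callback(op, sector)


class Filesystem:
    """Toy filesystem keeping its own per-file bad-offset table (item 1)."""

    def __init__(self, extents: List[Extent], blk: BlockLayer) -> None:
        self.extents = extents
        self.bad_offsets: Dict[str, Set[int]] = {}
        blk.subscribe(self.on_badblock_event)

    def owner_of(self, sector: int) -> Optional[Tuple[str, int]]:
        """Reverse-map a device sector to (file, byte offset), if owned."""
        for ext in self.extents:
            if ext.start_sector <= sector < ext.start_sector + ext.nr_sectors:
                delta = (sector - ext.start_sector) * SECTOR_SIZE
                return ext.file, ext.file_offset + delta
        return None

    def on_badblock_event(self, op: str, sector: int) -> None:
        owner = self.owner_of(sector)
        if owner is None:
            return  # free space or metadata: not modelled in this sketch
        name, offset = owner
        offsets = self.bad_offsets.setdefault(name, set())
        if op == "add":
            offsets.add(offset)
        else:
            offsets.discard(offset)

    def first_bad_offset(self, name: str, offset: int,
                         length: int) -> Optional[int]:
        """Check a read of [offset, offset + length) against this file only."""
        hits = [o for o in self.bad_offsets.get(name, ())
                if o < offset + length and o + SECTOR_SIZE > offset]
        return min(hits) if hits else None
```

The point of the sketch is only that a lookup touches one file's table
and that an error can be reported as (file, offset) rather than
(block-device, sector).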

Additionally, I saw some discussion about logical depop on the lists
again, and I was involved with discussions last year about expanding the
badblocks infrastructure for this use. If that is a topic again this
year, I'd like to be involved in it too.

I'm also interested in participating in any other pmem/NVDIMM related
discussions.

Thank you,
	-Vishal

[1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm



* RE: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
       [not found] ` <b9rbflutjt10mb4ofherta8j.1484345610771@email.android.com>
@ 2017-01-14  0:00   ` Slava Dubeyko
  2017-01-14  0:49     ` Vishal Verma
  0 siblings, 1 reply; 19+ messages in thread
From: Slava Dubeyko @ 2017-01-14  0:00 UTC (permalink / raw)
  To: vishal.l.verma@intel.com
  Cc: linux-block@vger.kernel.org, Linux FS Devel,
	lsf-pc@lists.linux-foundation.org, Viacheslav Dubeyko,
	linux-nvdimm@lists.01.org

---- Original Message ----
Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
Sent: Jan 13, 2017 1:40 PM
From: "Verma, Vishal L" <vishal.l.verma@intel.com>
To: lsf-pc@lists.linux-foundation.org
Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org

> The current implementation of badblocks, where we consult the badblocks
> list for every IO in the block driver works, and is a last option
> failsafe, but from a user perspective, it isn't the easiest interface to
> work with.

As I remember, the FAT and HFS+ specifications describe a bad blocks
(physical sectors) table. I believe this table was used for floppy media.
But in the end it became a completely obsolete artifact, because most
storage devices are reliable enough. Why do you need to expose bad blocks
at the file system level? Do you expect that the next generation of NVM
memory will be so unreliable that the file system needs to manage bad
blocks? What about erasure coding schemes? Does the file system really need
to suffer from the bad block issue?

Usually, we use LBAs, and it is the responsibility of the storage device to
map a bad physical block/page/sector onto a valid one. Do you mean that we
have direct access to physical NVM memory addresses? But it looks like we
can hit a "bad block" issue even when we access data in a page cache memory
page (if we use NVM memory for the page cache, of course). So, what do you
mean by the "bad block" issue?

> 
> A while back, Dave Chinner had suggested a move towards smarter
> handling, and I posted initial RFC patches [1], but since then the topic
> hasn't really moved forward.
> 
> I'd like to propose and have a discussion about the following new
> functionality:
> 
> 1. Filesystems develop a native representation of badblocks. For
> example, in xfs, this would (presumably) be linked to the reverse
> mapping btree. The filesystem representation has the potential to be 
> more efficient than the block driver doing the check, as the fs can
> check the IO happening on a file against just that file's range. 

What do you mean by "file system can check the IO happening on a file"?
Do you mean read or write operations? What about metadata?

If we are talking about discovering a bad block on a read operation, then
few modern file systems are able to survive it, whether the block holds
metadata or user data. Let's imagine that we have a really mature file
system driver: what does it mean to encounter a bad block? The failure
to read a logical block of some metadata (bad block) means that we are
unable to extract some part of a metadata structure. From the file system
driver's point of view, it looks like the file system is corrupted: we need
to stop file system operations and, finally, check and recover the file
system volume by means of the fsck tool. If we find a bad block in some
user file then, again, it looks like an issue. Some file systems simply
return "unrecovered read error". Others, theoretically, are able
to survive because of snapshots, for example. But, anyway, it will end up
as a read-only mount state and the user will need to resolve the
trouble by hand.

If we are talking about discovering a bad block during a write operation
then, again, we are in trouble. Usually, we use an asynchronous model of
write/flush operations. First, we prepare a consistent state of all our
metadata structures in memory. The flush operations for metadata and user
data can happen at different times. So what should be done if we discover a
bad block for any piece of metadata or user data? Simply tracking bad blocks
is not enough at all. Let's consider user data first. If we cannot write
some file's block successfully, then we have two options: (1) forget about
this piece of data; (2) try to change the LBA associated with this piece of
data. Re-allocating the LBA for a discovered bad block (user data case)
sounds like a real pain, because you need to rebuild the metadata that
tracks the location of this part of the file. And it sounds practically
impossible for an LFS file system, for example. If we have trouble flushing
any part of the metadata, then it sounds like a complete disaster for any
file system.

Are you really sure that the file system should handle the bad block issue?

>In contrast, today, the block driver checks against the whole block device
> range for every IO. On encountering badblocks, the filesystem can
> generate a better notification/error message that points the user to 
> (file, offset) as opposed to the block driver, which can only provide
> (block-device, sector).
>
> 2. The block layer adds a notifier to badblock addition/removal
> operations, which the filesystem subscribes to, and uses to maintain its
> badblocks accounting. (This part is implemented as a proof of concept in
> the RFC mentioned above [1]).

I am not sure that any bad block notification during/after an IO operation
is valuable for the file system. Maybe it could help if the file system
simply knew about a bad block before allocating the logical block.
But what subsystem will discover bad blocks before any IO operations?
How will the file system receive this information or some bad block table?
I am not convinced that the suggested badblocks approach is really feasible.
Also, I am not sure that the file system should see the bad blocks at all.
Why can't the hardware manage this issue for us?

Thanks,
Vyacheslav Dubeyko.

Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer:

This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system.

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-14  0:00   ` Slava Dubeyko
@ 2017-01-14  0:49     ` Vishal Verma
  2017-01-16  2:27       ` Slava Dubeyko
  2017-01-17  6:33       ` Darrick J. Wong
  0 siblings, 2 replies; 19+ messages in thread
From: Vishal Verma @ 2017-01-14  0:49 UTC (permalink / raw)
  To: Slava Dubeyko
  Cc: linux-block@vger.kernel.org, Linux FS Devel,
	lsf-pc@lists.linux-foundation.org, Viacheslav Dubeyko,
	linux-nvdimm@lists.01.org

On 01/14, Slava Dubeyko wrote:
> 
> ---- Original Message ----
> Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> Sent: Jan 13, 2017 1:40 PM
> From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> To: lsf-pc@lists.linux-foundation.org
> Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
> 
> > The current implementation of badblocks, where we consult the badblocks
> > list for every IO in the block driver works, and is a last option
> > failsafe, but from a user perspective, it isn't the easiest interface to
> > work with.
> 
> As I remember, FAT and HFS+ specifications contain description of bad blocks
> (physical sectors) table. I believe that this table was used for the case of
> floppy media. But, finally, this table becomes to be the completely obsolete
> artefact because mostly storage devices are reliably enough. Why do you need
> in exposing the bad blocks on the file system level?  Do you expect that next
> generation of NVM memory will be so unreliable that file system needs to manage
> bad blocks? What's about erasure coding schemes? Do file system really need to suffer
> from the bad block issue? 
> 
> Usually, we are using LBAs and it is the responsibility of storage device to map
> a bad physical block/page/sector into valid one. Do you mean that we have
> access to physical NVM memory address directly? But it looks like that we can
> have a "bad block" issue even we will access data into page cache's memory
> page (if we will use NVM memory for page cache, of course). So, what do you
> imply by "bad block" issue? 

We don't have direct physical access to the device's address space, in
the sense the device is still free to perform remapping of chunks of NVM
underneath us. The problem is that when a block or address range (as
small as a cache line) goes bad, the device maintains a poison bit for
every affected cache line. Behind the scenes, it may have already
remapped the range, but the cache line poison has to be kept so that
there is a notification to the user/owner of the data that something has
been lost. Since NVM is byte addressable memory sitting on the memory
bus, such a poisoned cache line results in memory errors and SIGBUSes,
whereas with traditional storage an app will get nice and friendly
(relatively speaking..) -EIOs. The whole badblocks implementation was
done so that the driver can intercept IO (i.e. reads) to _known_ bad
locations, and short-circuit them with an EIO. If the driver doesn't
catch these, the reads will turn into a memory bus access, and the
poison will cause a SIGBUS.

This effort is to try and make this badblock checking smarter - and try
and reduce the penalty on every IO to a smaller range, which only the
filesystem can do.
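
As a rough illustration of the driver-side check described above, here
is a small user-space sketch. The class and function names are
hypothetical and this is not the kernel's badblocks code; it only shows
the idea of testing every IO against a sorted list of known-bad ranges
and failing it with -EIO before it touches the media.

```python
import bisect
import errno
from typing import List, Tuple


class BadBlocks:
    """Sorted, non-overlapping list of known-bad sector ranges [start, end)."""

    def __init__(self) -> None:
        self.ranges: List[Tuple[int, int]] = []

    def add(self, start: int, count: int) -> None:
        """Record a bad range, merging with overlapping or adjacent ones."""
        end = start + count
        merged = []
        for s, e in self.ranges:
            if e < start or s > end:
                merged.append((s, e))        # unrelated range, keep as-is
            else:
                start, end = min(start, s), max(end, e)
        merged.append((start, end))
        merged.sort()
        self.ranges = merged

    def overlaps(self, start: int, count: int) -> bool:
        """True if [start, start + count) touches any known-bad range."""
        end = start + count
        # Index of the first range whose start is >= end; only the range
        # just before it can still overlap the request.
        i = bisect.bisect_left(self.ranges, (end,))
        return i > 0 and self.ranges[i - 1][1] > start


def driver_read(bb: BadBlocks, sector: int, count: int) -> int:
    """Return -EIO for reads that hit a known-bad range, else sectors read."""
    if bb.overlaps(sector, count):
        return -errno.EIO   # short-circuit: never touch the poisoned lines
    return count            # the real media access would happen here
```

Note that the lookup here runs against the list for the whole device on
every IO, which is the per-IO cost that a filesystem-level
representation is meant to narrow down.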

> 
> > 
> > A while back, Dave Chinner had suggested a move towards smarter
> > handling, and I posted initial RFC patches [1], but since then the topic
> > hasn't really moved forward.
> > 
> > I'd like to propose and have a discussion about the following new
> > functionality:
> > 
> > 1. Filesystems develop a native representation of badblocks. For
> > example, in xfs, this would (presumably) be linked to the reverse
> > mapping btree. The filesystem representation has the potential to be 
> > more efficient than the block driver doing the check, as the fs can
> > check the IO happening on a file against just that file's range. 
> 
> What do you mean by "file system can check the IO happening on a file"?
> Do you mean read or write operation? What's about metadata?

For the purpose described above, i.e. returning early EIOs when
possible, this will be limited to reads and metadata reads. If we're
about to do a metadata read, and realize the block(s) about to be read
are on the badblocks list, then we do the same thing as when we discover
other kinds of metadata corruption.

> 
> If we are talking about the discovering a bad block on read operation then
> rare modern file system is able to survive as for the case of metadata as
> for the case of user data. Let's imagine that we have really mature file
> system driver then what does it mean to encounter a bad block? The failure
> to read a logical block of some metadata (bad block) means that we are
> unable to extract some part of a metadata structure. From file system
> driver point of view, it looks like that our file system is corrupted, we need
> to stop the file system operations and, finally, to check and recover file
> system volume by means of fsck tool. If we find a bad block for some
> user file then, again, it looks like an issue. Some file systems simply
> return "unrecovered read error". Another one, theoretically, is able
> to survive because of snapshots, for example. But, anyway, it will look
> like as Read-Only mount state and the user will need to resolve such
> trouble by hands.

As far as I can tell, all of these things remain the same. The goal here
isn't to survive more NVM badblocks than we would've before, and lost
data or lost metadata will continue to have the same consequences as
before, and will need the same recovery actions/intervention as before.
The goal is to make the failure model similar to what users expect
today and, as much as possible, to make recovery actions similarly
intuitive.

> 
> If we are talking about discovering a bad block during write operation then,
> again, we are in trouble. Usually, we are using asynchronous model
> of write/flush operation. We are preparing the consistent state of all our
> metadata structures in the memory, at first. The flush operations for metadata
> and user data can be done in different times. And what should be done if we
> discover bad block for any piece of metadata or user data? Simple tracking of
> bad blocks is not enough at all. Let's consider user data, at first. If we cannot
> write some file's block successfully then we have two ways: (1) forget about
> this piece of data; (2) try to change the associated LBA for this piece of data.
> The operation of re-allocation LBA number for discovered bad block
> (user data case) sounds as real pain. Because you need to rebuild the metadata
> that track the location of this part of file. And it sounds as practically
> impossible operation, for the case of LFS file system, for example.
> If we have trouble with flushing any part of metadata then it sounds as
> complete disaster for any file system.

Writes can get more complicated in certain cases. If it is a regular
page cache writeback, or any aligned write that goes through the block
driver, that is completely fine. The block driver will check whether the
block was previously marked as bad and, if so, do a "clear poison" operation
(defined in the ACPI spec), which tells the firmware that the poison bit
is now OK to be cleared, and then write the new data. This also removes the
block from the badblocks list, and in this scheme, triggers a
notification to the filesystem that it too can remove the block from its
accounting. mmap writes and DAX can get more complicated, and at times
they will just trigger a SIGBUS, and there's no way around that.
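
The aligned-write path above can be sketched in user space as follows.
This is a toy model under stated assumptions: ToyPmem, PoisonConsumed
and the listener callback are invented names, and the "clear poison"
step is reduced to removing a sector from a set rather than issuing the
ACPI-defined operation.

```python
import errno


class PoisonConsumed(Exception):
    """Stand-in for the memory error / SIGBUS raised on a poisoned access."""


class ToyPmem:
    """Toy pmem device with per-sector poison and a driver badblocks list."""

    def __init__(self):
        self.data = {}          # sector -> bytes
        self.poison = set()     # sectors the "hardware" has poisoned
        self.badblocks = set()  # sectors the "driver" knows are bad
        self.listeners = []     # hypothetical filesystem callbacks

    def mark_bad(self, sector):
        """A known error: poisoned in hardware and tracked by the driver."""
        self.poison.add(sector)
        self.badblocks.add(sector)

    def read(self, sector):
        if sector in self.badblocks:
            # Known bad location: fail early with EIO.
            raise OSError(errno.EIO, "known bad block")
        if sector in self.poison:
            # Latent error the driver did not know about.
            raise PoisonConsumed(sector)
        return self.data.get(sector, b"")

    def write(self, sector, payload):
        if sector in self.badblocks:
            self.poison.discard(sector)     # stand-in for "clear poison"
            self.badblocks.discard(sector)  # drop from the badblocks list
            for callback in self.listeners:
                callback("remove", sector)  # let the filesystem update too
        self.data[sector] = payload
```

A full-sector overwrite is what makes clearing safe in this model: the
old contents are being replaced anyway, so nothing is lost by dropping
the poison marker.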

> 
> Are you really sure that file system should process bad block issue?
> 
> >In contrast, today, the block driver checks against the whole block device
> > range for every IO. On encountering badblocks, the filesystem can
> > generate a better notification/error message that points the user to 
> > (file, offset) as opposed to the block driver, which can only provide
> > (block-device, sector).
> >
> > 2. The block layer adds a notifier to badblock addition/removal
> > operations, which the filesystem subscribes to, and uses to maintain its
> > badblocks accounting. (This part is implemented as a proof of concept in
> > the RFC mentioned above [1]).
> 
> I am not sure that any bad block notification during/after IO operation
> is valuable for file system. Maybe, it could help if file system simply will
> know about bad block beforehand the operation of logical block allocation.
> But what subsystem will discover bad blocks before any IO operations?
> How file system will receive information or some bad block table?

The driver populates its badblocks lists whenever an Address Range Scrub
is started (also via ACPI methods). This is always done at
initialization time, so that it can build an in-memory representation of
the badblocks. Additionally, this can also be triggered manually. And
finally badblocks can also get populated for new latent errors when a
machine check exception occurs. All of these can trigger notification to
the file system without actual user reads happening.

> I am not convinced that suggested badblocks approach is really feasible.
> Also I am not sure that file system should see the bad blocks at all.
> Why hardware cannot manage this issue for us?

Hardware does manage the actual badblocks issue for us in the sense that
when it discovers a badblock it will do the remapping. But since this is
on the memory bus, and has different error signatures than applications
are used to, we want to make the error handling similar to the existing
storage model.


* RE: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-14  0:49     ` Vishal Verma
@ 2017-01-16  2:27       ` Slava Dubeyko
  2017-01-17  6:33       ` Darrick J. Wong
  1 sibling, 0 replies; 19+ messages in thread
From: Slava Dubeyko @ 2017-01-16  2:27 UTC (permalink / raw)
  To: Vishal Verma
  Cc: linux-block@vger.kernel.org, Linux FS Devel,
	lsf-pc@lists.linux-foundation.org, Viacheslav Dubeyko,
	linux-nvdimm@lists.01.org

-----Original Message-----
From: Vishal Verma [mailto:vishal.l.verma@intel.com] 
Sent: Friday, January 13, 2017 4:49 PM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: lsf-pc@lists.linux-foundation.org; linux-nvdimm@lists.01.org; linux-block@vger.kernel.org; Linux FS Devel <linux-fsdevel@vger.kernel.org>; Viacheslav Dubeyko <slava@dubeyko.com>
Subject: Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems

<skipped>

> We don't have direct physical access to the device's address space, in the sense
> the device is still free to perform remapping of chunks of NVM underneath us.
> The problem is that when a block or address range (as small as a cache line) goes bad,
> the device maintains a poison bit for every affected cache line. Behind the scenes,
> it may have already remapped the range, but the cache line poison has to be kept so that
> there is a notification to the user/owner of the data that something has been lost.
> Since NVM is byte addressable memory sitting on the memory bus, such a poisoned
> cache line results in memory errors and SIGBUSes.
> Compared to traditional storage where an app will get nice and friendly (relatively speaking..) -EIOs.
> The whole badblocks implementation was done so that the driver can intercept IO (i.e. reads)
> to _known_ bad locations, and short-circuit them with an EIO. If the driver doesn't catch these,
> the reads will turn into a memory bus access, and the poison will cause a SIGBUS.
>
> This effort is to try and make this badblock checking smarter - and try and reduce the penalty
> on every IO to a smaller range, which only the filesystem can do.

I am still slightly puzzled, and I cannot understand why the situation looks like a dead end.
As far as I can see, first of all, an NVM device is able to use hardware-based LDPC,
Reed-Solomon error correction, or any other fancy code. That could provide a basis for error
correction. It can also provide a way to estimate the BER value. So, if an NVM memory
address range degrades gradually (over weeks or months) then, practically, it's possible
to remap and migrate the affected address ranges in the background. Otherwise,
if NVM memory is so unreliable that an address range can degrade within seconds or minutes,
then who will use such NVM memory?

OK. Let's imagine that the NVM memory device has no internal hardware-based error correction
scheme. The next level of defense could be an erasure coding scheme at the device driver level.
So, any piece of data can be protected by parities, and the device driver will be responsible
for managing the erasure coding scheme. It will increase read latency when an affected memory
page has to be recovered. But, in the end, all recovery activity will happen behind the scenes
and the file system will be unaware of it.

If you are not going to provide any erasure coding or error correction scheme, then it's a really
bad case. The fsck tool is not an everyday tool but the last resort. If you are going to rely on
the fsck tool, then simply forget about using your hardware. Some file systems have no fsck
tool at all. Some people really believe that a file system has to work without the support of
an fsck tool. Even if a mature file system has a reliable fsck tool, the probability of recovering
the file system is very low in the case of serious metadata corruption. So, it means that you are
suggesting a technique where we will lose whole file system volumes on a regular basis without
any hope of recovering the data. Even if the file system has snapshots then, again, we have no hope
because we can suffer a read error during operations on the snapshot too.

But if we have support for some erasure coding scheme and the NVM device discovers a poisoned
cache line in some memory page then, I suppose, the situation could look like a page fault,
and the memory subsystem would need to re-read the page, with the memory page's content
recovered in the background.

It sounds to me like we simply have some poorly designed hardware, and it is impossible
to push such an issue up to the file system level. I believe that such an issue can be managed
by the block device or DAX subsystem in the presence of some erasure coding scheme. Otherwise, no
file system is able to survive in such a wild environment, because I assume that a file
system volume will be in an unrecoverable state in 50% (or significantly more) of the cases
where a bad block is discovered. Any damage to a metadata block can result in a severely
inconsistent state of the file system's metadata structures, and it's a very non-trivial task
to recover a consistent state of those structures after losing some part of them.

> > > 
> > > A while back, Dave Chinner had suggested a move towards smarter 
> > > handling, and I posted initial RFC patches [1], but since then the 
> > > topic hasn't really moved forward.
> > > 
> > > I'd like to propose and have a discussion about the following new
> > > functionality:
> > > 
> > > 1. Filesystems develop a native representation of badblocks. For 
> > > example, in xfs, this would (presumably) be linked to the reverse 
> > > mapping btree. The filesystem representation has the potential to be 
> > > more efficient than the block driver doing the check, as the fs can 
> > > check the IO happening on a file against just that file's range.
> > 
> > What do you mean by "file system can check the IO happening on a file"?
> > Do you mean read or write operation? What's about metadata?
>
> For the purpose described above, i.e. returning early EIOs when possible,
> this will be limited to reads and metadata reads. If we're about to do a metadata
> read, and realize the block(s) about to be read are on the badblocks list, then
> we do the same thing as when we discover other kinds of metadata corruption.

Frankly speaking, I cannot follow how a badblock list can help the file system
driver to survive. Every time the file system driver encounters a bad
block, it stops its activity with: (1) an unrecovered read error; (2) a remount
in RO mode; (3) a simple crash. It means that we need to unmount the file system volume
(if the driver hasn't crashed) and run the fsck tool. So, the file system driver cannot gain
anything from tracking bad blocks in a special list because, mostly, it will stop regular
operation when it accesses a bad block. Even if the file system driver extracts
the badblock list from some low-level driver, what can the file system
driver do with it? Let's imagine that the file system driver knows that LBA#N is bad:
then the best behavior will be simply to panic or remount in RO state, nothing more.

<skipped>

> As far as I can tell, all of these things remain the same. The goal here isn't to survive
> more NVM badblocks than we would've before, and lost data or
> lost metadata will continue to have the same consequences as before, and
> will need the same recovery actions/intervention as before.
> The goal is to make the failure model similar to what users expect
> today, and as much as possible make recovery actions too similarly intuitive.

OK. Nowadays, the user expects that hardware is reliable enough. It's the same
situation as with NAND flash. NAND flash can have bad erase blocks, but the FTL
hides this reality from the file system. Otherwise, the file system would have to be
NAND flash oriented and able to manage the presence of bad erase blocks.
Your suggestion will dramatically increase the probability of an unrecoverable
file system volume. So, it's hard to see the point of such an approach.

> Writes can get more complicated in certain cases. If it is a regular page cache
> writeback, or any aligned write that goes through the block driver, that is completely
> fine. The block driver will check that the block was previously marked as bad,
> do a "clear poison" operation (defined in the ACPI spec), which tells the firmware that
> the poison bit is now OK to be cleared, and writes the new data. This also removes
> the block from the badblocks list, and in this scheme, triggers a notification to
> the filesystem that it too can remove the block from its accounting.
> mmap writes and DAX can get more complicated, and at times they will just
> trigger a SIGBUS, and there's no way around that.

If page cache writeback finishes with the data written to a valid location, then
there is no trouble here at all. But I assume that the critical point will be on the read path,
because we will still have the same troubles as I mentioned above.

<skipped>

> Hardware does manage the actual badblocks issue for us
> in the sense that when it discovers a badblock it will do the remapping.
> But since this is on the memory bus, and has different error signatures
> than applications are used to, we want to make the error handling
> similar to the existing storage model.

So, if the hardware is able to remap bad portions of a memory page,
then it is always possible to see a valid logical page. The key point here is
that the hardware controller should manage migration of data from aged/pre-bad
NVM memory ranges into valid ones. Or it needs to use some fancy
error-correction techniques or erasure coding schemes.

Thanks,
Vyacheslav Dubeyko.


* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-14  0:49     ` Vishal Verma
  2017-01-16  2:27       ` Slava Dubeyko
@ 2017-01-17  6:33       ` Darrick J. Wong
  2017-01-17 21:35         ` Vishal Verma
  1 sibling, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2017-01-17  6:33 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Slava Dubeyko, linux-nvdimm@lists.01.org,
	linux-block@vger.kernel.org, Viacheslav Dubeyko, Linux FS Devel,
	lsf-pc@lists.linux-foundation.org

On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> On 01/14, Slava Dubeyko wrote:
> > 
> > ---- Original Message ----
> > Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> > Sent: Jan 13, 2017 1:40 PM
> > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> > To: lsf-pc@lists.linux-foundation.org
> > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
> > 
> > > The current implementation of badblocks, where we consult the badblocks
> > > list for every IO in the block driver works, and is a last option
> > > failsafe, but from a user perspective, it isn't the easiest interface to
> > > work with.
> > 
> > As I remember, FAT and HFS+ specifications contain description of bad blocks
> > (physical sectors) table. I believe that this table was used for the case of
> > floppy media. But, finally, this table becomes to be the completely obsolete
> > artefact because mostly storage devices are reliably enough. Why do you need

ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
doesn't support(??) extents or 64-bit filesystems, and might just be a
vestigial organ at this point.  XFS doesn't have anything to track bad
blocks currently....

> > to expose the bad blocks at the file system level?  Do you expect that the next
> > generation of NVM memory will be so unreliable that the file system needs to manage
> > bad blocks? What about erasure coding schemes? Does the file system really need to suffer
> > from the bad block issue? 
> > 
> > Usually, we are using LBAs and it is the responsibility of storage device to map
> > a bad physical block/page/sector into valid one. Do you mean that we have
> > access to physical NVM memory address directly? But it looks like that we can
> > have a "bad block" issue even we will access data into page cache's memory
> > page (if we will use NVM memory for page cache, of course). So, what do you
> > imply by "bad block" issue? 
> 
> We don't have direct physical access to the device's address space, in
> the sense the device is still free to perform remapping of chunks of NVM
> underneath us. The problem is that when a block or address range (as
> small as a cache line) goes bad, the device maintains a poison bit for
> every affected cache line. Behind the scenes, it may have already
> remapped the range, but the cache line poison has to be kept so that
> there is a notification to the user/owner of the data that something has
> been lost. Since NVM is byte addressable memory sitting on the memory
> bus, such a poisoned cache line results in memory errors and SIGBUSes.
> Compared to traditional storage where an app will get nice and friendly
> (relatively speaking..) -EIOs. The whole badblocks implementation was
> done so that the driver can intercept IO (i.e. reads) to _known_ bad
> locations, and short-circuit them with an EIO. If the driver doesn't
> catch these, the reads will turn into a memory bus access, and the
> poison will cause a SIGBUS.

"driver" ... you mean XFS?  Or do you mean the thing that makes pmem
look kind of like a traditional block device? :)

> This effort is to try and make this badblock checking smarter - and try
> and reduce the penalty on every IO to a smaller range, which only the
> filesystem can do.

Though... now that XFS merged the reverse mapping support, I've been
wondering if there'll be a resubmission of the device errors callback?
It still would be useful to be able to inform the user that part of
their fs has gone bad, or, better yet, if the buffer is still in memory
someplace else, just write it back out.

Or I suppose if we had some kind of raid1 set up between memories we
could read one of the other copies and rewrite it into the failing
region immediately.

> > > A while back, Dave Chinner had suggested a move towards smarter
> > > handling, and I posted initial RFC patches [1], but since then the topic
> > > hasn't really moved forward.
> > > 
> > > I'd like to propose and have a discussion about the following new
> > > functionality:
> > > 
> > > 1. Filesystems develop a native representation of badblocks. For
> > > example, in xfs, this would (presumably) be linked to the reverse
> > > mapping btree. The filesystem representation has the potential to be 
> > > more efficient than the block driver doing the check, as the fs can
> > > check the IO happening on a file against just that file's range. 

OTOH that means we'd have to check /every/ file IO request against the
rmapbt, which will make things reaaaaaally slow.  I suspect it might be
preferable just to let the underlying pmem driver throw an error at us.

(Or possibly just cache the bad extents in memory.)
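To make the "cache the bad extents in memory" idea concrete, here is a
minimal userspace sketch, not kernel code. It assumes the bad ranges are
kept as a sorted, non-overlapping array so that each IO costs one binary
search instead of an rmapbt walk. The names struct bad_extent and
io_hits_bad_extent() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory record of one bad range, in sectors. */
struct bad_extent {
	uint64_t start;	/* first bad sector */
	uint64_t len;	/* number of bad sectors, must be > 0 */
};

/*
 * Return true if the IO range [sector, sector + nr) overlaps any cached
 * bad extent. 'tbl' must be sorted by start and non-overlapping, which
 * also makes the extent end points sorted.
 */
static bool io_hits_bad_extent(const struct bad_extent *tbl, size_t n,
			       uint64_t sector, uint64_t nr)
{
	size_t lo = 0, hi = n;

	assert(nr > 0);

	/* Binary search for the first extent that ends after 'sector'. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tbl[mid].start + tbl[mid].len <= sector)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* That extent overlaps the IO only if it starts before the IO ends. */
	return lo < n && tbl[lo].start < sector + nr;
}
```

With an empty table the lookup returns false straight away, so the
common case of a device with no known errors stays cheap.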

> > What do you mean by "file system can check the IO happening on a file"?
> > Do you mean a read or a write operation? What about metadata?
> 
> For the purpose described above, i.e. returning early EIOs when
> possible, this will be limited to reads and metadata reads. If we're
> about to do a metadata read, and realize the block(s) about to be read
> are on the badblocks list, then we do the same thing as when we discover
> other kinds of metadata corruption.

...fail and shut down? :)

Actually, for metadata either we look at the xfs_bufs to see if it's in
memory (XFS doesn't directly access metadata) and write it back out; or
we could fire up the online repair tool to rebuild the metadata.

> > If we are talking about discovering a bad block on a read operation, then
> > rarely is a modern file system able to survive, whether the block holds
> > metadata or user data. Let's imagine that we have a really mature file
> > system driver; what does it then mean to encounter a bad block? The failure
> > to read a logical block of some metadata (bad block) means that we are
> > unable to extract some part of a metadata structure. From the file system
> > driver's point of view, it looks like our file system is corrupted: we need
> > to stop the file system operations and, finally, check and recover the file
> > system volume by means of the fsck tool. If we find a bad block in some
> > user file then, again, it looks like an issue. Some file systems simply
> > return "unrecovered read error". Another one, theoretically, is able
> > to survive because of snapshots, for example. But, anyway, it will look
> > like a read-only mount state and the user will need to resolve such
> > trouble by hand.
> 
> As far as I can tell, all of these things remain the same. The goal here
> isn't to survive more NVM badblocks than we would've before, and lost
> data or lost metadata will continue to have the same consequences as
> before, and will need the same recovery actions/intervention as before.
> The goal is to make the failure model similar to what users expect
> today, and as much as possible make recovery actions too similarly
> intuitive.
> 
> > 
> > If we are talking about discovering a bad block during write operation then,
> > again, we are in trouble. Usually, we are using asynchronous model
> > of write/flush operation. We are preparing the consistent state of all our
> > metadata structures in the memory, at first. The flush operations for metadata
> > and user data can be done in different times. And what should be done if we
> > discover bad block for any piece of metadata or user data? Simple tracking of
> > bad blocks is not enough at all. Let's consider user data, at first. If we cannot
> > write some file's block successfully then we have two ways: (1) forget about
> > this piece of data; (2) try to change the associated LBA for this piece of data.
> > The operation of re-allocation LBA number for discovered bad block
> > (user data case) sounds as real pain. Because you need to rebuild the metadata
> > that track the location of this part of file. And it sounds as practically
> > impossible operation, for the case of LFS file system, for example.
> > If we have trouble with flushing any part of metadata then it sounds as
> > complete disaster for any file system.
> 
> Writes can get more complicated in certain cases. If it is a regular
> page cache writeback, or any aligned write that goes through the block
> driver, that is completely fine. The block driver will check that the
> block was previously marked as bad, do a "clear poison" operation
> (defined in the ACPI spec), which tells the firmware that the poison bit
> is now OK to be cleared, and writes the new data. This also removes the
> block from the badblocks list, and in this scheme, triggers a
> notification to the filesystem that it too can remove the block from its
> accounting. mmap writes and DAX can get more complicated, and at times
> they will just trigger a SIGBUS, and there's no way around that.
> 
> > 
> > Are you really sure that file system should process bad block issue?
> > 
> > >In contrast, today, the block driver checks against the whole block device
> > > range for every IO. On encountering badblocks, the filesystem can
> > > generate a better notification/error message that points the user to 
> > > (file, offset) as opposed to the block driver, which can only provide
> > > (block-device, sector).

<shrug> We can do the translation with the backref info...

> > > 2. The block layer adds a notifier to badblock addition/removal
> > > operations, which the filesystem subscribes to, and uses to maintain its
> > > badblocks accounting. (This part is implemented as a proof of concept in
> > > the RFC mentioned above [1]).
> > 
> > I am not sure that any bad block notification during/after IO operation
> > is valuable for file system. Maybe, it could help if file system simply will
> > know about bad block beforehand the operation of logical block allocation.
> > But what subsystem will discover bad blocks before any IO operations?
> > How file system will receive information or some bad block table?
> 
> The driver populates its badblocks lists whenever an Address Range Scrub
> is started (also via ACPI methods). This is always done at
> initialization time, so that it can build an in-memory representation of
> the badblocks. Additionally, this can also be triggered manually. And
> finally badblocks can also get populated for new latent errors when a
> machine check exception occurs. All of these can trigger notification to
> the file system without actual user reads happening.
> 
> > I am not convinced that suggested badblocks approach is really feasible.
> > Also I am not sure that file system should see the bad blocks at all.
> > Why hardware cannot manage this issue for us?
> 
> Hardware does manage the actual badblocks issue for us in the sense that
> when it discovers a badblock it will do the remapping. But since this is
> on the memory bus, and has different error signatures than applications
> are used to, we want to make the error handling similar to the existing
> storage model.

Yes please and thank you, to the "error handling similar to the existing
storage model".  Even better if this just gets added to a layer
underneath the fs so that IO to bad regions returns EIO. 8-)

(Sleeeeep...)

--D


* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17  6:33       ` Darrick J. Wong
@ 2017-01-17 21:35         ` Vishal Verma
  2017-01-17 22:15           ` Andiry Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Vishal Verma @ 2017-01-17 21:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Slava Dubeyko, linux-nvdimm@lists.01.org,
	linux-block@vger.kernel.org, Viacheslav Dubeyko, Linux FS Devel,
	lsf-pc@lists.linux-foundation.org

On 01/16, Darrick J. Wong wrote:
> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> > On 01/14, Slava Dubeyko wrote:
> > > 
> > > ---- Original Message ----
> > > Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> > > Sent: Jan 13, 2017 1:40 PM
> > > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> > > To: lsf-pc@lists.linux-foundation.org
> > > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
> > > 
> > > > The current implementation of badblocks, where we consult the badblocks
> > > > list for every IO in the block driver works, and is a last option
> > > > failsafe, but from a user perspective, it isn't the easiest interface to
> > > > work with.
> > > 
> > > As I remember, FAT and HFS+ specifications contain description of bad blocks
> > > (physical sectors) table. I believe that this table was used for the case of
> > > floppy media. But eventually this table became a completely obsolete
> > > artefact because most storage devices are reliable enough. Why do you need
> 
> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
> doesn't support(??) extents or 64-bit filesystems, and might just be a
> vestigial organ at this point.  XFS doesn't have anything to track bad
> blocks currently....
> 
> > > to expose the bad blocks at the file system level?  Do you expect that the next
> > > generation of NVM memory will be so unreliable that the file system needs to manage
> > > bad blocks? What about erasure coding schemes? Does the file system really need to suffer
> > > from the bad block issue? 
> > > 
> > > Usually, we are using LBAs and it is the responsibility of storage device to map
> > > a bad physical block/page/sector into valid one. Do you mean that we have
> > > access to physical NVM memory address directly? But it looks like that we can
> > > have a "bad block" issue even we will access data into page cache's memory
> > > page (if we will use NVM memory for page cache, of course). So, what do you
> > > imply by "bad block" issue? 
> > 
> > We don't have direct physical access to the device's address space, in
> > the sense the device is still free to perform remapping of chunks of NVM
> > underneath us. The problem is that when a block or address range (as
> > small as a cache line) goes bad, the device maintains a poison bit for
> > every affected cache line. Behind the scenes, it may have already
> > remapped the range, but the cache line poison has to be kept so that
> > there is a notification to the user/owner of the data that something has
> > been lost. Since NVM is byte addressable memory sitting on the memory
> > bus, such a poisoned cache line results in memory errors and SIGBUSes.
> > Compared to traditional storage where an app will get nice and friendly
> > (relatively speaking..) -EIOs. The whole badblocks implementation was
> > done so that the driver can intercept IO (i.e. reads) to _known_ bad
> > locations, and short-circuit them with an EIO. If the driver doesn't
> > catch these, the reads will turn into a memory bus access, and the
> > poison will cause a SIGBUS.
> 
> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
> look kind of like a traditional block device? :)

Yes, the thing that makes pmem look like a block device :) --
drivers/nvdimm/pmem.c

> 
> > This effort is to try and make this badblock checking smarter - and try
> > and reduce the penalty on every IO to a smaller range, which only the
> > filesystem can do.
> 
> Though... now that XFS merged the reverse mapping support, I've been
> wondering if there'll be a resubmission of the device errors callback?
> It still would be useful to be able to inform the user that part of
> their fs has gone bad, or, better yet, if the buffer is still in memory
> someplace else, just write it back out.
> 
> Or I suppose if we had some kind of raid1 set up between memories we
> could read one of the other copies and rewrite it into the failing
> region immediately.

Yes, that is kind of what I was hoping to accomplish via this
discussion. How much would filesystems want to be involved in this sort
of badblocks handling, if at all? I can refresh my patches that provide
the fs notification, but that's the easy bit, and a starting point.

> 
> > > > A while back, Dave Chinner had suggested a move towards smarter
> > > > handling, and I posted initial RFC patches [1], but since then the topic
> > > > hasn't really moved forward.
> > > > 
> > > > I'd like to propose and have a discussion about the following new
> > > > functionality:
> > > > 
> > > > 1. Filesystems develop a native representation of badblocks. For
> > > > example, in xfs, this would (presumably) be linked to the reverse
> > > > mapping btree. The filesystem representation has the potential to be 
> > > > more efficient than the block driver doing the check, as the fs can
> > > > check the IO happening on a file against just that file's range. 
> 
> OTOH that means we'd have to check /every/ file IO request against the
> rmapbt, which will make things reaaaaaally slow.  I suspect it might be
> preferable just to let the underlying pmem driver throw an error at us.
> 
> (Or possibly just cache the bad extents in memory.)

Interesting - this would be a good discussion to have. My motivation for
this was the reasoning that the pmem driver has to check every single IO
against badblocks, and maybe the fs can do a better job. But if you
think the fs will actually be slower, we should try to somehow benchmark
that!

> 
> > > What do you mean by "file system can check the IO happening on a file"?
> > > Do you mean a read or a write operation? What about metadata?
> > 
> > For the purpose described above, i.e. returning early EIOs when
> > possible, this will be limited to reads and metadata reads. If we're
> > about to do a metadata read, and realize the block(s) about to be read
> > are on the badblocks list, then we do the same thing as when we discover
> > other kinds of metadata corruption.
> 
> ...fail and shut down? :)
> 
> Actually, for metadata either we look at the xfs_bufs to see if it's in
> memory (XFS doesn't directly access metadata) and write it back out; or
> we could fire up the online repair tool to rebuild the metadata.

Agreed, I was just stressing that this scenario does not change from
the status quo, and really recovering from corruption isn't the problem
we're trying to solve here :)

> 
> > > If we are talking about discovering a bad block on a read operation, then
> > > rarely is a modern file system able to survive, whether the block holds
> > > metadata or user data. Let's imagine that we have a really mature file
> > > system driver; what does it then mean to encounter a bad block? The failure
> > > to read a logical block of some metadata (bad block) means that we are
> > > unable to extract some part of a metadata structure. From the file system
> > > driver's point of view, it looks like our file system is corrupted: we need
> > > to stop the file system operations and, finally, check and recover the file
> > > system volume by means of the fsck tool. If we find a bad block in some
> > > user file then, again, it looks like an issue. Some file systems simply
> > > return "unrecovered read error". Another one, theoretically, is able
> > > to survive because of snapshots, for example. But, anyway, it will look
> > > like a read-only mount state and the user will need to resolve such
> > > trouble by hand.
> > 
> > As far as I can tell, all of these things remain the same. The goal here
> > isn't to survive more NVM badblocks than we would've before, and lost
> > data or lost metadata will continue to have the same consequences as
> > before, and will need the same recovery actions/intervention as before.
> > The goal is to make the failure model similar to what users expect
> > today, and as much as possible make recovery actions too similarly
> > intuitive.
> > 
> > > 
> > > If we are talking about discovering a bad block during write operation then,
> > > again, we are in trouble. Usually, we are using asynchronous model
> > > of write/flush operation. We are preparing the consistent state of all our
> > > metadata structures in the memory, at first. The flush operations for metadata
> > > and user data can be done in different times. And what should be done if we
> > > discover bad block for any piece of metadata or user data? Simple tracking of
> > > bad blocks is not enough at all. Let's consider user data, at first. If we cannot
> > > write some file's block successfully then we have two ways: (1) forget about
> > > this piece of data; (2) try to change the associated LBA for this piece of data.
> > > The operation of re-allocation LBA number for discovered bad block
> > > (user data case) sounds as real pain. Because you need to rebuild the metadata
> > > that track the location of this part of file. And it sounds as practically
> > > impossible operation, for the case of LFS file system, for example.
> > > If we have trouble with flushing any part of metadata then it sounds as
> > > complete disaster for any file system.
> > 
> > Writes can get more complicated in certain cases. If it is a regular
> > page cache writeback, or any aligned write that goes through the block
> > driver, that is completely fine. The block driver will check that the
> > block was previously marked as bad, do a "clear poison" operation
> > (defined in the ACPI spec), which tells the firmware that the poison bit
> > is now OK to be cleared, and writes the new data. This also removes the
> > block from the badblocks list, and in this scheme, triggers a
> > notification to the filesystem that it too can remove the block from its
> > accounting. mmap writes and DAX can get more complicated, and at times
> > they will just trigger a SIGBUS, and there's no way around that.
> > 
> > > 
> > > Are you really sure that file system should process bad block issue?
> > > 
> > > >In contrast, today, the block driver checks against the whole block device
> > > > range for every IO. On encountering badblocks, the filesystem can
> > > > generate a better notification/error message that points the user to 
> > > > (file, offset) as opposed to the block driver, which can only provide
> > > > (block-device, sector).
> 
> <shrug> We can do the translation with the backref info...

Yes, we should at least do that. I'm guessing this would happen in XFS
when it gets an EIO from an IO submission? The bio submission path in
the fs is probably not synchronous (correct?), but whenever it gets the
EIO, I'm guessing we just print a loud error message after doing the
backref lookup..
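As a rough illustration of that flow (EIO comes back, do a backref
lookup, print a loud message), here is a userspace sketch. It is not
XFS code: the real reverse mapping is a btree, while this toy uses a
flat array with a linear scan, and struct toy_rmap, toy_backref() and
toy_report_eio() are made-up names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical reverse-mapping record: which file owns a run of blocks. */
struct toy_rmap {
	uint64_t start;		/* first device block of the run */
	uint64_t len;		/* number of blocks in the run */
	uint64_t ino;		/* owning inode number */
	uint64_t file_off;	/* file offset (in blocks) of 'start' */
};

/* Translate a device block to (inode, file offset); false if unowned. */
static bool toy_backref(const struct toy_rmap *recs, size_t n, uint64_t block,
			uint64_t *ino, uint64_t *off)
{
	assert(ino && off);

	for (size_t i = 0; i < n; i++) {
		if (block >= recs[i].start &&
		    block - recs[i].start < recs[i].len) {
			*ino = recs[i].ino;
			*off = recs[i].file_off + (block - recs[i].start);
			return true;
		}
	}
	return false;
}

/* Called once an IO has completed with EIO: print the loud message. */
static void toy_report_eio(const struct toy_rmap *recs, size_t n,
			   uint64_t block)
{
	uint64_t ino, off;

	if (toy_backref(recs, n, block, &ino, &off))
		fprintf(stderr,
			"EIO: block %llu belongs to inode %llu, file block %llu\n",
			(unsigned long long)block, (unsigned long long)ino,
			(unsigned long long)off);
	else
		fprintf(stderr, "EIO: block %llu has no owner\n",
			(unsigned long long)block);
}
```

Since the lookup only runs after an error has already been reported, its
cost stays off the normal IO path.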

> 
> > > > 2. The block layer adds a notifier to badblock addition/removal
> > > > operations, which the filesystem subscribes to, and uses to maintain its
> > > > badblocks accounting. (This part is implemented as a proof of concept in
> > > > the RFC mentioned above [1]).
> > > 
> > > I am not sure that any bad block notification during/after IO operation
> > > is valuable for file system. Maybe, it could help if file system simply will
> > > know about bad block beforehand the operation of logical block allocation.
> > > But what subsystem will discover bad blocks before any IO operations?
> > > How file system will receive information or some bad block table?
> > 
> > The driver populates its badblocks lists whenever an Address Range Scrub
> > is started (also via ACPI methods). This is always done at
> > initialization time, so that it can build an in-memory representation of
> > the badblocks. Additionally, this can also be triggered manually. And
> > finally badblocks can also get populated for new latent errors when a
> > machine check exception occurs. All of these can trigger notification to
> > the file system without actual user reads happening.
> > 
> > > I am not convinced that suggested badblocks approach is really feasible.
> > > Also I am not sure that file system should see the bad blocks at all.
> > > Why hardware cannot manage this issue for us?
> > 
> > Hardware does manage the actual badblocks issue for us in the sense that
> > when it discovers a badblock it will do the remapping. But since this is
> > on the memory bus, and has different error signatures than applications
> > are used to, we want to make the error handling similar to the existing
> > storage model.
> 
> Yes please and thank you, to the "error handling similar to the existing
> storage model".  Even better if this just gets added to a layer
> underneath the fs so that IO to bad regions returns EIO. 8-)

This (if this just gets added to a layer underneath the fs so that IO to bad
regions returns EIO) already happens :)  See pmem_do_bvec() in
drivers/nvdimm/pmem.c, where we return EIO for a known badblock on a
read. I'm wondering if this can be improved..
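For readers who have not looked at the driver, the behaviour can be
modelled in a few lines of userspace C. This is a toy model, not the
code in pmem_do_bvec(): it tracks 64 sectors in a bitmap, fails reads
that touch a known-bad sector with EIO, and lets a write stand in for
the "clear poison" step by dropping the range from the bad list. All
names here (struct toy_pmem, toy_read(), toy_write()) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_SECTORS 64u

/* Toy device: bit i set means sector i is on the badblocks list. */
struct toy_pmem {
	uint64_t bad;
};

/* Build a bitmask covering sectors [sector, sector + nr). */
static uint64_t toy_range_mask(unsigned int sector, unsigned int nr)
{
	uint64_t m;

	assert(nr > 0 && sector < TOY_SECTORS && nr <= TOY_SECTORS - sector);
	m = (nr == TOY_SECTORS) ? ~0ULL : ((1ULL << nr) - 1);
	return m << sector;
}

/* Reads that overlap a known bad sector are short-circuited with -EIO. */
static int toy_read(const struct toy_pmem *p, unsigned int sector,
		    unsigned int nr)
{
	if (p->bad & toy_range_mask(sector, nr))
		return -EIO;
	return 0;	/* the real driver would copy the data here */
}

/*
 * Writes overwrite the whole range, so the model treats them as
 * clearing the poison and removing the range from the bad list.
 */
static int toy_write(struct toy_pmem *p, unsigned int sector,
		     unsigned int nr)
{
	p->bad &= ~toy_range_mask(sector, nr);
	return 0;
}
```

The model leaves out the mmap/DAX case entirely, which is where the
SIGBUS behaviour discussed earlier in the thread comes from.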

> 
> (Sleeeeep...)
> 
> --D
> 

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 21:35         ` Vishal Verma
@ 2017-01-17 22:15           ` Andiry Xu
  2017-01-17 22:37             ` Vishal Verma
  2017-01-18  0:16             ` Andreas Dilger
  0 siblings, 2 replies; 19+ messages in thread
From: Andiry Xu @ 2017-01-17 22:15 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block@vger.kernel.org, Viacheslav Dubeyko, Linux FS Devel,
	lsf-pc@lists.linux-foundation.org

Hi,

On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> On 01/16, Darrick J. Wong wrote:
>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>> > On 01/14, Slava Dubeyko wrote:
>> > >
>> > > ---- Original Message ----
>> > > Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>> > > Sent: Jan 13, 2017 1:40 PM
>> > > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
>> > > To: lsf-pc@lists.linux-foundation.org
>> > > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
>> > >
>> > > > The current implementation of badblocks, where we consult the badblocks
>> > > > list for every IO in the block driver works, and is a last option
>> > > > failsafe, but from a user perspective, it isn't the easiest interface to
>> > > > work with.
>> > >
>> > > As I remember, FAT and HFS+ specifications contain description of bad blocks
>> > > (physical sectors) table. I believe that this table was used for the case of
>> > > floppy media. But eventually this table became a completely obsolete
>> > > artefact because most storage devices are reliable enough. Why do you need
>>
>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>> doesn't support(??) extents or 64-bit filesystems, and might just be a
>> vestigial organ at this point.  XFS doesn't have anything to track bad
>> blocks currently....
>>
>> > > to expose the bad blocks at the file system level?  Do you expect that the next
>> > > generation of NVM memory will be so unreliable that the file system needs to manage
>> > > bad blocks? What about erasure coding schemes? Does the file system really need to suffer
>> > > from the bad block issue?
>> > >
>> > > Usually, we are using LBAs and it is the responsibility of storage device to map
>> > > a bad physical block/page/sector into valid one. Do you mean that we have
>> > > access to physical NVM memory address directly? But it looks like that we can
>> > > have a "bad block" issue even we will access data into page cache's memory
>> > > page (if we will use NVM memory for page cache, of course). So, what do you
>> > > imply by "bad block" issue?
>> >
>> > We don't have direct physical access to the device's address space, in
>> > the sense the device is still free to perform remapping of chunks of NVM
>> > underneath us. The problem is that when a block or address range (as
>> > small as a cache line) goes bad, the device maintains a poison bit for
>> > every affected cache line. Behind the scenes, it may have already
>> > remapped the range, but the cache line poison has to be kept so that
>> > there is a notification to the user/owner of the data that something has
>> > been lost. Since NVM is byte addressable memory sitting on the memory
>> > bus, such a poisoned cache line results in memory errors and SIGBUSes.
>> > Compared to traditional storage where an app will get nice and friendly
>> > (relatively speaking..) -EIOs. The whole badblocks implementation was
>> > done so that the driver can intercept IO (i.e. reads) to _known_ bad
>> > locations, and short-circuit them with an EIO. If the driver doesn't
>> > catch these, the reads will turn into a memory bus access, and the
>> > poison will cause a SIGBUS.
>>
>> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
>> look kind of like a traditional block device? :)
>
> Yes, the thing that makes pmem look like a block device :) --
> drivers/nvdimm/pmem.c
>
>>
>> > This effort is to try and make this badblock checking smarter - and try
>> > and reduce the penalty on every IO to a smaller range, which only the
>> > filesystem can do.
>>
>> Though... now that XFS merged the reverse mapping support, I've been
>> wondering if there'll be a resubmission of the device errors callback?
>> It still would be useful to be able to inform the user that part of
>> their fs has gone bad, or, better yet, if the buffer is still in memory
>> someplace else, just write it back out.
>>
>> Or I suppose if we had some kind of raid1 set up between memories we
>> could read one of the other copies and rewrite it into the failing
>> region immediately.
>
> Yes, that is kind of what I was hoping to accomplish via this
> discussion. How much would filesystems want to be involved in this sort
> of badblocks handling, if at all? I can refresh my patches that provide
> the fs notification, but that's the easy bit, and a starting point.
>

I have some questions. Why would moving badblock handling to the file
system level avoid the checking phase? At the file system level I still
have to check the badblock list for each I/O, right? Do you mean that
during mount it can go through the pmem device, locate all the data
structures mangled by badblocks, and handle them accordingly, so that
during normal running the badblocks will never be accessed? Or, if there
is replication/snapshot support, use a copy to recover the badblocks?

What about operations that bypass the file system, e.g. mmap?


>>
>> > > > A while back, Dave Chinner had suggested a move towards smarter
>> > > > handling, and I posted initial RFC patches [1], but since then the topic
>> > > > hasn't really moved forward.
>> > > >
>> > > > I'd like to propose and have a discussion about the following new
>> > > > functionality:
>> > > >
>> > > > 1. Filesystems develop a native representation of badblocks. For
>> > > > example, in xfs, this would (presumably) be linked to the reverse
>> > > > mapping btree. The filesystem representation has the potential to be
>> > > > more efficient than the block driver doing the check, as the fs can
>> > > > check the IO happening on a file against just that file's range.
>>
>> OTOH that means we'd have to check /every/ file IO request against the
>> rmapbt, which will make things reaaaaaally slow.  I suspect it might be
>> preferable just to let the underlying pmem driver throw an error at us.
>>
>> (Or possibly just cache the bad extents in memory.)
>
> Interesting - this would be a good discussion to have. My motivation for
> this was the reasoning that the pmem driver has to check every single IO
> against badblocks, and maybe the fs can do a better job. But if you
> think the fs will actually be slower, we should try to somehow benchmark
> that!
>
>>
>> > > What do you mean by "file system can check the IO happening on a file"?
>> > > Do you mean a read or a write operation? What about metadata?
>> >
>> > For the purpose described above, i.e. returning early EIOs when
>> > possible, this will be limited to reads and metadata reads. If we're
>> > about to do a metadata read, and realize the block(s) about to be read
>> > are on the badblocks list, then we do the same thing as when we discover
>> > other kinds of metadata corruption.
>>
>> ...fail and shut down? :)
>>
>> Actually, for metadata either we look at the xfs_bufs to see if it's in
>> memory (XFS doesn't directly access metadata) and write it back out; or
>> we could fire up the online repair tool to rebuild the metadata.
>
> Agreed, I was just stressing that this scenario does not change from
> the status quo, and really recovering from corruption isn't the problem
> we're trying to solve here :)
>
>>
>> > > If we are talking about discovering a bad block on a read operation, then
>> > > rarely is a modern file system able to survive, whether the block holds
>> > > metadata or user data. Let's imagine that we have a really mature file
>> > > system driver; what does it then mean to encounter a bad block? The failure
>> > > to read a logical block of some metadata (bad block) means that we are
>> > > unable to extract some part of a metadata structure. From the file system
>> > > driver's point of view, it looks like our file system is corrupted: we need
>> > > to stop the file system operations and, finally, check and recover the file
>> > > system volume by means of the fsck tool. If we find a bad block in some
>> > > user file then, again, it looks like an issue. Some file systems simply
>> > > return "unrecovered read error". Another one, theoretically, is able
>> > > to survive because of snapshots, for example. But, anyway, it will look
>> > > like a read-only mount state and the user will need to resolve such
>> > > trouble by hand.
>> >
>> > As far as I can tell, all of these things remain the same. The goal here
>> > isn't to survive more NVM badblocks than we would've before, and lost
>> > data or lost metadata will continue to have the same consequences as
>> > before, and will need the same recovery actions/intervention as before.
>> > The goal is to make the failure model similar to what users expect
>> > today, and as much as possible make recovery actions too similarly
>> > intuitive.
>> >
>> > >
>> > > If we are talking about discovering a bad block during write operation then,
>> > > again, we are in trouble. Usually, we are using asynchronous model
>> > > of write/flush operation. We are preparing the consistent state of all our
>> > > metadata structures in the memory, at first. The flush operations for metadata
>> > > and user data can be done in different times. And what should be done if we
>> > > discover bad block for any piece of metadata or user data? Simple tracking of
>> > > bad blocks is not enough at all. Let's consider user data, at first. If we cannot
>> > > write some file's block successfully then we have two ways: (1) forget about
>> > > this piece of data; (2) try to change the associated LBA for this piece of data.
>> > > The operation of re-allocation LBA number for discovered bad block
>> > > (user data case) sounds as real pain. Because you need to rebuild the metadata
>> > > that track the location of this part of file. And it sounds as practically
>> > > impossible operation, for the case of LFS file system, for example.
>> > > If we have trouble with flushing any part of metadata then it sounds as
>> > > complete disaster for any file system.
>> >
>> > Writes can get more complicated in certain cases. If it is a regular
>> > page cache writeback, or any aligned write that goes through the block
>> > driver, that is completely fine. The block driver will check that the
>> > block was previously marked as bad, do a "clear poison" operation
>> > (defined in the ACPI spec), which tells the firmware that the poison bit
>> > is now OK to be cleared, and writes the new data. This also removes the
>> > block from the badblocks list, and in this scheme, triggers a
>> > notification to the filesystem that it too can remove the block from its
>> > accounting. mmap writes and DAX can get more complicated, and at times
>> > they will just trigger a SIGBUS, and there's no way around that.
>> >
>> > >
>> > > Are you really sure that file system should process bad block issue?
>> > >
>> > > >In contrast, today, the block driver checks against the whole block device
>> > > > range for every IO. On encountering badblocks, the filesystem can
>> > > > generate a better notification/error message that points the user to
>> > > > (file, offset) as opposed to the block driver, which can only provide
>> > > > (block-device, sector).
>>
>> <shrug> We can do the translation with the backref info...
>
> Yes we should at least do that. I'm guessing this would happen in XFS
> when it gets an EIO from an IO submission? The bio submission path in
> the fs is probably not synchronous (correct?), but whenever it gets the
> EIO, I'm guessing we just print a loud error message after doing the
> backref lookup..
>
>>
>> > > > 2. The block layer adds a notifier to badblock addition/removal
>> > > > operations, which the filesystem subscribes to, and uses to maintain its
>> > > > badblocks accounting. (This part is implemented as a proof of concept in
>> > > > the RFC mentioned above [1]).
>> > >
>> > > I am not sure that any bad block notification during/after IO operation
>> > > is valuable for file system. Maybe, it could help if file system simply will
>> > > know about bad block beforehand the operation of logical block allocation.
>> > > But what subsystem will discover bad blocks before any IO operations?
>> > > How file system will receive information or some bad block table?
>> >
>> > The driver populates its badblocks lists whenever an Address Range Scrub
>> > is started (also via ACPI methods). This is always done at
>> > initialization time, so that it can build an in-memory representation of
>> > the badblocks. Additionally, this can also be triggered manually. And
>> > finally badblocks can also get populated for new latent errors when a
>> > machine check exception occurs. All of these can trigger notification to
>> > the file system without actual user reads happening.
>> >
>> > > I am not convinced that suggested badblocks approach is really feasible.
>> > > Also I am not sure that file system should see the bad blocks at all.
>> > > Why hardware cannot manage this issue for us?
>> >
>> > Hardware does manage the actual badblocks issue for us in the sense that
>> > when it discovers a badblock it will do the remapping. But since this is
>> > on the memory bus, and has different error signatures than applications
>> > are used to, we want to make the error handling similar to the existing
>> > storage model.
>>
>> Yes please and thank you, to the "error handling similar to the existing
>> storage model".  Even better if this just gets added to a layer
>> underneath the fs so that IO to bad regions returns EIO. 8-)
>
> This (if this just gets added to a layer underneath the fs so that IO to bad
> regions returns EIO) already happens :)  See pmem_do_bvec() in
> drivers/nvdimm/pmem.c, where we return EIO for a known badblock on a
> read. I'm wondering if this can be improved..
>

The pmem_do_bvec() read logic is like this:

pmem_do_bvec()
    if (is_bad_pmem())
        return -EIO;
    else
        memcpy_from_pmem();

Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
that even if a block is not in the badblock list, it still can be bad
and causes MCE? Does the badblock list get changed during file system
running? If that is the case, should the file system get a
notification when it gets changed? If a block is good when I first
read it, can I still trust it to be good for the second access?

Thanks,
Andiry

>>
>> (Sleeeeep...)
>>
>> --D
>>
>> >


* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 22:15           ` Andiry Xu
@ 2017-01-17 22:37             ` Vishal Verma
  2017-01-17 23:20               ` Andiry Xu
  2017-01-18  0:16             ` Andreas Dilger
  1 sibling, 1 reply; 19+ messages in thread
From: Vishal Verma @ 2017-01-17 22:37 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block@vger.kernel.org, Viacheslav Dubeyko, Linux FS Devel,
	lsf-pc@lists.linux-foundation.org

On 01/17, Andiry Xu wrote:
> Hi,
> 
> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> > On 01/16, Darrick J. Wong wrote:
> >> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> >> > On 01/14, Slava Dubeyko wrote:
> >> > >
> >> > > ---- Original Message ----
> >> > > Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> >> > > Sent: Jan 13, 2017 1:40 PM
> >> > > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> >> > > To: lsf-pc@lists.linux-foundation.org
> >> > > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
> >> > >
> >> > > > The current implementation of badblocks, where we consult the badblocks
> >> > > > list for every IO in the block driver works, and is a last option
> >> > > > failsafe, but from a user perspective, it isn't the easiest interface to
> >> > > > work with.
> >> > >
> >> > > As I remember, FAT and HFS+ specifications contain description of bad blocks
> >> > > (physical sectors) table. I believe that this table was used for the case of
> >> > > floppy media. But, finally, this table becomes to be the completely obsolete
> >> > > artefact because mostly storage devices are reliably enough. Why do you need
> >>
> >> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
> >> doesn't support(??) extents or 64-bit filesystems, and might just be a
> >> vestigial organ at this point.  XFS doesn't have anything to track bad
> >> blocks currently....
> >>
> >> > > in exposing the bad blocks on the file system level?  Do you expect that next
> >> > > generation of NVM memory will be so unreliable that file system needs to manage
> >> > > bad blocks? What's about erasure coding schemes? Do file system really need to suffer
> >> > > from the bad block issue?
> >> > >
> >> > > Usually, we are using LBAs and it is the responsibility of storage device to map
> >> > > a bad physical block/page/sector into valid one. Do you mean that we have
> >> > > access to physical NVM memory address directly? But it looks like that we can
> >> > > have a "bad block" issue even we will access data into page cache's memory
> >> > > page (if we will use NVM memory for page cache, of course). So, what do you
> >> > > imply by "bad block" issue?
> >> >
> >> > We don't have direct physical access to the device's address space, in
> >> > the sense the device is still free to perform remapping of chunks of NVM
> >> > underneath us. The problem is that when a block or address range (as
> >> > small as a cache line) goes bad, the device maintains a poison bit for
> >> > every affected cache line. Behind the scenes, it may have already
> >> > remapped the range, but the cache line poison has to be kept so that
> >> > there is a notification to the user/owner of the data that something has
> >> > been lost. Since NVM is byte addressable memory sitting on the memory
> >> > bus, such a poisoned cache line results in memory errors and SIGBUSes.
> >> > Compared to traditional storage where an app will get nice and friendly
> >> > (relatively speaking..) -EIOs. The whole badblocks implementation was
> >> > done so that the driver can intercept IO (i.e. reads) to _known_ bad
> >> > locations, and short-circuit them with an EIO. If the driver doesn't
> >> > catch these, the reads will turn into a memory bus access, and the
> >> > poison will cause a SIGBUS.
> >>
> >> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
> >> look kind of like a traditional block device? :)
> >
> > Yes, the thing that makes pmem look like a block device :) --
> > drivers/nvdimm/pmem.c
> >
> >>
> >> > This effort is to try and make this badblock checking smarter - and try
> >> > and reduce the penalty on every IO to a smaller range, which only the
> >> > filesystem can do.
> >>
> >> Though... now that XFS merged the reverse mapping support, I've been
> >> wondering if there'll be a resubmission of the device errors callback?
> >> It still would be useful to be able to inform the user that part of
> >> their fs has gone bad, or, better yet, if the buffer is still in memory
> >> someplace else, just write it back out.
> >>
> >> Or I suppose if we had some kind of raid1 set up between memories we
> >> could read one of the other copies and rewrite it into the failing
> >> region immediately.
> >
> > Yes, that is kind of what I was hoping to accomplish via this
> > discussion. How much would filesystems want to be involved in this sort
> > of badblocks handling, if at all. I can refresh my patches that provide
> > the fs notification, but that's the easy bit, and a starting point.
> >
> 
> I have some questions. Why moving badblock handling to file system
> level avoid the checking phase? In file system level for each I/O I
> still have to check the badblock list, right? Do you mean during mount
> it can go through the pmem device and locates all the data structures
> mangled by badblocks and handle them accordingly, so that during
> normal running the badblocks will never be accessed? Or, if there is
> replication/snapshot support, use a copy to recover the badblocks?
> 
> How about operations bypass the file system, i.e. mmap?

Hi Andiry,

I do mean that in the filesystem, for every IO, the badblocks will be
checked. Currently, the pmem driver does this, and the hope is that the
filesystem can do a better job at it. The driver unconditionally checks
every IO for badblocks on the whole device. Depending on how the
badblocks are represented in the filesystem, we might be able to quickly
tell if a file/range has existing badblocks, and error out the IO
accordingly.
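
As a rough illustration of the kind of check I mean (a userspace sketch,
not kernel code): if the bad ranges are kept as a sorted, non-overlapping
table, testing an IO against them is a binary search. The struct and
function names below are made up for this example. The same function works
on a device-wide table or on a much smaller per-file one; the hoped-for win
is only in how many entries have to be searched, and whether that actually
pays off is exactly what would need benchmarking.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one contiguous run of bad sectors. */
struct bad_extent {
    uint64_t start; /* first bad sector */
    uint64_t len;   /* number of bad sectors, must be > 0 */
};

/*
 * Return true if the IO range [sector, sector + nr) overlaps any entry.
 * Assumes tbl[] is sorted by start and its entries do not overlap, so
 * the end of each entry is also in increasing order.
 */
bool range_has_badblocks(const struct bad_extent *tbl, size_t n,
                         uint64_t sector, uint64_t nr)
{
    size_t lo = 0, hi = n;

    if (nr == 0)
        return false;

    /* Binary search for the first extent that ends after 'sector'. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].start + tbl[mid].len <= sector)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* That extent overlaps only if it starts before the IO ends. */
    return lo < n && tbl[lo].start < sector + nr;
}
```

With a table holding {start 10, len 5} and {start 100, len 1}, an IO of 10
sectors at sector 0 passes, while an IO of 11 sectors at sector 0 touches
sector 10 and would be failed.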

At mount, the fs would read the existing badblocks on the block
device, and build its own representation of them. Then during normal
use, if the underlying badblocks change, the fs would get a notification
that would allow it to also update its own representation.
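
A minimal sketch of that bookkeeping, assuming a hypothetical event
interface (BB_ADD / BB_CLEAR) and a flat fixed-size table of sectors. None
of these names come from the RFC; its actual notifier API, and whatever
structure a real filesystem would use (for xfs, presumably something tied
to the rmapbt), are not shown here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FS_BB_MAX 64 /* arbitrary capacity for this sketch */

/* Hypothetical event types delivered by the block layer. */
enum bb_event { BB_ADD, BB_CLEAR };

/* The filesystem's private copy of the bad sector list. */
struct fs_badblocks {
    uint64_t sectors[FS_BB_MAX];
    size_t count;
};

/* Mount time: snapshot the device's current list (truncated if too long). */
void fs_bb_init(struct fs_badblocks *fs, const uint64_t *dev, size_t n)
{
    fs->count = 0;
    for (size_t i = 0; i < n && fs->count < FS_BB_MAX; i++)
        fs->sectors[fs->count++] = dev[i];
}

bool fs_bb_contains(const struct fs_badblocks *fs, uint64_t sector)
{
    for (size_t i = 0; i < fs->count; i++)
        if (fs->sectors[i] == sector)
            return true;
    return false;
}

/* Runtime: apply one notification. Returns 0, or -1 if the table is full. */
int fs_bb_notify(struct fs_badblocks *fs, enum bb_event ev, uint64_t sector)
{
    if (ev == BB_ADD) {
        if (fs_bb_contains(fs, sector))
            return 0;
        if (fs->count == FS_BB_MAX)
            return -1;
        fs->sectors[fs->count++] = sector;
        return 0;
    }

    /* BB_CLEAR: drop the entry by moving the last one into its slot. */
    for (size_t i = 0; i < fs->count; i++) {
        if (fs->sectors[i] == sector) {
            fs->sectors[i] = fs->sectors[--fs->count];
            break;
        }
    }
    return 0;
}
```

The point is only the lifecycle: one bulk copy at mount, then incremental
add/clear updates driven by the device, so the fs never has to rescan.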

Yes, if there is replication etc support in the filesystem, we could try
to recover using that, but I haven't put much thought in that direction.

Like I said in a previous reply, mmap can be a tricky case, and other
than handling the machine check exception, there may not be anything
else we can do.. 
If the range we're faulting on has known errors in badblocks, the fault
will fail with SIGBUS (see where pmem_direct_access() fails due to
badblocks). For latent errors that are not known in badblocks, if the
platform has MCE recovery, there is an MCE handler for pmem currently,
that will add that address to badblocks. If MCE recovery is absent, then
the system will crash/reboot, and the next time the driver populates
badblocks, that address will appear in it.
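
To spell out those cases, here is a small decision function. It is only a
restatement of the paragraph above in C, with names invented for the
example; it does not correspond to any kernel function.

```c
#include <stdbool.h>

/* Hypothetical labels for the outcomes described above. */
enum fault_outcome {
    FAULT_OK,               /* range is fine, access proceeds */
    FAULT_SIGBUS,           /* known badblock: the fault fails with SIGBUS */
    FAULT_MCE_RECORDED,     /* latent error: MCE handler adds it to badblocks */
    FAULT_CRASH_THEN_SCRUB  /* latent error, no recovery: crash/reboot, and
                               the address shows up when badblocks is next
                               populated */
};

enum fault_outcome dax_access_outcome(bool known_bad, bool media_poisoned,
                                      bool mce_recovery)
{
    if (known_bad)
        return FAULT_SIGBUS;
    if (!media_poisoned)
        return FAULT_OK;
    return mce_recovery ? FAULT_MCE_RECORDED : FAULT_CRASH_THEN_SCRUB;
}
```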
> 
> >>
> >> > > > A while back, Dave Chinner had suggested a move towards smarter
> >> > > > handling, and I posted initial RFC patches [1], but since then the topic
> >> > > > hasn't really moved forward.
> >> > > >
> >> > > > I'd like to propose and have a discussion about the following new
> >> > > > functionality:
> >> > > >
> >> > > > 1. Filesystems develop a native representation of badblocks. For
> >> > > > example, in xfs, this would (presumably) be linked to the reverse
> >> > > > mapping btree. The filesystem representation has the potential to be
> >> > > > more efficient than the block driver doing the check, as the fs can
> >> > > > check the IO happening on a file against just that file's range.
> >>
> >> OTOH that means we'd have to check /every/ file IO request against the
> >> rmapbt, which will make things reaaaaaally slow.  I suspect it might be
> >> preferable just to let the underlying pmem driver throw an error at us.
> >>
> >> (Or possibly just cache the bad extents in memory.)
> >
> > Interesting - this would be a good discussion to have. My motivation for
> > this was the reasoning that the pmem driver has to check every single IO
> > against badblocks, and maybe the fs can do a better job. But if you
> > think the fs will actually be slower, we should try to somehow benchmark
> > that!
> >
> >>
> >> > > What do you mean by "file system can check the IO happening on a file"?
> >> > > Do you mean read or write operation? What's about metadata?
> >> >
> >> > For the purpose described above, i.e. returning early EIOs when
> >> > possible, this will be limited to reads and metadata reads. If we're
> >> > about to do a metadata read, and realize the block(s) about to be read
> >> > are on the badblocks list, then we do the same thing as when we discover
> >> > other kinds of metadata corruption.
> >>
> >> ...fail and shut down? :)
> >>
> >> Actually, for metadata either we look at the xfs_bufs to see if it's in
> >> memory (XFS doesn't directly access metadata) and write it back out; or
> >> we could fire up the online repair tool to rebuild the metadata.
> >
> > Agreed, I was just stressing that this scenario does not change from
> > status quo, and really recovering from corruption isn't the problem
> > we're trying to solve here :)
> >
> >>
> >> > > If we are talking about the discovering a bad block on read operation then
> >> > > rare modern file system is able to survive as for the case of metadata as
> >> > > for the case of user data. Let's imagine that we have really mature file
> >> > > system driver then what does it mean to encounter a bad block? The failure
> >> > > to read a logical block of some metadata (bad block) means that we are
> >> > > unable to extract some part of a metadata structure. From file system
> >> > > driver point of view, it looks like that our file system is corrupted, we need
> >> > > to stop the file system operations and, finally, to check and recover file
> >> > > system volume by means of fsck tool. If we find a bad block for some
> >> > > user file then, again, it looks like an issue. Some file systems simply
> >> > > return "unrecovered read error". Another one, theoretically, is able
> >> > > to survive because of snapshots, for example. But, anyway, it will look
> >> > > like as Read-Only mount state and the user will need to resolve such
> >> > > trouble by hands.
> >> >
> >> > As far as I can tell, all of these things remain the same. The goal here
> >> > isn't to survive more NVM badblocks than we would've before, and lost
> >> > data or lost metadata will continue to have the same consequences as
> >> > before, and will need the same recovery actions/intervention as before.
> >> > The goal is to make the failure model similar to what users expect
> >> > today, and as much as possible make recovery actions too similarly
> >> > intuitive.
> >> >
> >> > >
> >> > > If we are talking about discovering a bad block during write operation then,
> >> > > again, we are in trouble. Usually, we are using asynchronous model
> >> > > of write/flush operation. We are preparing the consistent state of all our
> >> > > metadata structures in the memory, at first. The flush operations for metadata
> >> > > and user data can be done in different times. And what should be done if we
> >> > > discover bad block for any piece of metadata or user data? Simple tracking of
> >> > > bad blocks is not enough at all. Let's consider user data, at first. If we cannot
> >> > > write some file's block successfully then we have two ways: (1) forget about
> >> > > this piece of data; (2) try to change the associated LBA for this piece of data.
> >> > > The operation of re-allocation LBA number for discovered bad block
> >> > > (user data case) sounds as real pain. Because you need to rebuild the metadata
> >> > > that track the location of this part of file. And it sounds as practically
> >> > > impossible operation, for the case of LFS file system, for example.
> >> > > If we have trouble with flushing any part of metadata then it sounds as
> >> > > complete disaster for any file system.
> >> >
> >> > Writes can get more complicated in certain cases. If it is a regular
> >> > page cache writeback, or any aligned write that goes through the block
> >> > driver, that is completely fine. The block driver will check that the
> >> > block was previously marked as bad, do a "clear poison" operation
> >> > (defined in the ACPI spec), which tells the firmware that the poison bit
> >> > is now OK to be cleared, and writes the new data. This also removes the
> >> > block from the badblocks list, and in this scheme, triggers a
> >> > notification to the filesystem that it too can remove the block from its
> >> > accounting. mmap writes and DAX can get more complicated, and at times
> >> > they will just trigger a SIGBUS, and there's no way around that.
> >> >
> >> > >
> >> > > Are you really sure that file system should process bad block issue?
> >> > >
> >> > > >In contrast, today, the block driver checks against the whole block device
> >> > > > range for every IO. On encountering badblocks, the filesystem can
> >> > > > generate a better notification/error message that points the user to
> >> > > > (file, offset) as opposed to the block driver, which can only provide
> >> > > > (block-device, sector).
> >>
> >> <shrug> We can do the translation with the backref info...
> >
> > Yes we should at least do that. I'm guessing this would happen in XFS
> > when it gets an EIO from an IO submission? The bio submission path in
> > the fs is probably not synchronous (correct?), but whenever it gets the
> > EIO, I'm guessing we just print a loud error message after doing the
> > backref lookup..
> >
> >>
> >> > > > 2. The block layer adds a notifier to badblock addition/removal
> >> > > > operations, which the filesystem subscribes to, and uses to maintain its
> >> > > > badblocks accounting. (This part is implemented as a proof of concept in
> >> > > > the RFC mentioned above [1]).
> >> > >
> >> > > I am not sure that any bad block notification during/after IO operation
> >> > > is valuable for file system. Maybe, it could help if file system simply will
> >> > > know about bad block beforehand the operation of logical block allocation.
> >> > > But what subsystem will discover bad blocks before any IO operations?
> >> > > How file system will receive information or some bad block table?
> >> >
> >> > The driver populates its badblocks lists whenever an Address Range Scrub
> >> > is started (also via ACPI methods). This is always done at
> >> > initialization time, so that it can build an in-memory representation of
> >> > the badblocks. Additionally, this can also be triggered manually. And
> >> > finally badblocks can also get populated for new latent errors when a
> >> > machine check exception occurs. All of these can trigger notification to
> >> > the file system without actual user reads happening.
> >> >
> >> > > I am not convinced that suggested badblocks approach is really feasible.
> >> > > Also I am not sure that file system should see the bad blocks at all.
> >> > > Why hardware cannot manage this issue for us?
> >> >
> >> > Hardware does manage the actual badblocks issue for us in the sense that
> >> > when it discovers a badblock it will do the remapping. But since this is
> >> > on the memory bus, and has different error signatures than applications
> >> > are used to, we want to make the error handling similar to the existing
> >> > storage model.
> >>
> >> Yes please and thank you, to the "error handling similar to the existing
> >> storage model".  Even better if this just gets added to a layer
> >> underneath the fs so that IO to bad regions returns EIO. 8-)
> >
> > This (if this just gets added to a layer underneath the fs so that IO to bad
> > regions returns EIO) already happens :)  See pmem_do_bvec() in
> > drivers/nvdimm/pmem.c, where we return EIO for a known badblock on a
> > read. I'm wondering if this can be improved..
> >
> 
> The pmem_do_bvec() read logic is like this:
> 
> pmem_do_bvec()
>     if (is_bad_pmem())
>         return -EIO;
>     else
>         memcpy_from_pmem();
> 
> Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
> that even if a block is not in the badblock list, it still can be bad
> and causes MCE? Does the badblock list get changed during file system
> running? If that is the case, should the file system get a
> notification when it gets changed? If a block is good when I first
> read it, can I still trust it to be good for the second access?

Yes, if a block is not in the badblocks list, it can still cause an
MCE. This is the latent error case I described above. For a simple read()
via the pmem driver, this will get handled by memcpy_mcsafe. For mmap,
an MCE is inevitable.

Yes the badblocks list may change while a filesystem is running. The RFC
patches[1] I linked to add a notification for the filesystem when this
happens.

No, if the media, for some reason, 'develops' a bad cell, a second
consecutive read does have a chance of being bad. Once a location has
been marked as bad, it will stay bad till the ACPI clear error 'DSM' has
been called to mark it as clean.
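
Putting the three answers together, here is a toy userspace model of the
life cycle of a bad location. Everything in it (toy_pmem, toy_read,
toy_write, the sector sizes) is invented for illustration. In particular,
the model records a latent error into its list directly on a failed copy,
standing in for the MCE handler, and clears the poison on a write to a
known-bad sector, standing in for the clear error DSM.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NSECT 8
#define SECT_SIZE 16

struct toy_pmem {
    unsigned char data[NSECT][SECT_SIZE];
    bool poisoned[NSECT];     /* media state, unknown until touched */
    bool in_badblocks[NSECT]; /* the driver's list of known errors */
};

/* Stand-in for a machine-check-safe copy: reports failure, no trap. */
static int toy_copy_mcsafe(void *dst, const struct toy_pmem *p, size_t s)
{
    if (p->poisoned[s])
        return -EIO;
    memcpy(dst, p->data[s], SECT_SIZE);
    return 0;
}

int toy_read(struct toy_pmem *p, size_t s, void *dst)
{
    int rc;

    if (s >= NSECT)
        return -EINVAL;
    if (p->in_badblocks[s])
        return -EIO; /* known error: short-circuit, media not touched */

    rc = toy_copy_mcsafe(dst, p, s);
    if (rc)
        p->in_badblocks[s] = true; /* latent error becomes a known one */
    return rc;
}

int toy_write(struct toy_pmem *p, size_t s, const void *src)
{
    if (s >= NSECT)
        return -EINVAL;
    if (p->in_badblocks[s]) {
        p->poisoned[s] = false;     /* model of clearing the error */
        p->in_badblocks[s] = false; /* and dropping it from the list */
    }
    memcpy(p->data[s], src, SECT_SIZE);
    return 0;
}
```

In this model a sector that read fine once can fail on the next read if
the media goes bad in between, keeps failing from then on, and only reads
successfully again after a write has cleared it.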

[1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html


> 
> Thanks,
> Andiry
> 
> >>
> >> (Sleeeeep...)
> >>
> >> --D
> >>
> >> >


* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 22:37             ` Vishal Verma
@ 2017-01-17 23:20               ` Andiry Xu
  2017-01-17 23:51                 ` Vishal Verma
  0 siblings, 1 reply; 19+ messages in thread
From: Andiry Xu @ 2017-01-17 23:20 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block@vger.kernel.org, Viacheslav Dubeyko, Linux FS Devel,
	lsf-pc@lists.linux-foundation.org

On Tue, Jan 17, 2017 at 2:37 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> On 01/17, Andiry Xu wrote:
>> Hi,
>>
>> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
>> > On 01/16, Darrick J. Wong wrote:
>> >> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>> >> > On 01/14, Slava Dubeyko wrote:
>> >> > >
>> >> > > ---- Original Message ----
>> >> > > Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>> >> > > Sent: Jan 13, 2017 1:40 PM
>> >> > > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
>> >> > > To: lsf-pc@lists.linux-foundation.org
>> >> > > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
>> >> > >
>> >> > > > The current implementation of badblocks, where we consult the badblocks
>> >> > > > list for every IO in the block driver works, and is a last option
>> >> > > > failsafe, but from a user perspective, it isn't the easiest interface to
>> >> > > > work with.
>> >> > >
>> >> > > As I remember, FAT and HFS+ specifications contain description of bad blocks
>> >> > > (physical sectors) table. I believe that this table was used for the case of
>> >> > > floppy media. But, finally, this table becomes to be the completely obsolete
>> >> > > artefact because mostly storage devices are reliably enough. Why do you need
>> >>
>> >> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>> >> doesn't support(??) extents or 64-bit filesystems, and might just be a
>> >> vestigial organ at this point.  XFS doesn't have anything to track bad
>> >> blocks currently....
>> >>
>> >> > > in exposing the bad blocks on the file system level?  Do you expect that next
>> >> > > generation of NVM memory will be so unreliable that file system needs to manage
>> >> > > bad blocks? What's about erasure coding schemes? Do file system really need to suffer
>> >> > > from the bad block issue?
>> >> > >
>> >> > > Usually, we are using LBAs and it is the responsibility of storage device to map
>> >> > > a bad physical block/page/sector into valid one. Do you mean that we have
>> >> > > access to physical NVM memory address directly? But it looks like that we can
>> >> > > have a "bad block" issue even we will access data into page cache's memory
>> >> > > page (if we will use NVM memory for page cache, of course). So, what do you
>> >> > > imply by "bad block" issue?
>> >> >
>> >> > We don't have direct physical access to the device's address space, in
>> >> > the sense the device is still free to perform remapping of chunks of NVM
>> >> > underneath us. The problem is that when a block or address range (as
>> >> > small as a cache line) goes bad, the device maintains a poison bit for
>> >> > every affected cache line. Behind the scenes, it may have already
>> >> > remapped the range, but the cache line poison has to be kept so that
>> >> > there is a notification to the user/owner of the data that something has
>> >> > been lost. Since NVM is byte addressable memory sitting on the memory
>> >> > bus, such a poisoned cache line results in memory errors and SIGBUSes.
>> >> > Compared to traditional storage where an app will get nice and friendly
>> >> > (relatively speaking..) -EIOs. The whole badblocks implementation was
>> >> > done so that the driver can intercept IO (i.e. reads) to _known_ bad
>> >> > locations, and short-circuit them with an EIO. If the driver doesn't
>> >> > catch these, the reads will turn into a memory bus access, and the
>> >> > poison will cause a SIGBUS.
>> >>
>> >> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
>> >> look kind of like a traditional block device? :)
>> >
>> > Yes, the thing that makes pmem look like a block device :) --
>> > drivers/nvdimm/pmem.c
>> >
>> >>
>> >> > This effort is to try and make this badblock checking smarter - and try
>> >> > and reduce the penalty on every IO to a smaller range, which only the
>> >> > filesystem can do.
>> >>
>> >> Though... now that XFS merged the reverse mapping support, I've been
>> >> wondering if there'll be a resubmission of the device errors callback?
>> >> It still would be useful to be able to inform the user that part of
>> >> their fs has gone bad, or, better yet, if the buffer is still in memory
>> >> someplace else, just write it back out.
>> >>
>> >> Or I suppose if we had some kind of raid1 set up between memories we
>> >> could read one of the other copies and rewrite it into the failing
>> >> region immediately.
>> >
>> > Yes, that is kind of what I was hoping to accomplish via this
>> > discussion. How much would filesystems want to be involved in this sort
>> > of badblocks handling, if at all. I can refresh my patches that provide
>> > the fs notification, but that's the easy bit, and a starting point.
>> >
>>
>> I have some questions. Why moving badblock handling to file system
>> level avoid the checking phase? In file system level for each I/O I
>> still have to check the badblock list, right? Do you mean during mount
>> it can go through the pmem device and locates all the data structures
>> mangled by badblocks and handle them accordingly, so that during
>> normal running the badblocks will never be accessed? Or, if there is
>> replication/snapshot support, use a copy to recover the badblocks?
>>
>> How about operations bypass the file system, i.e. mmap?
>
> Hi Andiry,
>
> I do mean that in the filesystem, for every IO, the badblocks will be
> checked. Currently, the pmem driver does this, and the hope is that the
> filesystem can do a better job at it. The driver unconditionally checks
> every IO for badblocks on the whole device. Depending on how the
> badblocks are represented in the filesystem, we might be able to quickly
> tell if a file/range has existing badblocks, and error out the IO
> accordingly.
>
> At mount the fs would read the existing badblocks on the block
> device, and build its own representation of them. Then during normal
> use, if the underlying badblocks change, the fs would get a notification
> that would allow it to also update its own representation.
>
> Yes, if there is replication etc support in the filesystem, we could try
> to recover using that, but I haven't put much thought in that direction.
>
> Like I said in a previous reply, mmap can be a tricky case, and other
> than handling the machine check exception, there may not be anything
> else we can do..
> If the range we're faulting on has known errors in badblocks, the fault
> will fail with SIGBUS (see where pmem_direct_access() fails due to
> badblocks). For latent errors that are not known in badblocks, if the
> platform has MCE recovery, there is an MCE handler for pmem currently,
> that will add that address to badblocks. If MCE recovery is absent, then
> the system will crash/reboot, and the next time the driver populates
> badblocks, that address will appear in it.

Thank you for the reply. That is very clear.
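
To check that I follow, here is a minimal userspace sketch of the scheme
as I understand it: the fs keeps its own per-file record of bad ranges,
fills it at mount or from the block layer notification, and checks an IO
only against that file's ranges. All names here (file_badblocks,
file_bad_add, file_bad_remove, file_io_hits_bad) are made up for
illustration and are not kernel or RFC interfaces. The fixed-size array is
also just an assumption to keep the example short.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-file record of known-bad byte ranges (not a kernel API). */
struct bad_range {
	uint64_t off;	/* byte offset within the file */
	uint64_t len;	/* length in bytes, always > 0 */
};

#define MAX_BAD 8

struct file_badblocks {
	struct bad_range r[MAX_BAD];
	size_t nr;
};

/* Called at mount, or from a "badblocks added" notification. */
bool file_bad_add(struct file_badblocks *fb, uint64_t off, uint64_t len)
{
	if (fb->nr == MAX_BAD || len == 0)
		return false;
	fb->r[fb->nr].off = off;
	fb->r[fb->nr].len = len;
	fb->nr++;
	return true;
}

/* Called from a "badblocks cleared" notification; drops exact matches only. */
void file_bad_remove(struct file_badblocks *fb, uint64_t off, uint64_t len)
{
	for (size_t i = 0; i < fb->nr; i++) {
		if (fb->r[i].off == off && fb->r[i].len == len) {
			fb->r[i] = fb->r[fb->nr - 1];
			fb->nr--;
			return;
		}
	}
}

/* Returns true if [off, off + len) overlaps a known-bad range of this file. */
bool file_io_hits_bad(const struct file_badblocks *fb, uint64_t off, uint64_t len)
{
	for (size_t i = 0; i < fb->nr; i++) {
		const struct bad_range *b = &fb->r[i];

		if (off < b->off + b->len && b->off < off + len)
			return true;
	}
	return false;
}
```

The point of the sketch is only that the lookup cost depends on the number
of bad ranges in one file rather than on the whole device.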

>>
>> >>
>> >> > > > A while back, Dave Chinner had suggested a move towards smarter
>> >> > > > handling, and I posted initial RFC patches [1], but since then the topic
>> >> > > > hasn't really moved forward.
>> >> > > >
>> >> > > > I'd like to propose and have a discussion about the following new
>> >> > > > functionality:
>> >> > > >
>> >> > > > 1. Filesystems develop a native representation of badblocks. For
>> >> > > > example, in xfs, this would (presumably) be linked to the reverse
>> >> > > > mapping btree. The filesystem representation has the potential to be
>> >> > > > more efficient than the block driver doing the check, as the fs can
>> >> > > > check the IO happening on a file against just that file's range.
>> >>
>> >> OTOH that means we'd have to check /every/ file IO request against the
>> >> rmapbt, which will make things reaaaaaally slow.  I suspect it might be
>> >> preferable just to let the underlying pmem driver throw an error at us.
>> >>
>> >> (Or possibly just cache the bad extents in memory.)
>> >
>> > Interesting - this would be a good discussion to have. My motivation for
>> > this was the reasoning that the pmem driver has to check every single IO
>> > against badblocks, and maybe the fs can do a better job. But if you
>> > think the fs will actually be slower, we should try to somehow benchmark
>> > that!
>> >
>> >>
>> >> > > What do you mean by "file system can check the IO happening on a file"?
>> >> > > Do you mean read or write operations? What about metadata?
>> >> >
>> >> > For the purpose described above, i.e. returning early EIOs when
>> >> > possible, this will be limited to reads and metadata reads. If we're
>> >> > about to do a metadata read, and realize the block(s) about to be read
>> >> > are on the badblocks list, then we do the same thing as when we discover
>> >> > other kinds of metadata corruption.
>> >>
>> >> ...fail and shut down? :)
>> >>
>> >> Actually, for metadata either we look at the xfs_bufs to see if it's in
>> >> memory (XFS doesn't directly access metadata) and write it back out; or
>> >> we could fire up the online repair tool to rebuild the metadata.
>> >
>> > Agreed, I was just stressing that this scenario does not change from
>> > status quo, and really recovering from corruption isn't the problem
>> > we're trying to solve here :)
>> >
>> >>
>> >> > > If we are talking about discovering a bad block on a read operation, then
>> >> > > few modern file systems are able to survive it, whether for metadata or
>> >> > > for user data. Let's imagine that we have a really mature file
>> >> > > system driver; what does it mean to encounter a bad block? The failure
>> >> > > to read a logical block of some metadata (bad block) means that we are
>> >> > > unable to extract some part of a metadata structure. From the file system
>> >> > > driver's point of view, it looks like our file system is corrupted: we need
>> >> > > to stop file system operations and, finally, check and recover the file
>> >> > > system volume by means of the fsck tool. If we find a bad block in some
>> >> > > user file then, again, it looks like an issue. Some file systems simply
>> >> > > return "unrecovered read error". Another one, theoretically, is able
>> >> > > to survive because of snapshots, for example. But, anyway, it will look
>> >> > > like a read-only mount state, and the user will need to resolve such
>> >> > > trouble by hand.
>> >> >
>> >> > As far as I can tell, all of these things remain the same. The goal here
>> >> > isn't to survive more NVM badblocks than we would've before, and lost
>> >> > data or lost metadata will continue to have the same consequences as
>> >> > before, and will need the same recovery actions/intervention as before.
>> >> > The goal is to make the failure model similar to what users expect
>> >> > today, and as much as possible to make recovery actions similarly
>> >> > intuitive.
>> >> >
>> >> > >
>> >> > > If we are talking about discovering a bad block during a write operation then,
>> >> > > again, we are in trouble. Usually, we use an asynchronous model
>> >> > > of write/flush operations. First, we prepare a consistent state of all our
>> >> > > metadata structures in memory. The flush operations for metadata
>> >> > > and user data can be done at different times. And what should be done if we
>> >> > > discover a bad block for any piece of metadata or user data? Simple tracking of
>> >> > > bad blocks is not enough at all. Let's consider user data first. If we cannot
>> >> > > write some file's block successfully then we have two options: (1) forget about
>> >> > > this piece of data; (2) try to change the associated LBA for this piece of data.
>> >> > > Re-allocating the LBA for a discovered bad block
>> >> > > (user data case) sounds like real pain, because you need to rebuild the metadata
>> >> > > that tracks the location of this part of the file. And it sounds practically
>> >> > > impossible for an LFS file system, for example.
>> >> > > If we have trouble flushing any part of the metadata then it sounds like a
>> >> > > complete disaster for any file system.
>> >> >
>> >> > Writes can get more complicated in certain cases. If it is a regular
>> >> > page cache writeback, or any aligned write that goes through the block
>> >> > driver, that is completely fine. The block driver will check whether the
>> >> > block was previously marked as bad, do a "clear poison" operation
>> >> > (defined in the ACPI spec), which tells the firmware that the poison bit
>> >> > is OK to be cleared, and write the new data. This also removes the
>> >> > block from the badblocks list, and in this scheme, triggers a
>> >> > notification to the filesystem that it too can remove the block from its
>> >> > accounting. mmap writes and DAX can get more complicated, and at times
>> >> > they will just trigger a SIGBUS, and there's no way around that.
>> >> >
>> >> > >
>> >> > > Are you really sure that the file system should handle the bad block issue?
>> >> > >
>> >> > > > In contrast, today, the block driver checks against the whole block device
>> >> > > > range for every IO. On encountering badblocks, the filesystem can
>> >> > > > generate a better notification/error message that points the user to
>> >> > > > (file, offset) as opposed to the block driver, which can only provide
>> >> > > > (block-device, sector).
>> >>
>> >> <shrug> We can do the translation with the backref info...
>> >
>> > Yes we should at least do that. I'm guessing this would happen in XFS
>> > when it gets an EIO from an IO submission? The bio submission path in
>> > the fs is probably not synchronous (correct?), but whenever it gets the
>> > EIO, I'm guessing we just print a loud error message after doing the
>> > backref lookup..
>> >
>> >>
>> >> > > > 2. The block layer adds a notifier to badblock addition/removal
>> >> > > > operations, which the filesystem subscribes to, and uses to maintain its
>> >> > > > badblocks accounting. (This part is implemented as a proof of concept in
>> >> > > > the RFC mentioned above [1]).
>> >> > >
>> >> > > I am not sure that any bad block notification during/after an IO operation
>> >> > > is valuable for the file system. Maybe it could help if the file system simply
>> >> > > knew about bad blocks before allocating logical blocks.
>> >> > > But what subsystem will discover bad blocks before any IO operations?
>> >> > > How will the file system receive this information or some bad block table?
>> >> >
>> >> > The driver populates its badblocks lists whenever an Address Range Scrub
>> >> > is started (also via ACPI methods). This is always done at
>> >> > initialization time, so that it can build an in-memory representation of
>> >> > the badblocks. Additionally, this can also be triggered manually. And
>> >> > finally badblocks can also get populated for new latent errors when a
>> >> > machine check exception occurs. All of these can trigger notification to
>> >> > the file system without actual user reads happening.
>> >> >
>> >> > > I am not convinced that the suggested badblocks approach is really feasible.
>> >> > > Also I am not sure that the file system should see the bad blocks at all.
>> >> > > Why can't the hardware manage this issue for us?
>> >> >
>> >> > Hardware does manage the actual badblocks issue for us in the sense that
>> >> > when it discovers a badblock it will do the remapping. But since this is
>> >> > on the memory bus, and has different error signatures than applications
>> >> > are used to, we want to make the error handling similar to the existing
>> >> > storage model.
>> >>
>> >> Yes please and thank you, to the "error handling similar to the existing
>> >> storage model".  Even better if this just gets added to a layer
>> >> underneath the fs so that IO to bad regions returns EIO. 8-)
>> >
>> > This (if this just gets added to a layer underneath the fs so that IO to bad
>> > regions returns EIO) already happens :)  See pmem_do_bvec() in
>> > drivers/nvdimm/pmem.c, where we return EIO for a known badblock on a
>> > read. I'm wondering if this can be improved..
>> >
>>
>> The pmem_do_bvec() read logic is like this:
>>
>> pmem_do_bvec()
>>     if (is_bad_pmem())
>>         return -EIO;
>>     else
>>         memcpy_from_pmem();
>>
>> Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
>> that even if a block is not in the badblock list, it can still be bad
>> and cause an MCE? Does the badblock list change while the file system
>> is running? If that is the case, should the file system get a
>> notification when it changes? If a block is good when I first
>> read it, can I still trust it to be good for the second access?
>
> Yes, if a block is not in the badblocks list, it can still cause an
> MCE. This is the latent error case I described above. For a simple read()
> via the pmem driver, this will get handled by memcpy_mcsafe. For mmap,
> an MCE is inevitable.
>
> Yes the badblocks list may change while a filesystem is running. The RFC
> patches[1] I linked to add a notification for the filesystem when this
> happens.
>

This is really bad and it makes file system implementation much more
complicated. And badblock notification does not help very much,
because any block can potentially be bad, whether or not it is in the
badblock list. The file system has to check every read using
memcpy_mcsafe(). This is a disaster for a file system like NOVA, which
uses pointer dereferences to access data structures on pmem. Now if I
want to read a field in an inode on pmem, I have to copy it to DRAM
first and make sure memcpy_mcsafe() does not report anything wrong.
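
To illustrate the cost, here is a rough userspace sketch of what every
field read would have to look like. It is only a simulation:
sim_mcsafe_copy() is a made-up stand-in for a machine-check-safe copy such
as memcpy_mcsafe(), the poison range is set by hand, and struct pm_inode
is a hypothetical layout, not NOVA's real inode.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-pmem inode layout, for illustration only. */
struct pm_inode {
	uint64_t size;
	uint32_t mode;
	uint32_t links;
};

/* Simulated poisoned range; a real system only learns of it via an MCE. */
static uintptr_t poison_start;
static size_t poison_len;

void sim_set_poison(const void *start, size_t len)
{
	poison_start = (uintptr_t)start;
	poison_len = len;
}

/*
 * Userspace stand-in for a machine-check-safe copy: it fails with -EIO
 * instead of faulting when the source overlaps the simulated poison.
 */
int sim_mcsafe_copy(void *dst, const void *src, size_t len)
{
	uintptr_t s = (uintptr_t)src;

	if (poison_len && s < poison_start + poison_len && poison_start < s + len)
		return -EIO;
	memcpy(dst, src, len);
	return 0;
}

/* Read one field by first copying the whole inode into DRAM. */
int pm_inode_read_size(const struct pm_inode *on_pmem, uint64_t *size)
{
	struct pm_inode copy;
	int err = sim_mcsafe_copy(&copy, on_pmem, sizeof(copy));

	if (err)
		return err;
	*size = copy.size;
	return 0;
}
```

Compare this with a plain on_pmem->size dereference: one load becomes a
copy of the whole structure plus an error check on every access.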

> No, if the media, for some reason, 'develops' a bad cell, a second
> consecutive read does have a chance of being bad. Once a location has
> been marked as bad, it will stay bad till the ACPI clear error 'DSM' has
> been called to mark it as clean.
>

I wonder what happens to a write in this case. Suppose a block is bad but
not reported in the badblock list, and I write to it without reading it
first. Do I clear the poison with the write, or is an ACPI DSM still
required?

> [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
>

Thank you for the patchset. I will look into it.

Thanks,
Andiry

>
>>
>> Thanks,
>> Andiry
>>
>> >>
>> >> (Sleeeeep...)
>> >>
>> >> --D
>> >>
>> >> >
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> >> > the body of a message to majordomo@vger.kernel.org
>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 23:20               ` Andiry Xu
@ 2017-01-17 23:51                 ` Vishal Verma
  2017-01-18  1:58                   ` Andiry Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Vishal Verma @ 2017-01-17 23:51 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block@vger.kernel.org, Viacheslav Dubeyko, Linux FS Devel,
	lsf-pc@lists.linux-foundation.org

On 01/17, Andiry Xu wrote:

<snip>

> >>
> >> The pmem_do_bvec() read logic is like this:
> >>
> >> pmem_do_bvec()
> >>     if (is_bad_pmem())
> >>         return -EIO;
> >>     else
> >>         memcpy_from_pmem();
> >>
> >> Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
> >> that even if a block is not in the badblock list, it still can be bad
> >> and causes MCE? Does the badblock list get changed during file system
> >> running? If that is the case, should the file system get a
> >> notification when it gets changed? If a block is good when I first
> >> read it, can I still trust it to be good for the second access?
> >
> > Yes, if a block is not in the badblocks list, it can still cause an
> > MCE. This is the latent error case I described above. For a simple read()
> > via the pmem driver, this will get handled by memcpy_mcsafe. For mmap,
> > an MCE is inevitable.
> >
> > Yes the badblocks list may change while a filesystem is running. The RFC
> > patches[1] I linked to add a notification for the filesystem when this
> > happens.
> >
> 
> This is really bad and it makes file system implementation much more
> complicated. And badblock notification does not help very much,
> because any block can be bad potentially, no matter it is in badblock
> list or not. And file system has to perform checking for every read,
> using memcpy_mcsafe. This is disaster for file system like NOVA, which
> uses pointer de-reference to access data structures on pmem. Now if I
> want to read a field in an inode on pmem, I have to copy it to DRAM
> first and make sure memcpy_mcsafe() does not report anything wrong.

You have a good point, and I don't know if I have an answer for this..
Assuming a system with MCE recovery, maybe NOVA can add an MCE handler
similar to nfit_handle_mce(), and handle errors as they happen, but I'm
being very hand-wavy here and don't know how much/how well that might
work..

> 
> > No, if the media, for some reason, 'develops' a bad cell, a second
> > consecutive read does have a chance of being bad. Once a location has
> > been marked as bad, it will stay bad till the ACPI clear error 'DSM' has
> > been called to mark it as clean.
> >
> 
> I wonder what happens to write in this case? If a block is bad but not
> reported in badblock list. Now I write to it without reading first. Do
> I clear the poison with the write? Or still require a ACPI DSM?

With writes, my understanding is there is still a possibility that an
internal read-modify-write can happen, and cause an MCE (this is the same
as writing to a bad DRAM cell, which can also cause an MCE). You can't
really use the ACPI DSM preemptively because you don't know whether the
location was bad. The error flow will be something like: the write causes
the MCE, a badblock gets added (either through the MCE handler or after
the next reboot), and the recovery path is now the same as for a regular
badblock.

> 
> > [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
> >
> 
> Thank you for the patchset. I will look into it.
> 

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 23:51                 ` Vishal Verma
@ 2017-01-18  1:58                   ` Andiry Xu
       [not found]                     ` <CAOvWMLZCt39EDg-1uppVVUeRG40JvOo9sKLY2XMuynZdnc0W9w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 19+ messages in thread
From: Andiry Xu @ 2017-01-18  1:58 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block@vger.kernel.org, Viacheslav Dubeyko, Linux FS Devel,
	lsf-pc@lists.linux-foundation.org

On Tue, Jan 17, 2017 at 3:51 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> On 01/17, Andiry Xu wrote:
>
> <snip>
>
>> >>
>> >> The pmem_do_bvec() read logic is like this:
>> >>
>> >> pmem_do_bvec()
>> >>     if (is_bad_pmem())
>> >>         return -EIO;
>> >>     else
>> >>         memcpy_from_pmem();
>> >>
>> >> Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
>> >> that even if a block is not in the badblock list, it still can be bad
>> >> and causes MCE? Does the badblock list get changed during file system
>> >> running? If that is the case, should the file system get a
>> >> notification when it gets changed? If a block is good when I first
>> >> read it, can I still trust it to be good for the second access?
>> >
>> > Yes, if a block is not in the badblocks list, it can still cause an
>> > MCE. This is the latent error case I described above. For a simple read()
>> > via the pmem driver, this will get handled by memcpy_mcsafe. For mmap,
>> > an MCE is inevitable.
>> >
>> > Yes the badblocks list may change while a filesystem is running. The RFC
>> > patches[1] I linked to add a notification for the filesystem when this
>> > happens.
>> >
>>
>> This is really bad and it makes file system implementation much more
>> complicated. And badblock notification does not help very much,
>> because any block can be bad potentially, no matter it is in badblock
>> list or not. And file system has to perform checking for every read,
>> using memcpy_mcsafe. This is disaster for file system like NOVA, which
>> uses pointer de-reference to access data structures on pmem. Now if I
>> want to read a field in an inode on pmem, I have to copy it to DRAM
>> first and make sure memcpy_mcsafe() does not report anything wrong.
>
> You have a good point, and I don't know if I have an answer for this..
> Assuming a system with MCE recovery, maybe NOVA can add a mce handler
> similar to nfit_handle_mce(), and handle errors as they happen, but I'm
> being very hand-wavey here and don't know how much/how well that might
> work..
>
>>
>> > No, if the media, for some reason, 'develops' a bad cell, a second
>> > consecutive read does have a chance of being bad. Once a location has
>> > been marked as bad, it will stay bad till the ACPI clear error 'DSM' has
>> > been called to mark it as clean.
>> >
>>
>> I wonder what happens to write in this case? If a block is bad but not
>> reported in badblock list. Now I write to it without reading first. Do
>> I clear the poison with the write? Or still require a ACPI DSM?
>
> With writes, my understanding is there is still a possibility that an
> internal read-modify-write can happen, and cause a MCE (this is the same
> as writing to a bad DRAM cell, which can also cause an MCE). You can't
> really use the ACPI DSM preemptively because you don't know whether the
> location was bad. The error flow will be something like write causes the
> MCE, a badblock gets added (either through the mce handler or after the
> next reboot), and the recovery path is now the same as a regular badblock.
>

This is different from my understanding. Right now write_pmem() in
pmem_do_bvec() does not use memcpy_mcsafe(). If the block is bad, it
clears the poison and writes to pmem again. It seems to me that writing
to bad blocks does not cause an MCE. Do we need memcpy_mcsafe() for pmem
stores?
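
To spell out how I read the driver, here is a toy userspace model of the
two paths. It is a sketch under my own assumptions, not the driver code:
struct sim_pmem, sim_read_sector() and sim_write_sector() are invented
names, and flipping the bad flag stands in for the real clear-poison step.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SIM_SECTORS 8
#define SIM_SECTOR_SIZE 512

/* Toy model of a pmem device: data plus a per-sector "known bad" flag. */
struct sim_pmem {
	unsigned char data[SIM_SECTORS][SIM_SECTOR_SIZE];
	bool bad[SIM_SECTORS];
};

/* Read path: short-circuit known-bad sectors with -EIO. */
int sim_read_sector(const struct sim_pmem *p, size_t sector, void *buf)
{
	if (sector >= SIM_SECTORS)
		return -EINVAL;
	if (p->bad[sector])
		return -EIO;
	memcpy(buf, p->data[sector], SIM_SECTOR_SIZE);
	return 0;
}

/* Write path: store; if the sector was known bad, clear it and store again. */
int sim_write_sector(struct sim_pmem *p, size_t sector, const void *buf)
{
	if (sector >= SIM_SECTORS)
		return -EINVAL;
	memcpy(p->data[sector], buf, SIM_SECTOR_SIZE);
	if (p->bad[sector]) {
		p->bad[sector] = false;	/* stands in for the clear-poison step */
		memcpy(p->data[sector], buf, SIM_SECTOR_SIZE);
	}
	return 0;
}
```

In this model a write to a known-bad sector always succeeds and makes the
sector readable again, which is the behaviour I am asking about.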

Thanks,
Andiry

>>
>> > [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
>> >
>>
>> Thank you for the patchset. I will look into it.
>>

[parent not found: <CAOvWMLZCt39EDg-1uppVVUeRG40JvOo9sKLY2XMuynZdnc0W9w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
       [not found]                     ` <CAOvWMLZCt39EDg-1uppVVUeRG40JvOo9sKLY2XMuynZdnc0W9w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-01-20  0:32                       ` Verma, Vishal L
  0 siblings, 0 replies; 19+ messages in thread
From: Verma, Vishal L @ 2017-01-20  0:32 UTC (permalink / raw)
  To: andiry-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
  Cc: Vyacheslav.Dubeyko-Sjgp3cTcYWE@public.gmane.org,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org,
	linux-block-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	slava-yeENwD64cLxBDgjK7y7TUQ@public.gmane.org,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org

On Tue, 2017-01-17 at 17:58 -0800, Andiry Xu wrote:
> On Tue, Jan 17, 2017 at 3:51 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> > On 01/17, Andiry Xu wrote:
> > 
> > <snip>
> > 
> > > > > 
> > > > > The pmem_do_bvec() read logic is like this:
> > > > > 
> > > > > pmem_do_bvec()
> > > > >     if (is_bad_pmem())
> > > > >         return -EIO;
> > > > >     else
> > > > >         memcpy_from_pmem();
> > > > > 
> > > > > Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
> > > > > that even if a block is not in the badblock list, it still can be bad
> > > > > and causes MCE? Does the badblock list get changed during file system
> > > > > running? If that is the case, should the file system get a
> > > > > notification when it gets changed? If a block is good when I first
> > > > > read it, can I still trust it to be good for the second access?
> > > > 
> > > > Yes, if a block is not in the badblocks list, it can still cause an
> > > > MCE. This is the latent error case I described above. For a simple read()
> > > > via the pmem driver, this will get handled by memcpy_mcsafe. For mmap,
> > > > an MCE is inevitable.
> > > > 
> > > > Yes the badblocks list may change while a filesystem is running. The RFC
> > > > patches[1] I linked to add a notification for the filesystem when this
> > > > happens.
> > > > 
> > > 
> > > This is really bad and it makes file system implementation much more
> > > complicated. And badblock notification does not help very much,
> > > because any block can be bad potentially, no matter it is in badblock
> > > list or not. And file system has to perform checking for every read,
> > > using memcpy_mcsafe. This is disaster for file system like NOVA, which
> > > uses pointer de-reference to access data structures on pmem. Now if I
> > > want to read a field in an inode on pmem, I have to copy it to DRAM
> > > first and make sure memcpy_mcsafe() does not report anything wrong.
> > 
> > You have a good point, and I don't know if I have an answer for this..
> > Assuming a system with MCE recovery, maybe NOVA can add a mce handler
> > similar to nfit_handle_mce(), and handle errors as they happen, but I'm
> > being very hand-wavey here and don't know how much/how well that might
> > work..
> > 
> > > 
> > > > No, if the media, for some reason, 'develops' a bad cell, a second
> > > > consecutive read does have a chance of being bad. Once a location has
> > > > been marked as bad, it will stay bad till the ACPI clear error 'DSM' has
> > > > been called to mark it as clean.
> > > > 
> > > 
> > > I wonder what happens to write in this case? If a block is bad but not
> > > reported in badblock list. Now I write to it without reading first. Do
> > > I clear the poison with the write? Or still require a ACPI DSM?
> > 
> > With writes, my understanding is there is still a possibility that an
> > internal read-modify-write can happen, and cause a MCE (this is the same
> > as writing to a bad DRAM cell, which can also cause an MCE). You can't
> > really use the ACPI DSM preemptively because you don't know whether the
> > location was bad. The error flow will be something like write causes the
> > MCE, a badblock gets added (either through the mce handler or after the
> > next reboot), and the recovery path is now the same as a regular badblock.
> > 
> 
> This is different from my understanding. Right now write_pmem() in
> pmem_do_bvec() does not use memcpy_mcsafe(). If the block is bad it
> clears poison and writes to pmem again. Seems to me writing to bad
> blocks does not cause MCE. Do we need memcpy_mcsafe for pmem stores?

You are right, writes don't use memcpy_mcsafe, and will not directly
cause an MCE. However, a write can cause an asynchronous 'CMCI' -
corrected machine check interrupt, but this is not critical, and won't be
a memory error as the core didn't consume poison. memcpy_mcsafe cannot
protect against this because the write is 'posted' and the CMCI is not
synchronous. Note that this is only in the latent error or memmap-store
case.

> 
> Thanks,
> Andiry
> 
> > > 
> > > > [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
> > > > 
> > > 
> > > Thank you for the patchset. I will look into it.
> > > 

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 22:15           ` Andiry Xu
  2017-01-17 22:37             ` Vishal Verma
@ 2017-01-18  0:16             ` Andreas Dilger
  2017-01-18  2:01               ` Andiry Xu
  1 sibling, 1 reply; 19+ messages in thread
From: Andreas Dilger @ 2017-01-18  0:16 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Vishal Verma, Darrick J. Wong, Slava Dubeyko,
	lsf-pc@lists.linux-foundation.org, linux-nvdimm@lists.01.org,
	linux-block@vger.kernel.org, Linux FS Devel, Viacheslav Dubeyko

On Jan 17, 2017, at 3:15 PM, Andiry Xu <andiry@gmail.com> wrote:
> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
>> On 01/16, Darrick J. Wong wrote:
>>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>>>> On 01/14, Slava Dubeyko wrote:
>>>>> 
>>>>> ---- Original Message ----
>>>>> Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>>>>> Sent: Jan 13, 2017 1:40 PM
>>>>> From: "Verma, Vishal L" <vishal.l.verma@intel.com>
>>>>> To: lsf-pc@lists.linux-foundation.org
>>>>> Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
>>>>> 
>>>>>> The current implementation of badblocks, where we consult the
>>>>>> badblocks list for every IO in the block driver works, and is a
>>>>>> last option failsafe, but from a user perspective, it isn't the
>>>>>> easiest interface to work with.
>>>>> 
>>>>> As I remember, the FAT and HFS+ specifications contain a description of a bad blocks
>>>>> (physical sectors) table. I believe that this table was used for
>>>>> floppy media. But, eventually, this table became a completely obsolete
>>>>> artefact because most storage devices are reliable enough. Why do you need
>>> 
>>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>>> doesn't support(??) extents or 64-bit filesystems, and might just be a
>>> vestigial organ at this point.  XFS doesn't have anything to track bad
>>> blocks currently....
>>> 
>>>>> to expose the bad blocks at the file system level?  Do you expect that the next
>>>>> generation of NVM memory will be so unreliable that the file system needs to manage
>>>>> bad blocks? What about erasure coding schemes? Does the file system really need to suffer
>>>>> from the bad block issue?
>>>>> 
>>>>> Usually, we are using LBAs and it is the responsibility of the storage device to map
>>>>> a bad physical block/page/sector to a valid one. Do you mean that we have
>>>>> access to physical NVM memory addresses directly? But it looks like we can
>>>>> have a "bad block" issue even when we access data in a page cache memory
>>>>> page (if we use NVM memory for the page cache, of course). So, what do you
>>>>> mean by "bad block" issue?
>>>> 
>>>> We don't have direct physical access to the device's address space, in
>>>> the sense the device is still free to perform remapping of chunks of NVM
>>>> underneath us. The problem is that when a block or address range (as
>>>> small as a cache line) goes bad, the device maintains a poison bit for
>>>> every affected cache line. Behind the scenes, it may have already
>>>> remapped the range, but the cache line poison has to be kept so that
>>>> there is a notification to the user/owner of the data that something has
>>>> been lost. Since NVM is byte addressable memory sitting on the memory
>>>> bus, such a poisoned cache line results in memory errors and SIGBUSes,
>>>> compared to traditional storage where an app will get nice and friendly
>>>> (relatively speaking..) -EIOs. The whole badblocks implementation was
>>>> done so that the driver can intercept IO (i.e. reads) to _known_ bad
>>>> locations, and short-circuit them with an EIO. If the driver doesn't
>>>> catch these, the reads will turn into a memory bus access, and the
>>>> poison will cause a SIGBUS.
>>> 
>>> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
>>> look kind of like a traditional block device? :)
>> 
>> Yes, the thing that makes pmem look like a block device :) --
>> drivers/nvdimm/pmem.c
>> 
>>> 
>>>> This effort is to try and make this badblock checking smarter - and try
>>>> and reduce the penalty on every IO to a smaller range, which only the
>>>> filesystem can do.
>>> 
>>> Though... now that XFS merged the reverse mapping support, I've been
>>> wondering if there'll be a resubmission of the device errors callback?
>>> It still would be useful to be able to inform the user that part of
>>> their fs has gone bad, or, better yet, if the buffer is still in memory
>>> someplace else, just write it back out.
>>> 
>>> Or I suppose if we had some kind of raid1 set up between memories we
>>> could read one of the other copies and rewrite it into the failing
>>> region immediately.
>> 
>> Yes, that is kind of what I was hoping to accomplish via this
>> discussion. How much would filesystems want to be involved in this sort
>> of badblocks handling, if at all. I can refresh my patches that provide
>> the fs notification, but that's the easy bit, and a starting point.
>> 
> 
> I have some questions. Why moving badblock handling to file system
> level avoid the checking phase? In file system level for each I/O I
> still have to check the badblock list, right? Do you mean during mount
> it can go through the pmem device and locates all the data structures
> mangled by badblocks and handle them accordingly, so that during
> normal running the badblocks will never be accessed? Or, if there is
> replication/snapshot support, use a copy to recover the badblocks?

With ext4 badblocks, the main outcome is that the bad blocks would be
permanently marked in the allocation bitmap as being used, and they would
never be allocated to a file, so they should never be accessed unless
doing a full device scan (which ext4 and e2fsck never do).  That would
avoid the need to check every I/O against the bad blocks list, if the
driver knows that the filesystem will handle this.
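
The idea above can be sketched in a few lines. This is only a toy model
of an allocation bitmap, not ext4's actual code; the class and method
names (BlockBitmap, mark_bad, alloc) are made up for illustration. It
shows that once a bad block's bit is set, the allocator never hands that
block out, so no per-I/O lookup is needed for it.

```python
import errno


class BlockBitmap:
    """Toy allocation bitmap: a set bit means 'in use' (or retired as bad)."""

    def __init__(self, nblocks, bad_blocks=()):
        self.nblocks = nblocks
        self.used = bytearray((nblocks + 7) // 8)
        self.bad = set()
        for blk in bad_blocks:
            self.mark_bad(blk)

    def _test(self, blk):
        return bool(self.used[blk // 8] & (1 << (blk % 8)))

    def _set(self, blk):
        self.used[blk // 8] |= 1 << (blk % 8)

    def mark_bad(self, blk):
        # Hypothetical policy: only a free block can be retired this way.
        # A block that already holds file data or metadata would need
        # separate handling, which this sketch does not attempt.
        if self._test(blk) and blk not in self.bad:
            raise ValueError(f"block {blk} is already allocated")
        self._set(blk)
        self.bad.add(blk)

    def alloc(self):
        # First-fit scan; bad blocks look 'used' and are skipped for free.
        for blk in range(self.nblocks):
            if not self._test(blk):
                self._set(blk)
                return blk
        raise OSError(errno.ENOSPC, "no free blocks")


bm = BlockBitmap(8, bad_blocks=[0, 2])
print([bm.alloc() for _ in range(6)])  # blocks 0 and 2 are never returned
```

Note that the sketch refuses to retire a block that is already
allocated; it says nothing about how such a block should be handled.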

The one caveat is that ext4 only allows 32-bit block numbers in the
badblocks list, since this feature hasn't been used in a long time.
This is good for filesystems up to 16TB (with 4KB blocks), but if there
were demand to use this feature again it would be possible to allow
64-bit block numbers.
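
The size limit follows directly from the width of the block number. A
quick check of the arithmetic, assuming a 4KB block size (the message
does not state the block size, so that part is an assumption):

```python
TIB = 1 << 40


def max_fs_bytes(block_number_bits, block_size):
    """Largest filesystem addressable with the given block-number width."""
    return (1 << block_number_bits) * block_size


# 2^32 block numbers * 4096 bytes per block = 2^44 bytes = 16 TiB
print(max_fs_bytes(32, 4096) // TIB)
```

With smaller blocks the ceiling drops proportionally, e.g. 1KB blocks
would give 4 TiB under the same 32-bit limit.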

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-18  0:16             ` Andreas Dilger
@ 2017-01-18  2:01               ` Andiry Xu
  2017-01-18  3:08                 ` Lu Zhang
  2017-01-20  0:55                 ` Verma, Vishal L
  0 siblings, 2 replies; 19+ messages in thread
From: Andiry Xu @ 2017-01-18  2:01 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block@vger.kernel.org, Viacheslav Dubeyko, Linux FS Devel,
	lsf-pc@lists.linux-foundation.org

On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Jan 17, 2017, at 3:15 PM, Andiry Xu <andiry@gmail.com> wrote:
>> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
>>> On 01/16, Darrick J. Wong wrote:
>>>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>>>>> On 01/14, Slava Dubeyko wrote:
>>>>>>
>>>>>> ---- Original Message ----
>>>>>> Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>>>>>> Sent: Jan 13, 2017 1:40 PM
>>>>>> From: "Verma, Vishal L" <vishal.l.verma@intel.com>
>>>>>> To: lsf-pc@lists.linux-foundation.org
>>>>>> Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
>>>>>>
>>>>>>> The current implementation of badblocks, where we consult the
>>>>>>> badblocks list for every IO in the block driver works, and is a
>>>>>>> last option failsafe, but from a user perspective, it isn't the
>>>>>>> easiest interface to work with.
>>>>>>
>>>>>> As I remember, FAT and HFS+ specifications contain description of bad blocks
>>>>>> (physical sectors) table. I believe that this table was used for the case of
>>>>>> floppy media. But, finally, this table becomes to be the completely obsolete
>>>>>> artefact because mostly storage devices are reliably enough. Why do you need
>>>>
>>>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>>>> doesn't support(??) extents or 64-bit filesystems, and might just be a
>>>> vestigial organ at this point.  XFS doesn't have anything to track bad
>>>> blocks currently....
>>>>
>>>>>> in exposing the bad blocks on the file system level?  Do you expect that next
>>>>>> generation of NVM memory will be so unreliable that file system needs to manage
>>>>>> bad blocks? What's about erasure coding schemes? Do file system really need to suffer
>>>>>> from the bad block issue?
>>>>>>
>>>>>> Usually, we are using LBAs and it is the responsibility of storage device to map
>>>>>> a bad physical block/page/sector into valid one. Do you mean that we have
>>>>>> access to physical NVM memory address directly? But it looks like that we can
>>>>>> have a "bad block" issue even we will access data into page cache's memory
>>>>>> page (if we will use NVM memory for page cache, of course). So, what do you
>>>>>> imply by "bad block" issue?
>>>>>
>>>>> We don't have direct physical access to the device's address space, in
>>>>> the sense the device is still free to perform remapping of chunks of NVM
>>>>> underneath us. The problem is that when a block or address range (as
>>>>> small as a cache line) goes bad, the device maintains a poison bit for
>>>>> every affected cache line. Behind the scenes, it may have already
>>>>> remapped the range, but the cache line poison has to be kept so that
>>>>> there is a notification to the user/owner of the data that something has
>>>>> been lost. Since NVM is byte addressable memory sitting on the memory
>>>>> bus, such a poisoned cache line results in memory errors and SIGBUSes.
>>>>> Compared to traditional storage where an app will get nice and friendly
>>>>> (relatively speaking..) -EIOs. The whole badblocks implementation was
>>>>> done so that the driver can intercept IO (i.e. reads) to _known_ bad
>>>>> locations, and short-circuit them with an EIO. If the driver doesn't
>>>>> catch these, the reads will turn into a memory bus access, and the
>>>>> poison will cause a SIGBUS.
>>>>
>>>> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
>>>> look kind of like a traditional block device? :)
>>>
>>> Yes, the thing that makes pmem look like a block device :) --
>>> drivers/nvdimm/pmem.c
>>>
>>>>
>>>>> This effort is to try and make this badblock checking smarter - and try
>>>>> and reduce the penalty on every IO to a smaller range, which only the
>>>>> filesystem can do.
>>>>
>>>> Though... now that XFS merged the reverse mapping support, I've been
>>>> wondering if there'll be a resubmission of the device errors callback?
>>>> It still would be useful to be able to inform the user that part of
>>>> their fs has gone bad, or, better yet, if the buffer is still in memory
>>>> someplace else, just write it back out.
>>>>
>>>> Or I suppose if we had some kind of raid1 set up between memories we
>>>> could read one of the other copies and rewrite it into the failing
>>>> region immediately.
>>>
>>> Yes, that is kind of what I was hoping to accomplish via this
>>> discussion. How much would filesystems want to be involved in this sort
>>> of badblocks handling, if at all. I can refresh my patches that provide
>>> the fs notification, but that's the easy bit, and a starting point.
>>>
>>
>> I have some questions. Why moving badblock handling to file system
>> level avoid the checking phase? In file system level for each I/O I
>> still have to check the badblock list, right? Do you mean during mount
>> it can go through the pmem device and locates all the data structures
>> mangled by badblocks and handle them accordingly, so that during
>> normal running the badblocks will never be accessed? Or, if there is
>> replication/snapshot support, use a copy to recover the badblocks?
>
> With ext4 badblocks, the main outcome is that the bad blocks would be
> permanently marked in the allocation bitmap as being used, and they would
> never be allocated to a file, so they should never be accessed unless
> doing a full device scan (which ext4 and e2fsck never do).  That would
> avoid the need to check every I/O against the bad blocks list, if the
> driver knows that the filesystem will handle this.
>

Thank you for the explanation. However, this only works for free blocks,
right? What about allocated blocks, like file data and metadata?

Thanks,
Andiry

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-18  2:01               ` Andiry Xu
@ 2017-01-18  3:08                 ` Lu Zhang
  2017-01-20  0:46                   ` Vishal Verma
  2017-01-20  0:55                 ` Verma, Vishal L
  1 sibling, 1 reply; 19+ messages in thread
From: Lu Zhang @ 2017-01-18  3:08 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Andreas Dilger, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block@vger.kernel.org,
	Viacheslav Dubeyko, Linux FS Devel,
	lsf-pc@lists.linux-foundation.org

I'm curious about the fault model and the corresponding hardware ECC
mechanisms for NVDIMMs. As I understand it, for a memory access to
trigger an MCE, the memory controller must find a detectable but
uncorrectable error (DUE). So if there is no hardware ECC support, media
errors won't even be noticed, let alone reported through badblocks or
machine checks.

Current hardware ECC support for DRAM usually employs a (72, 64)
single-bit error correction mechanism, and among more advanced ECCs
there are techniques like Chipkill or SDDC, which can tolerate the
failure of a single DRAM chip. What is the expected ECC mode for
NVDIMMs, assuming that PCM or 3D XPoint based technology might have
higher error rates?
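
To make the "detectable but uncorrectable" case concrete, here is a
small sketch of a (72, 64) code built as a textbook extended Hamming
code: 64 data bits, 7 Hamming check bits, and 1 overall parity bit. This
is an assumption for illustration only; real memory controllers may use
a different construction, and nothing here describes what NVDIMMs
actually do. A single flipped bit is corrected, while two flipped bits
are detected but cannot be repaired, which is the DUE case.

```python
CODE_BITS = 72                      # index 0: overall parity; 1..71: Hamming
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]
DATA_POS = [p for p in range(1, CODE_BITS) if p not in PARITY_POS]  # 64 slots


def syndrome(code):
    """XOR of the indices of all set bits in positions 1..71."""
    s = 0
    for pos in range(1, CODE_BITS):
        if code[pos]:
            s ^= pos
    return s


def encode(word):
    """Encode a 64-bit integer into a list of 72 bits."""
    code = [0] * CODE_BITS
    for i, pos in enumerate(DATA_POS):
        code[pos] = (word >> i) & 1
    s = syndrome(code)
    for p in PARITY_POS:            # choose check bits so the syndrome is 0
        if s & p:
            code[p] = 1
    code[0] = sum(code) & 1         # make the total number of 1s even
    return code


def decode(code):
    """Return (status, word); word is None when the error is a DUE."""
    code = list(code)
    s = syndrome(code)
    parity_ok = (sum(code) & 1) == 0
    if s == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        if s >= CODE_BITS:          # more than one flip, not repairable
            return "uncorrectable", None
        code[s] ^= 1                # s == 0 means the parity bit itself
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrectable", None
    word = 0
    for i, pos in enumerate(DATA_POS):
        word |= code[pos] << i
    return status, word


cw = encode(0x0123456789ABCDEF)
cw[37] ^= 1                         # one bit flip
print(decode(cw)[0])
cw[60] ^= 1                         # a second flip in the same word
print(decode(cw)[0])
```

In this model the second case is where a machine check would be raised:
the controller knows the word is wrong but has no way to say which bits.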

If a DUE does happen and is flagged to the file system via MCE
(somehow...), and the fs finds that the error corrupts an allocated data
page or metadata, then for the fs to recover its data, the intuition is
that there needs to be a stronger error correction mechanism to correct
the hardware-uncorrectable errors. So knowing the hardware ECC baseline
would help the file system understand how severe the faults in badblocks
are, and develop its recovery methods.

Regards,
Lu


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-18  3:08                 ` Lu Zhang
@ 2017-01-20  0:46                   ` Vishal Verma
  2017-01-20  9:24                     ` Yasunori Goto
  0 siblings, 1 reply; 19+ messages in thread
From: Vishal Verma @ 2017-01-20  0:46 UTC (permalink / raw)
  To: Lu Zhang
  Cc: Andreas Dilger, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block@vger.kernel.org,
	Linux FS Devel, Viacheslav Dubeyko, Andiry Xu,
	lsf-pc@lists.linux-foundation.org

On 01/17, Lu Zhang wrote:
> I'm curious about the fault model and corresponding hardware ECC mechanisms
> for NVDIMMs. In my understanding for memory accesses to trigger MCE, it
> means the memory controller finds a detectable but uncorrectable error
> (DUE). So if there is no hardware ECC support the media errors won't even
> be noticed, not to mention badblocks or machine checks.
> 
> Current hardware ECC support for DRAM usually employs (72, 64) single-bit
> error correction mechanism, and for advanced ECCs there are techniques like
> Chipkill or SDDC which can tolerate a single DRAM chip failure. What is the
> expected ECC mode for NVDIMMs, assuming that PCM or 3dXpoint based
> technology might have higher error rates?

I'm sure once NVDIMMs start becoming widely available, there will be
more information on how they do ECC..

> 
> If DUE does happen and is flagged to the file system via MCE (somehow...),
> and the fs finds that the error corrupts its allocated data page, or
> metadata, now if the fs wants to recover its data the intuition is that
> there needs to be a stronger error correction mechanism to correct the
> hardware-uncorrectable errors. So knowing the hardware ECC baseline is
> helpful for the file system to understand how severe are the faults in
> badblocks, and develop its recovery methods.

As mentioned before, this discussion is more about presentation of
errors in a known consumable format, rather than recovering from errors.
While recovering from errors is interesting, we already have layers
like RAID for that, and they are as applicable to NVDIMM backed storage
as they have been for disk/SSD based storage.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-20  0:46                   ` Vishal Verma
@ 2017-01-20  9:24                     ` Yasunori Goto
       [not found]                       ` <20170120182435.0E12.E1E9C6FF-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  0 siblings, 1 reply; 19+ messages in thread
From: Yasunori Goto @ 2017-01-20  9:24 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Andreas Dilger, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, Andiry Xu,
	Viacheslav Dubeyko, Linux FS Devel,
	lsf-pc@lists.linux-foundation.org

Hello,
Vishal-san.

First of all, I find this discussion quite interesting. Thanks.

> > 
> > If DUE does happen and is flagged to the file system via MCE (somehow...),
> > and the fs finds that the error corrupts its allocated data page, or
> > metadata, now if the fs wants to recover its data the intuition is that
> > there needs to be a stronger error correction mechanism to correct the
> > hardware-uncorrectable errors. So knowing the hardware ECC baseline is
> > helpful for the file system to understand how severe are the faults in
> > badblocks, and develop its recovery methods.
> 
> Like mentioned before, this discussion is more about presentation of
> errors in a known consumable format, rather than recovering from errors.
> While recovering from errors is interesting, we already have layers
> like RAID for that, and they are as applicable to NVDIMM backed storage
> as they have been for disk/SSD based storage.

I have one question here.

Certainly, a user can use LVM mirroring with an NVDIMM in storage mode.
However, NVDIMMs also have a DAX mode.
Can a user use LVM mirroring with an NVDIMM in DAX mode?
I could not find any information saying that LVM supports DAX....

In addition, the current NVDIMM specs (*) only define an interleave
feature for NVDIMMs. They do not mention a mirroring feature.
So, I don't understand how to use mirroring with DAX.

(*) "NVDIMM Namespace Specification" , "NVDIMM Block Window Driver Writer’s Guide",
       and "ACPI 6.1"

Regards,
---
Yasunori Goto


^ permalink raw reply	[flat|nested] 19+ messages in thread

[parent not found: <20170120182435.0E12.E1E9C6FF-+CUm20s59erQFUHtdCDX3A@public.gmane.org>]

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
       [not found]                       ` <20170120182435.0E12.E1E9C6FF-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2017-01-21  0:23                         ` Kani, Toshimitsu
  0 siblings, 0 replies; 19+ messages in thread
From: Kani, Toshimitsu @ 2017-01-21  0:23 UTC (permalink / raw)
  To: y-goto-+CUm20s59erQFUHtdCDX3A@public.gmane.org,
	vishal.l.verma-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org
  Cc: adilger-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org,
	Vyacheslav.Dubeyko-Sjgp3cTcYWE@public.gmane.org,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org,
	linux-block-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	andiry-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	slava-yeENwD64cLxBDgjK7y7TUQ@public.gmane.org,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org

On Fri, 2017-01-20 at 18:24 +0900, Yasunori Goto wrote:
 :
> > 
> > Like mentioned before, this discussion is more about presentation
> > of errors in a known consumable format, rather than recovering from
> > errors. While recovering from errors is interesting, we already
> > have layers like RAID for that, and they are as applicable to
> > NVDIMM backed storage as they have been for disk/SSD based storage.
> 
> I have one question here.
> 
> Certainly, user can use LVM mirroring for storage mode of NVDIMM.
> However, NVDIMM has DAX mode. 
> Can user use LVM mirroring for NVDIMM DAX mode?
> I could not find any information that LVM support DAX....

dm-linear and dm-stripe support DAX.  This is done by mapping block
allocations to LVM physical devices.  Once blocks are allocated, all
DAX I/Os are direct and do not go through the device-mapper layer.  We
may be able to change it for read/write paths, but it remains true for
mmap.  So, I do not think DAX can be supported with LVM mirroring. 
This does not preclude hardware mirroring, though.
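
A rough way to see the difference is to look at what each kind of
mapping returns. The functions below are illustrative only (they are
not the device-mapper code, and the names are made up): a linear or
striped mapping resolves each logical sector to exactly one physical
location, which can then be accessed directly, while a mirror resolves
to several locations that must all be written.

```python
def linear_map(sector, start):
    """Linear-style mapping: a constant offset onto a single device."""
    return sector + start


def stripe_map(sector, chunk_sectors, ndevs):
    """Stripe-style mapping: returns (device index, sector on that device)."""
    chunk, within = divmod(sector, chunk_sectors)
    dev_chunk, dev = divmod(chunk, ndevs)
    return dev, dev_chunk * chunk_sectors + within


def mirror_map(sector, ndevs):
    """Mirror-style mapping: every copy must receive the same write."""
    return [(dev, sector) for dev in range(ndevs)]


print(stripe_map(19, chunk_sectors=8, ndevs=2))  # one location
print(mirror_map(19, ndevs=2))                   # one location per copy
```

With mmap, a store goes straight to the single mapped address, so there
is no point at which software could duplicate it to the second copy.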

-Toshi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-18  2:01               ` Andiry Xu
  2017-01-18  3:08                 ` Lu Zhang
@ 2017-01-20  0:55                 ` Verma, Vishal L
  1 sibling, 0 replies; 19+ messages in thread
From: Verma, Vishal L @ 2017-01-20  0:55 UTC (permalink / raw)
  To: andiry@gmail.com, adilger@dilger.ca
  Cc: darrick.wong@oracle.com, Vyacheslav.Dubeyko@wdc.com,
	linux-block@vger.kernel.org, slava@dubeyko.com,
	lsf-pc@lists.linux-foundation.org, linux-nvdimm@ml01.01.org,
	linux-fsdevel@vger.kernel.org

On Tue, 2017-01-17 at 18:01 -0800, Andiry Xu wrote:
> On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger <adilger@dilger.ca>
> wrote:
> > On Jan 17, 2017, at 3:15 PM, Andiry Xu <andiry@gmail.com> wrote:
> > > On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@inte
> > > l.com> wrote:
> > > > On 01/16, Darrick J. Wong wrote:
> > > > > On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> > > > > > On 01/14, Slava Dubeyko wrote:
> > > > > > > 
> > > > > > > ---- Original Message ----
> > > > > > > Subject: [LSF/MM TOPIC] Badblocks checking/representation
> > > > > > > in filesystems
> > > > > > > Sent: Jan 13, 2017 1:40 PM
> > > > > > > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> > > > > > > To: lsf-pc@lists.linux-foundation.org
> > > > > > > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org,
> > > > > > > linux-fsdevel@vger.kernel.org
> > > > > > > 
> > > > > > > > The current implementation of badblocks, where we
> > > > > > > > consult the
> > > > > > > > badblocks list for every IO in the block driver works,
> > > > > > > > and is a
> > > > > > > > last option failsafe, but from a user perspective, it
> > > > > > > > isn't the
> > > > > > > > easiest interface to work with.
> > > > > > > 
> > > > > > > As I remember, the FAT and HFS+ specifications contain a
> > > > > > > description of a bad blocks (physical sectors) table. I believe
> > > > > > > that this table was used for floppy media. But, in the end, this
> > > > > > > table became a completely obsolete artefact because most storage
> > > > > > > devices are reliable enough. Why do you need
> > > > > 
> > > > > ext4 has a badblocks inode to own all the bad spots on disk,
> > > > > but ISTR it
> > > > > doesn't support(??) extents or 64-bit filesystems, and might
> > > > > just be a
> > > > > vestigial organ at this point.  XFS doesn't have anything to
> > > > > track bad
> > > > > blocks currently....
> > > > > 
> > > > > > > in exposing the bad blocks on the file system level?  Do you
> > > > > > > expect that the next generation of NVM memory will be so
> > > > > > > unreliable that the file system needs to manage bad blocks?
> > > > > > > What about erasure coding schemes? Does the file system really
> > > > > > > need to suffer from the bad block issue?
> > > > > > > 
> > > > > > > Usually, we are using LBAs and it is the responsibility of the
> > > > > > > storage device to map a bad physical block/page/sector into a
> > > > > > > valid one. Do you mean that we have access to physical NVM
> > > > > > > memory addresses directly? But it looks like we can have a
> > > > > > > "bad block" issue even when we access data in a page cache
> > > > > > > memory page (if we use NVM memory for the page cache, of
> > > > > > > course). So, what do you imply by the "bad block" issue?
> > > > > > 
> > > > > > We don't have direct physical access to the device's address
> > > > > > space, in
> > > > > > the sense the device is still free to perform remapping of
> > > > > > chunks of NVM
> > > > > > underneath us. The problem is that when a block or address
> > > > > > range (as
> > > > > > small as a cache line) goes bad, the device maintains a
> > > > > > poison bit for
> > > > > > every affected cache line. Behind the scenes, it may have
> > > > > > already
> > > > > > remapped the range, but the cache line poison has to be kept
> > > > > > so that
> > > > > > there is a notification to the user/owner of the data that
> > > > > > something has
> > > > > > been lost. Since NVM is byte addressable memory sitting on
> > > > > > the memory
> > > > > > > bus, such a poisoned cache line results in memory errors and
> > > > > > > SIGBUSes, whereas with traditional storage an app will get nice
> > > > > > > and friendly (relatively speaking..) -EIOs. The whole badblocks
> > > > > > implementation was
> > > > > > done so that the driver can intercept IO (i.e. reads) to
> > > > > > _known_ bad
> > > > > > locations, and short-circuit them with an EIO. If the driver
> > > > > > doesn't
> > > > > > catch these, the reads will turn into a memory bus access,
> > > > > > and the
> > > > > > poison will cause a SIGBUS.
> > > > > 
> > > > > "driver" ... you mean XFS?  Or do you mean the thing that
> > > > > makes pmem
> > > > > look kind of like a traditional block device? :)
> > > > 
> > > > Yes, the thing that makes pmem look like a block device :) --
> > > > drivers/nvdimm/pmem.c
> > > > 
> > > > > 
> > > > > > This effort is to try and make this badblock checking
> > > > > > smarter - and try
> > > > > > and reduce the penalty on every IO to a smaller range, which
> > > > > > only the
> > > > > > filesystem can do.
> > > > > 
> > > > > Though... now that XFS merged the reverse mapping support,
> > > > > I've been
> > > > > wondering if there'll be a resubmission of the device errors
> > > > > callback?
> > > > > It still would be useful to be able to inform the user that
> > > > > part of
> > > > > their fs has gone bad, or, better yet, if the buffer is still
> > > > > in memory
> > > > > someplace else, just write it back out.
> > > > > 
> > > > > Or I suppose if we had some kind of raid1 set up between
> > > > > memories we
> > > > > could read one of the other copies and rewrite it into the
> > > > > failing
> > > > > region immediately.
> > > > 
> > > > Yes, that is kind of what I was hoping to accomplish via this
> > > > discussion. How much would filesystems want to be involved in
> > > > this sort
> > > > of badblocks handling, if at all. I can refresh my patches that
> > > > provide
> > > > the fs notification, but that's the easy bit, and a starting
> > > > point.
> > > > 
> > > 
> > > I have some questions. Why does moving badblock handling to the file
> > > system level avoid the checking phase? At the file system level I
> > > still have to check the badblock list for each I/O, right? Do you mean
> > > that during mount it can go through the pmem device, locate all the
> > > data structures mangled by badblocks and handle them accordingly, so
> > > that during normal running the badblocks will never be accessed? Or,
> > > if there is replication/snapshot support, use a copy to recover the
> > > badblocks?
> > 
> > With ext4 badblocks, the main outcome is that the bad blocks would be
> > permanently marked in the allocation bitmap as being used, and they
> > would never be allocated to a file, so they should never be accessed
> > unless doing a full device scan (which ext4 and e2fsck never do).  That
> > would avoid the need to check every I/O against the bad blocks list, if
> > the driver knows that the filesystem will handle this.
> > 
> 
> Thank you for the explanation. However this only works for free blocks,
> right? What about allocated blocks, like file data and metadata?
> 
Like Andreas said, the ext4 badblocks feature has not been in use, and
the current block layer badblocks are distinct and unrelated to these.
Can the ext4 badblocks infrastructure be revived and extended if we
decide to add badblocks to filesystems? Maybe - that was one of the
topics I was hoping to discuss/find out more about.
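
For readers following the thread, the two mechanisms contrasted above can
be illustrated with a small Python sketch. This is not kernel code and not
the ext4 or pmem implementation; the names BadBlockList, driver_read and
BlockBitmap are hypothetical and exist only for this example. The first
part models a driver that checks every read against a device-wide list of
bad sector ranges and short-circuits with -EIO; the second models the
ext4-style idea of marking bad blocks as used in the allocation bitmap so
that the allocator never hands them out.

```python
import errno
from bisect import bisect_right


class BadBlockList:
    """Sorted list of disjoint bad sector ranges (hypothetical model)."""

    def __init__(self):
        self._starts = []  # first sector of each bad range, kept sorted
        self._ends = []    # exclusive end sector of each bad range

    def add(self, start, count):
        # Assumption: callers only add ranges that do not overlap
        # existing ones, so no merging is attempted here.
        i = bisect_right(self._starts, start)
        self._starts.insert(i, start)
        self._ends.insert(i, start + count)

    def overlaps(self, start, count):
        # Locate the last bad range that begins at or before the final
        # sector of the I/O, then test whether it reaches into the I/O.
        i = bisect_right(self._starts, start + count - 1)
        return i > 0 and self._ends[i - 1] > start


def driver_read(badblocks, start, count):
    """Driver-side check: every read consults the device-wide list."""
    if badblocks.overlaps(start, count):
        return -errno.EIO  # short-circuit instead of touching poison
    return count           # pretend the read succeeded


class BlockBitmap:
    """Filesystem-side model: bad blocks are marked as already used."""

    def __init__(self, nblocks):
        self.used = [False] * nblocks

    def mark_bad(self, block):
        self.used[block] = True

    def alloc(self):
        # First-fit allocation; blocks marked bad are skipped because
        # they look allocated.
        for block, in_use in enumerate(self.used):
            if not in_use:
                self.used[block] = True
                return block
        raise OSError(errno.ENOSPC, "no free blocks")
```

Note that the bitmap approach only protects blocks that are free when they
go bad, which is exactly the limitation raised in the question quoted
above: blocks already holding file data or metadata still need some other
form of handling.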


end of thread, other threads:[~2017-01-21  0:23 UTC | newest]

Thread overview: 19+ messages
2017-01-13 21:40 [LSF/MM TOPIC] Badblocks checking/representation in filesystems Verma, Vishal L
     [not found] <at1mp6pou4lenesjdgh22k4p.1484345585589@email.android.com>
     [not found] ` <b9rbflutjt10mb4ofherta8j.1484345610771@email.android.com>
2017-01-14  0:00   ` Slava Dubeyko
2017-01-14  0:49     ` Vishal Verma
2017-01-16  2:27       ` Slava Dubeyko
2017-01-17  6:33       ` Darrick J. Wong
2017-01-17 21:35         ` Vishal Verma
2017-01-17 22:15           ` Andiry Xu
2017-01-17 22:37             ` Vishal Verma
2017-01-17 23:20               ` Andiry Xu
2017-01-17 23:51                 ` Vishal Verma
2017-01-18  1:58                   ` Andiry Xu
     [not found]                     ` <CAOvWMLZCt39EDg-1uppVVUeRG40JvOo9sKLY2XMuynZdnc0W9w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-01-20  0:32                       ` Verma, Vishal L
2017-01-18  0:16             ` Andreas Dilger
2017-01-18  2:01               ` Andiry Xu
2017-01-18  3:08                 ` Lu Zhang
2017-01-20  0:46                   ` Vishal Verma
2017-01-20  9:24                     ` Yasunori Goto
     [not found]                       ` <20170120182435.0E12.E1E9C6FF-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2017-01-21  0:23                         ` Kani, Toshimitsu
2017-01-20  0:55                 ` Verma, Vishal L
