- * Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-04  7:24           ` Damien Le Moal
@ 2017-01-04 12:39             ` Matias Bjørling
  2017-01-04 16:57             ` Theodore Ts'o
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Matias Bjørling @ 2017-01-04 12:39 UTC (permalink / raw)
  To: Damien Le Moal, Slava Dubeyko, Viacheslav Dubeyko,
	lsf-pc@lists.linux-foundation.org
  Cc: Linux FS Devel, linux-block@vger.kernel.org,
	linux-nvme@lists.infradead.org
On 01/04/2017 08:24 AM, Damien Le Moal wrote:
> 
> Slava,
> 
> On 1/4/17 11:59, Slava Dubeyko wrote:
>> What's the goal of SMR compatibility? Any unification or interface
>> abstraction has the goal to hide the peculiarities of underlying
>> hardware. But we have block device abstraction that hides all 
>> hardware's peculiarities perfectly. Also FTL (or any other
>> Translation Layer) is able to represent the device as sequence of
>> physical sectors without real knowledge on software side about 
>> sophisticated management activity on the device side. And, finally,
>> guys will be completely happy to use the regular file systems (ext4,
>> xfs) without necessity to modify software stack. But I believe that 
>> the goal of Open-channel SSD approach is completely opposite. Namely,
>> provide the opportunity for software side (file system, for example)
>> to manage the Open-channel SSD device with smarter policy.
> 
> The Zoned Block Device API is part of the block layer. So as such, it
> does abstract many aspects of the device characteristics, as so many
> other API of the block layer do (look at blkdev_issue_discard or zeroout
> implementations to see how far this can be pushed).
> 
> Regarding the use of open channel SSDs, I think you are absolutely
> correct: (1) some users may be very happy to use a regular, unmodified
> ext4 or xfs on top of an open channel SSD, as long as the FTL
> implementation does a complete abstraction of the device special
> features and presents a regular block device to upper layers. And
> conversely, (2) some file system implementations may prefer to directly
> use those special features and characteristics of open channel SSDs. No
> arguing with this.
> 
> But you are missing the parallel with SMR. For SMR, or more correctly
> zoned block devices since the ZBC or ZAC standards can equally apply to
> HDDs and SSDs, 3 models exists: drive-managed, host-aware and host-managed.
> 
> Case (1) above corresponds *exactly* to the drive managed model, with
> the difference that the abstraction of the device characteristics (SMR
> here) is in the drive FW and not in a host-level FTL implementation as
> it would be for open channel SSDs. Case (2) above corresponds to the
> host-managed model, that is, the device user has to deal with the device
> characteristics itself and use it correctly. The host-aware model lies
> in between these 2 extremes: it offers the possibility of complete
> abstraction by default, but also allows a user to optimize its operation
> for the device by allowing access to the device characteristics. So this
> would correspond to a possible third way of implementing an FTL for open
> channel SSDs.
> 
>> So, my key worry that the trying to hide under the same interface the
>> two different technologies (SMR and NAND flash) will be resulted in
>> the loss of opportunity to manage the device in more smarter way.
>> Because any unification has the goal to create a simple interface.
>> But SMR and NAND flash are significantly different technologies. And
>> if somebody creates technology-oriented file system, for example,
>> then it needs to have access to really special features of the
>> technology. Otherwise, interface will be overloaded by features of
>> both technologies and it will looks like as a mess.
> 
> I do not think so, as long as the device "model" is exposed to the user
> as the zoned block device interface does. This allows a user to adjust
> its operation depending on the device. This is true of course as long as
> each "model" has a clearly defined set of features associated. Again,
> that is the case for zoned block devices and an example of how this can
> be used is now in f2fs (which allows different operation modes for
> host-aware devices, but only one for host-managed devices). Again, I can
> see a clear parallel with open channel SSDs here.
> 
>> SMR zone and NAND flash erase block look comparable but, finally, it
>> significantly different stuff. Usually, SMR zone has 265 MB in size
>> but NAND flash erase block can vary from 512 KB to 8 MB (it will be
>> slightly larger in the future but not more than 32 MB, I suppose). It
>> is possible to group several erase blocks into aggregated entity but
>> it could be not very good policy from file system point of view.
> 
> Why not? For f2fs, the 2MB segments are grouped together into sections
> with a size matching the device zone size. That works well and can
> actually even reduce the garbage collection overhead in some cases.
> Nothing in the kernel zoned block device support limits the zone size to
> a particular minimum or maximum. The only direct implication of the zone
> size on the block I/O stack is that BIOs and requests cannot cross zone
> boundaries. In an extreme setup, a zone size of 4KB would work too and
> result in read/write commands of 4KB at most to the device.
> 
>> Another point that QLC device could have more tricky features of
>> erase blocks management. Also we should apply erase operation on NAND
>> flash erase block but it is not mandatory for the case of SMR zone.
> 
> Incorrect: host-managed devices require a zone "reset" (equivalent to
> discard/trim) to be reused after being written once. So again, the
> "tricky features" you mention will depend on the device "model",
> whatever this ends up to be for an open channel SSD.
> 
>> Because SMR zone could be simply re-written in sequential order if
>> all zone's data is invalid, for example. Also conventional zone could
>> be really tricky point. Because it is one zone only for the whole
>> device that could be updated in-place. Raw NAND flash, usually,
>> hasn't likewise conventional zone.
> 
> Conventional zones are optional in zoned block devices. There may be
> none at all and an implementation may well decide to not support a
> device without any conventional zones if some are required.
> In the case of open channel SSDs, the FTL implementation may well decide
> to expose a particular range of LBAs as "conventional zones" and have a
> lower level exposure for the remaining capacity whcih can then be
> optimally used by the file system based on the features available for
> that remaining LBA range. Again, a parallel is possible with SMR.
> 
>> Finally, if I really like to develop SMR- or NAND flash oriented file
>> system then I would like to play with peculiarities of concrete
>> technologies. And any unified interface will destroy the opportunity 
>> to create the really efficient solution. Finally, if my software
>> solution is unable to provide some fancy and efficient features then
>> guys will prefer to use the regular stack (ext4, xfs + block layer).
> 
> Not necessarily. Again think in terms of device "model" and associated
> feature set. An FS implementation may decide to support all possible
> models, with likely a resulting incredible complexity. More likely,
> similarly with what is happening with SMR, only models that make sense
> will be supported by FS implementation that can be easily modified.
> Example again here of f2fs: changes to support SMR were rather simple,
> whereas the initial effort to support SMR with ext4 was pretty much
> abandoned as it was too complex to integrate in the existing code while
> keeping the existing on-disk format.
> 
> Your argument above is actually making the same point: you want your
> implementation to use the device features directly. That is, your
> implementation wants a "host-managed" like device model. Using ext4 will
> require a "host-aware" or "drive-managed" model, which could be provided
> through a different FTL or device-mapper implementation in the case of
> open channel SSDs.
> 
> I am not trying to argue that open channel SSDs and zoned block devices
> should be supported under the exact same API. But I can definitely see
> clear parallels worth a discussion. As a first step, I would suggest
> trying to try defining open channel SSDs "models" and their feature set
> and see how these fit with the existing ZBC/ZAC defined models and at
> least estimate the implications on the block I/O stack. If adding the
> new models only results in the addition of a few top level functions or
> ioctls, it may be entirely feasible to integrate the two together.
> 
Thanks Damien. I couldn't have said it better my self.
The OCSSD 1.3 specification has been made with an eye towards the SMR
interface:
 - "Identification" - Follows the same "global" size definitions, and
also supports that each zone has its own local size.
 - "Get Report" command follows a very similar structure as SMR, such
that it can sit behind the "Report Zones" interface.
 - "Erase/Prepare Block" command follows the Reset block interface.
Those should fit right in. If the layout is planar, such that the OCSSD
only exposes a set of zones, it should be able to fit right into the
framework with minor modifications.
A couple of details are added when going towards managing multiple
parallel units, which is some of the things that require a bit of
discussion.
^ permalink raw reply	[flat|nested] 27+ messages in thread
- * Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-04  7:24           ` Damien Le Moal
  2017-01-04 12:39             ` Matias Bjørling
@ 2017-01-04 16:57             ` Theodore Ts'o
  2017-01-10  1:42               ` Damien Le Moal
  2017-01-05 22:58             ` Slava Dubeyko
  2017-01-06  1:09             ` Jaegeuk Kim
  3 siblings, 1 reply; 27+ messages in thread
From: Theodore Ts'o @ 2017-01-04 16:57 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Slava Dubeyko, Matias Bjørling, Viacheslav Dubeyko,
	lsf-pc@lists.linux-foundation.org, Linux FS Devel,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
I agree with Damien, but I'd also add that in the future there may
very well be some new Zone types added to the ZBC model.  So we
shouldn't assume that the ZBC model is a fixed one.  And who knows?
Perhaps T10 standards body will come up with a simpler model for
interfacing with SCSI/SATA-attached SSD's that might leverage the ZBC
model --- or not.
Either way, that's not really relevant as far as the Linux block layer
is concerned, since the Linux block layer is designed to be an
abstraction on top of hardware --- and in some cases we can use a
similar abstraction on top of eMMC's, SCSI's, and SATA's
implementation definition of TRIM/DISCARD/WRITE SAME/SECURE
TRIM/QUEUED TRIM, even though they are different in some subtle ways,
and may have different performance characteristics and semantics.
The trick is to expose similarities where the differences won't matter
to the upper layers, but also to expose the fine distinctions and
allow the file system and/or user space to use the protocol-specific
differences when it matters to them.
Designing that is going to be important, and I can guarantee we won't
get it right at first.  Which is why it's a good thing that internal
kernel interfaces aren't cast into concrete, and can be subject to
change as new revisions to ZBC, or new interfaces (like perhaps
OCSSD's) get promulgated by various standards bodies or by various
vendors.
> > Another point that QLC device could have more tricky features of
> > erase blocks management. Also we should apply erase operation on NAND
> > flash erase block but it is not mandatory for the case of SMR zone.
> 
> Incorrect: host-managed devices require a zone "reset" (equivalent to
> discard/trim) to be reused after being written once. So again, the
> "tricky features" you mention will depend on the device "model",
> whatever this ends up to be for an open channel SSD.
... and this is exposed by having different zone types (sequential
write required vs sequential write preferred vs conventional).  And if
OCSSD's "zones" don't fit into the current ZBC zone types, we can
easily add new ones.  I would suggest however, that we explicitly
disclaim that the block device layer's code points for zone types is
an exact match with the ZBC zone types numbering, precisely so we can
add new zone types that correspond to abstractions from different
hardware types, such as OCSSD.
> Not necessarily. Again think in terms of device "model" and associated
> feature set. An FS implementation may decide to support all possible
> models, with likely a resulting incredible complexity. More likely,
> similarly with what is happening with SMR, only models that make sense
> will be supported by FS implementation that can be easily modified.
> Example again here of f2fs: changes to support SMR were rather simple,
> whereas the initial effort to support SMR with ext4 was pretty much
> abandoned as it was too complex to integrate in the existing code while
> keeping the existing on-disk format.
I'll note that Abutalib Aghayev and I will be presenting a paper at
the 2017 FAST conference detailing a way to optimize ext4 for
Host-Aware SMR drives by making a surprisingly small set of changes to
ext4's journalling layer, with some very promising performance
improvements for certain workloads, which we tested on both Seagate
and WD HA drives and achieved 2x performance improvements.  Patches
are on the unstable portion of the ext4 patch queue, and I hope to get
them into an upstream acceptable shape (as opposed to "good enough for
a research paper") in the next few months.
So it may very well be that small changes can be made to file systems
to support exotic devices if there are ways that we can expose the
right information about underlying storage devices, and offering the
right abstractions to enable the right kind of minimal I/O tagging, or
hints, or commands as necessary such that the changes we do need to
make to the file system can be kept small, and kept easily testable
even if hardware is not available.
For example, by creating device mapper emulators of the feature sets
of these advanced storage interfaces that are exposed via the block
layer abstractions, whether it be for ZBC zones, or hardware
encryption acceleration, etc.
Cheers,
					- Ted
^ permalink raw reply	[flat|nested] 27+ messages in thread 
- * Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-04 16:57             ` Theodore Ts'o
@ 2017-01-10  1:42               ` Damien Le Moal
  2017-01-10  4:24                 ` Theodore Ts'o
  0 siblings, 1 reply; 27+ messages in thread
From: Damien Le Moal @ 2017-01-10  1:42 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Slava Dubeyko, Matias Bjørling, Viacheslav Dubeyko,
	lsf-pc@lists.linux-foundation.org, Linux FS Devel,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
Ted,
On 1/5/17 01:57, Theodore Ts'o wrote:
> I agree with Damien, but I'd also add that in the future there may
> very well be some new Zone types added to the ZBC model.  So we
> shouldn't assume that the ZBC model is a fixed one.  And who knows?
> Perhaps T10 standards body will come up with a simpler model for
> interfacing with SCSI/SATA-attached SSD's that might leverage the ZBC
> model --- or not.
Totally agree. There is already some activity in T10 for a ZBC V2
standard which indeed may include new zone types (for instance a
"circular buffer zone type that can be sequentially rewritten without a
reset, preserving previously written data for reads after the write
pointer). Such type of zone could be a perfect match for an FS journal
log space for instance.
> Either way, that's not really relevant as far as the Linux block layer
> is concerned, since the Linux block layer is designed to be an
> abstraction on top of hardware --- and in some cases we can use a
> similar abstraction on top of eMMC's, SCSI's, and SATA's
> implementation definition of TRIM/DISCARD/WRITE SAME/SECURE
> TRIM/QUEUED TRIM, even though they are different in some subtle ways,
> and may have different performance characteristics and semantics.
> 
> The trick is to expose similarities where the differences won't matter
> to the upper layers, but also to expose the fine distinctions and
> allow the file system and/or user space to use the protocol-specific
> differences when it matters to them.
Absolutely. The initial zoned block device support was written to match
what ZBC/ZAC defines. It was simple this way and there was no other
users of the zone concept. But the device models and zone types are just
numerical values reported to the device user. The block I/O stack
currently does not use these values beyond the device initialization. It
is up to the users (e.g. FS) of the device to determine what to do to
correctly use the device according to the types reported. So this basic
design is definitely extensible to new zone types and device models.
> 
> Designing that is going to be important, and I can guarantee we won't
> get it right at first.  Which is why it's a good thing that internal
> kernel interfaces aren't cast into concrete, and can be subject to
> change as new revisions to ZBC, or new interfaces (like perhaps
> OCSSD's) get promulgated by various standards bodies or by various
> vendors.
Indeed. The ZBC case was simple as we matched the standard defined
models. Whihc in any case is not really used in any way directly by the
block I/O stack itself. Only upper layers use that.
In the case of ACSSD, this adds one hardware-defined model set by the
standard, plus a potential collection of software defined models through
different FTL implementations on the host. Getting these models and
there API right will be indeed tricky. In a first step, providing a
ZBC-like host-aware model and a host-managed model may be a good idea as
upper layer code already ready for ZBC disks will work out-of-the-box
for OCSSDs too. From there, I can see a lot of possibilities for more
SSD optimized models though.
>>> Another point that QLC device could have more tricky features of
>>> erase blocks management. Also we should apply erase operation on NAND
>>> flash erase block but it is not mandatory for the case of SMR zone.
>>
>> Incorrect: host-managed devices require a zone "reset" (equivalent to
>> discard/trim) to be reused after being written once. So again, the
>> "tricky features" you mention will depend on the device "model",
>> whatever this ends up to be for an open channel SSD.
> 
> ... and this is exposed by having different zone types (sequential
> write required vs sequential write preferred vs conventional).  And if
> OCSSD's "zones" don't fit into the current ZBC zone types, we can
> easily add new ones.  I would suggest however, that we explicitly
> disclaim that the block device layer's code points for zone types is
> an exact match with the ZBC zone types numbering, precisely so we can
> add new zone types that correspond to abstractions from different
> hardware types, such as OCSSD.
The struct blk_zone type is 64B in size but only currently uses 32B. So
there is room for new fields, and existing fields can have newly defined
values too as the ZBC standard uses only few of the possible values in
the structure fields.
>> Not necessarily. Again think in terms of device "model" and associated
>> feature set. An FS implementation may decide to support all possible
>> models, with likely a resulting incredible complexity. More likely,
>> similarly with what is happening with SMR, only models that make sense
>> will be supported by FS implementation that can be easily modified.
>> Example again here of f2fs: changes to support SMR were rather simple,
>> whereas the initial effort to support SMR with ext4 was pretty much
>> abandoned as it was too complex to integrate in the existing code while
>> keeping the existing on-disk format.
> 
> I'll note that Abutalib Aghayev and I will be presenting a paper at
> the 2017 FAST conference detailing a way to optimize ext4 for
> Host-Aware SMR drives by making a surprisingly small set of changes to
> ext4's journalling layer, with some very promising performance
> improvements for certain workloads, which we tested on both Seagate
> and WD HA drives and achieved 2x performance improvements.  Patches
> are on the unstable portion of the ext4 patch queue, and I hope to get
> them into an upstream acceptable shape (as opposed to "good enough for
> a research paper") in the next few months.
Thank you for the information. I will check this out. Is it the
optimization that aggressively delay meta-data update by allowing
reading of meta-data blocks directly from the journal (for blocks that
are not yet updated in place) ?
> So it may very well be that small changes can be made to file systems
> to support exotic devices if there are ways that we can expose the
> right information about underlying storage devices, and offering the
> right abstractions to enable the right kind of minimal I/O tagging, or
> hints, or commands as necessary such that the changes we do need to
> make to the file system can be kept small, and kept easily testable
> even if hardware is not available.
> 
> For example, by creating device mapper emulators of the feature sets
> of these advanced storage interfaces that are exposed via the block
> layer abstractions, whether it be for ZBC zones, or hardware
> encryption acceleration, etc.
Emulators may indeed be very useful for development. But we could also
go further and implement the different models using device mappers too.
Doing so, the same device could be used with different FTL through the
same DM interface. And this may also simplify the implementation of
complex models using DM stacking (e.g. the host-aware model can be
implemented on top of a host-managed model).
Best regards.
-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
Damien.LeMoal@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com
^ permalink raw reply	[flat|nested] 27+ messages in thread 
- * Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-10  1:42               ` Damien Le Moal
@ 2017-01-10  4:24                 ` Theodore Ts'o
  2017-01-10 13:06                   ` Matias Bjorling
  0 siblings, 1 reply; 27+ messages in thread
From: Theodore Ts'o @ 2017-01-10  4:24 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Slava Dubeyko, Matias Bjørling, Viacheslav Dubeyko,
	lsf-pc@lists.linux-foundation.org, Linux FS Devel,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
On Tue, Jan 10, 2017 at 10:42:45AM +0900, Damien Le Moal wrote:
> Thank you for the information. I will check this out. Is it the
> optimization that aggressively delay meta-data update by allowing
> reading of meta-data blocks directly from the journal (for blocks that
> are not yet updated in place) ?
Essentially, yes.  In some cases, the metadata might never be written
back to its permanent location in disk.  Instead, we might copy a
metadata block from the tail of the journal to the head, if there is
enough space.  So effectively, it turns the journal into a log
structured store for metadata blocks.  Over time, if metadata blocks
become cold we can evict them from the journal to make room for more
frequently updated metadata blocks (given the currently running
workload).  Eliminating the random 4k updates to the allocation
bitmaps, inode table blocks, etc., really helps with host-aware SMR
drives, without requiring the massive amount of changes needed to make
a file system compatible with the host-managed model.
When you consider that for most files, they are never updated in
place, but are usually just replaced, I suspect that with some
additional adjustments to a traditional file system's block allocator,
we can get even further wins.  But that's for future work...
> Emulators may indeed be very useful for development. But we could also
> go further and implement the different models using device mappers too.
> Doing so, the same device could be used with different FTL through the
> same DM interface. And this may also simplify the implementation of
> complex models using DM stacking (e.g. the host-aware model can be
> implemented on top of a host-managed model).
Yes, indeed.  That would also allow people to experiment with how much
benefit can be derived if we were to give additional side channel
information to STL / FTL's --- since it's much easier to adjust kernel
code than to go through a negotiations with a HDD/SSD vendor to make
firmware changes!
This may be an area where if we can create the right framework, and
fund some research work, we might be able to get some researchers and
their graduate students interested in doing some work in figuring out
what sort of divisions of responsibilities and hints back and forth
between the storage device and host have the most benefit.
Cheers,
						- Ted
^ permalink raw reply	[flat|nested] 27+ messages in thread 
- * Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-10  4:24                 ` Theodore Ts'o
@ 2017-01-10 13:06                   ` Matias Bjorling
  2017-01-11  4:07                     ` Damien Le Moal
  0 siblings, 1 reply; 27+ messages in thread
From: Matias Bjorling @ 2017-01-10 13:06 UTC (permalink / raw)
  To: Theodore Ts'o, Damien Le Moal
  Cc: Slava Dubeyko, Viacheslav Dubeyko,
	lsf-pc@lists.linux-foundation.org, Linux FS Devel,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
On 01/10/2017 05:24 AM, Theodore Ts'o wrote:
> This may be an area where if we can create the right framework, and
> fund some research work, we might be able to get some researchers and
> their graduate students interested in doing some work in figuring out
> what sort of divisions of responsibilities and hints back and forth
> between the storage device and host have the most benefit.
> 
That is a good idea. There is a couple of papers at FAST with
Open-Channel SSDs this year.  They look into the interface and various
ways to reduce latency fluctuations.
One thing I've heard a couple of times is the feature to move the GC
read/write process into the firmware. Enabling the host to offload GC
data movement, while the keeping the host in control. Would this be
beneficial for SMR?
-Matias
^ permalink raw reply	[flat|nested] 27+ messages in thread 
- * Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-10 13:06                   ` Matias Bjorling
@ 2017-01-11  4:07                     ` Damien Le Moal
  2017-01-11  6:06                       ` Matias Bjorling
  2017-01-11  7:49                       ` Hannes Reinecke
  0 siblings, 2 replies; 27+ messages in thread
From: Damien Le Moal @ 2017-01-11  4:07 UTC (permalink / raw)
  To: Matias Bjorling, Theodore Ts'o
  Cc: Slava Dubeyko, Viacheslav Dubeyko,
	lsf-pc@lists.linux-foundation.org, Linux FS Devel,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
Matias,
On 1/10/17 22:06, Matias Bjorling wrote:
> On 01/10/2017 05:24 AM, Theodore Ts'o wrote:
>> This may be an area where if we can create the right framework, and
>> fund some research work, we might be able to get some researchers and
>> their graduate students interested in doing some work in figuring out
>> what sort of divisions of responsibilities and hints back and forth
>> between the storage device and host have the most benefit.
>>
> 
> That is a good idea. There is a couple of papers at FAST with
> Open-Channel SSDs this year.  They look into the interface and various
> ways to reduce latency fluctuations.
> 
> One thing I've heard a couple of times is the feature to move the GC
> read/write process into the firmware. Enabling the host to offload GC
> data movement, while the keeping the host in control. Would this be
> beneficial for SMR?
Host-aware SMR drives already have GC internally implemented (for cases
when the host does not write sequentially). Host-managed drives do not.
As for moving an application specific GC code into the device, well,
code injection in the storage device is not for tomorrow, and likely not
ever.
There are however other clever ways to reduce GC related host overhead
with basic commands. For SCSI, these may be WRITE SCATTERED, EXTENDED
COPY, and some others can greatly improve overhead over a simple
read+write loop. A better approach to GC offload may not be a "GC"
command, but something more generic for moving around LBAs internally
within the device. That is, if existing commands are not satisfactory.
Best.
-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
Damien.LeMoal@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com
^ permalink raw reply	[flat|nested] 27+ messages in thread 
- * Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-11  4:07                     ` Damien Le Moal
@ 2017-01-11  6:06                       ` Matias Bjorling
  2017-01-11  7:49                       ` Hannes Reinecke
  1 sibling, 0 replies; 27+ messages in thread
From: Matias Bjorling @ 2017-01-11  6:06 UTC (permalink / raw)
  To: Damien Le Moal, Theodore Ts'o
  Cc: Slava Dubeyko, Viacheslav Dubeyko,
	lsf-pc@lists.linux-foundation.org, Linux FS Devel,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
On 01/11/2017 05:07 AM, Damien Le Moal wrote:
> 
> Matias,
> 
> On 1/10/17 22:06, Matias Bjorling wrote:
>> On 01/10/2017 05:24 AM, Theodore Ts'o wrote:
>>> This may be an area where if we can create the right framework, and
>>> fund some research work, we might be able to get some researchers and
>>> their graduate students interested in doing some work in figuring out
>>> what sort of divisions of responsibilities and hints back and forth
>>> between the storage device and host have the most benefit.
>>>
>>
>> That is a good idea. There is a couple of papers at FAST with
>> Open-Channel SSDs this year.  They look into the interface and various
>> ways to reduce latency fluctuations.
>>
>> One thing I've heard a couple of times is the feature to move the GC
>> read/write process into the firmware. Enabling the host to offload GC
>> data movement, while the keeping the host in control. Would this be
>> beneficial for SMR?
> 
> Host-aware SMR drives already have GC internally implemented (for cases
> when the host does not write sequentially). Host-managed drives do not.
> As for moving an application specific GC code into the device, well,
> code injection in the storage device is not for tomorrow, and likely not
> ever.
> 
> There are however other clever ways to reduce GC related host overhead
> with basic commands. For SCSI, these may be WRITE SCATTERED, EXTENDED
> COPY, and some others can greatly improve overhead over a simple
> read+write loop. A better approach to GC offload may not be a "GC"
> command, but something more generic for moving around LBAs internally
> within the device. That is, if existing commands are not satisfactory.
Hi Damien,
You're right. I was thinking of something similar to scattered
read/write to move data from one place to another. There is no
sector-granularity mapping table maintained by the OCSSD, which leaves
the logic up to the host.
Let me know if you decide to kick of a standardized interface for code
injection. Such an interface is long overdue. ;)
^ permalink raw reply	[flat|nested] 27+ messages in thread 
- * Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-11  4:07                     ` Damien Le Moal
  2017-01-11  6:06                       ` Matias Bjorling
@ 2017-01-11  7:49                       ` Hannes Reinecke
  1 sibling, 0 replies; 27+ messages in thread
From: Hannes Reinecke @ 2017-01-11  7:49 UTC (permalink / raw)
  To: Damien Le Moal, Matias Bjorling, Theodore Ts'o
  Cc: Slava Dubeyko, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, Viacheslav Dubeyko, Linux FS Devel,
	lsf-pc@lists.linux-foundation.org
On 01/11/2017 05:07 AM, Damien Le Moal wrote:
> 
> Matias,
> 
> On 1/10/17 22:06, Matias Bjorling wrote:
>> On 01/10/2017 05:24 AM, Theodore Ts'o wrote:
>>> This may be an area where if we can create the right framework, and
>>> fund some research work, we might be able to get some researchers and
>>> their graduate students interested in doing some work in figuring out
>>> what sort of divisions of responsibilities and hints back and forth
>>> between the storage device and host have the most benefit.
>>>
>>
>> That is a good idea. There is a couple of papers at FAST with
>> Open-Channel SSDs this year.  They look into the interface and various
>> ways to reduce latency fluctuations.
>>
>> One thing I've heard a couple of times is the feature to move the GC
>> read/write process into the firmware. Enabling the host to offload GC
>> data movement, while the keeping the host in control. Would this be
>> beneficial for SMR?
> 
> Host-aware SMR drives already have GC internally implemented (for cases
> when the host does not write sequentially). Host-managed drives do not.
> As for moving an application specific GC code into the device, well,
> code injection in the storage device is not for tomorrow, and likely not
> ever.
> 
> There are however other clever ways to reduce GC related host overhead
> with basic commands. For SCSI, these may be WRITE SCATTERED, EXTENDED
> COPY, and some others can greatly improve overhead over a simple
> read+write loop. A better approach to GC offload may not be a "GC"
> command, but something more generic for moving around LBAs internally
> within the device. That is, if existing commands are not satisfactory.
> 
Logical head depop rears its head again...
But yes, I think it's more sensible to have I/O functions which help GC
(like UNMAP) instead of influencing the GC itself.
Anyway. Given the length of this thread I guess this is a worthy topic
for LSF.
Cheers,
Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N�rnberg
GF: F. Imend�rffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG N�rnberg)
^ permalink raw reply	[flat|nested] 27+ messages in thread 
 
 
 
 
 
- * RE: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-04  7:24           ` Damien Le Moal
  2017-01-04 12:39             ` Matias Bjørling
  2017-01-04 16:57             ` Theodore Ts'o
@ 2017-01-05 22:58             ` Slava Dubeyko
  2017-01-06  1:11               ` Theodore Ts'o
  2017-01-06 13:05               ` Matias Bjørling
  2017-01-06  1:09             ` Jaegeuk Kim
  3 siblings, 2 replies; 27+ messages in thread
From: Slava Dubeyko @ 2017-01-05 22:58 UTC (permalink / raw)
  To: Damien Le Moal, Matias Bjørling, Viacheslav Dubeyko,
	lsf-pc@lists.linux-foundation.org, Theodore Ts'o
  Cc: Linux FS Devel, linux-block@vger.kernel.org,
	linux-nvme@lists.infradead.org
-----Original Message-----
From: Damien Le Moal 
Sent: Tuesday, January 3, 2017 11:25 PM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>; Matias Bjørling <m@bjorling.me>; Viacheslav Dubeyko <slava@dubeyko.com>; lsf-pc@lists.linux-foundation.org
Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>; linux-block@vger.kernel.org; linux-nvme@lists.infradead.org
Subject: Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
<skipped>
> But you are missing the parallel with SMR. For SMR, or more correctly zoned
> block devices since the ZBC or ZAC standards can equally apply to HDDs and SSDs,
> 3 models exists: drive-managed, host-aware and host-managed.
> Case (1) above corresponds *exactly* to the drive managed model, with
> the difference that the abstraction of the device characteristics (SMR
> here) is in the drive FW and not in a host-level FTL implementation
> as it would be for open channel SSDs. Case (2) above corresponds to the host-managed
> model, that is, the device user has to deal with the device characteristics
> itself and use it correctly. The host-aware model lies in between these 2 extremes:
> it offers the possibility of complete abstraction by default, but also allows a user
> to optimize its operation for the device by allowing access to the device characteristics.
> So this would correspond to a possible third way of implementing an FTL for open channel SSDs.
I see your point. And  I think that, historically, we need to distinguish 4 cases for the
case of NAND flash:
(1) drive-managed: regular file systems (ext4, xfs and so on);
(2) host-aware: flash-friendly file systems (NILFS2, F2FS and so on);
(3) host-managed: <file systems under implementation>;
(4) old-fashioned flash-oriented file systems for raw NAND (jffs, yaffs, ubifs and so on).
But, frankly speaking, even regular file systems are slightly flash-aware today because of
blkdev_issue_discard (TRIM) or REQ_META flag. So, the next really important question is:
what can/should be exposed for the host-managed and host-aware cases? What's principal
difference between these models? And, finally, the difference is not so clear.
Let's start from error corrections. Only flash-oriented file systems take care about
error corrections. But I assume that drive-managed, host-aware and host-managed cases
expect hardware-based error correction. So, we can treat our logical page/block as ideal
byte stream that always contains valid data. So, we have no difference and no contradiction
here.
Next point is read disturbance. If BER of physical page/block achieves some threshold then
we need to move data from one page/block into another one. What subsystem will be
responsible for this activity? The drive-managed case expects that device's GC will manage
read disturbance issue. But what's about host-aware or host-managed case? If the host side
hasn't information about BER then the host's software is unable to manage this issue. Finally,
it sounds that we will have GC subsystem as on file system side as on device side. As a result,
it means possible unpredictable performance degradation and decreasing device lifetime.
Let's imagine that host-aware case could be unaware about read disturbance management.
But how host-managed case can manage this issue?
Bad block management... So, drive-managed and host-aware cases should be completely unaware
about  bad blocks. But what's about host-managed case? If a device will hide bad blocks from
the host then it means mapping table presence, access to logical pages/blocks and so on. If the host
hasn't access to the bad block management then it's not host-managed model. And it sounds as
completely unmanageable situation for the host-managed model. Because if the host has access
to bad block management (but how?) then we have really simple model. Otherwise, the host
has access to logical pages/blocks only and device should have internal GC. As a result,
it means possible unpredictable performance degradation and decreasing device lifetime because
of competition of GC on device side and GC on the host side.
Wear leveling... Device will be responsible to manage wear-leveling for the case of device-managed
and host-aware models. It looks like that the host side should be responsible to manage wear-leveling
for the host-managed case. But it means that the host should manage bad blocks and to have direct
access to physical pages/blocks. Otherwise, physical erase blocks will be hidden by device's indirection
layer and wear-leveling management will be unavailable on the host side. As a result, device will have
internal GC and the traditional issues (possible unpredictable performance degradation and decreasing
device lifetime). But even if SSD provides access to all internals then how will file system be able
to implement wear-leveling or bad block management in the case of regular I/O operations? Because
block device creates LBA abstraction for us. Does it mean that software FTL on the block layer level
is able to manage SSD internals directly? And, again, file system cannot manage SSD internals directly
for the case of software FTL. And where should software FTL keep mapping table, for example?
So, F2FS and NILFS2 looks like a host-aware case because it is LFS file systems that is oriented on
regular SSDs. So, it could be desirable to have some knowledge (page size, erase block size and so on)
about SSD internals. But, mostly, such knowledge should be shared with mkfs tool during file
system volume creation. The rest looks like as not very promising and not very different with
device-managed model. Because even if F2FS and NILFS2 has GC subsystem and mostly looks like
as LFS case (F2FS has in-place updated area; NILFS2 has in-place updated superblocks in the begin/end
of the volume), anyway, both these file systems completely rely on device indirection layer and
GC subsystem. We are still in the same hell of GCs competition. So, what's the point of host-aware
model?
So, I am not completely convinced that, finally, we will have really distinctive features for the
case of device-managed, host-aware and host-managed model. Also I have many question about
host-managed model if we will use block device abstraction. How can direct management of
SSD internals be organized for the case of host-managed model is hidden under block device
abstraction?
Another interesting question... Let's imagine that we create file system volume for one device
geometry. It means that geometry details will be stored in the file system metadata during volume
creation for the case host-aware or host-managed case. Then we backups this volume and restore
the volume on device with completely different geometry. So, what will we have for such case?
Performance degradation? Or will we kill the device?
> The open-channel SSD interface is very 
> similar to the one exposed by SMR hard-drives. They both have a set of 
> chunks (zones) exposed, and zones are managed using open/close logic. 
> The main difference on open-channel SSDs is that it additionally exposes 
> multiple sets of zones through a hierarchical interface, which covers a 
> numbers levels (X channels, Y LUNs per channel, Z zones per LUN).
I would like to have access channels/LUNs/zones on file system level.
If, for example, LUN will be associated with partition then it means
that it will need to aggregate several partitions inside of one volume.
First of all, not every file system is ready for the aggregation several
partitions inside of the one volume. Secondly, what's about aggregation
several physical devices inside of one volume? It looks like as slightly
tricky to distinguish partitions of the same device and different devices
on file system level. Isn't it?
> I agree with Damien, but I'd also add that in the future there may very
> well be some new Zone types added to the ZBC model. 
> So we shouldn't assume that the ZBC model is a fixed one.  And who knows?
> Perhaps T10 standards body will come up with a simpler model for
> interfacing with SCSI/SATA-attached SSD's that might leverage the ZBC model --- or not.
Different zone types is good. But maybe LUN will be the better place
for distinguishing the different zone types. Because if zone can have the type
then it's possible to imagine any combinations of zones. But mostly
zone of some type will be inside of some contiguous area (inside of NAND
die, for example). So, LUN looks like as NAND die representation.
>> SMR zone and NAND flash erase block look comparable but, finally, it 
>> significantly different stuff. Usually, SMR zone has 265 MB in size 
>> but NAND flash erase block can vary from 512 KB to 8 MB (it will be 
>> slightly larger in the future but not more than 32 MB, I suppose). It 
>> is possible to group several erase blocks into aggregated entity but 
>> it could be not very good policy from file system point of view.
>
> Why not? For f2fs, the 2MB segments are grouped together into sections
> with a size matching the device zone size. That works well and can actually
> even reduce the garbage collection overhead in some cases.
> Nothing in the kernel zoned block device support limits the zone size
> to a particular minimum or maximum. The only direct implication of the zone
> size on the block I/O stack is that BIOs and requests cannot cross zone
> boundaries. In an extreme setup, a zone size of 4KB would work too
> and result in read/write commands of 4KB at most to the device.
The situation with grouping of segments into sections for the case of F2FS
is not so simple. First of all, you need to fill such aggregation with data.
F2FS distinguish several types of segments and it means that current
segment/section will be larger. If you mix different types of segments into
one section (but I believe that F2FS doesn't provide opportunity to do this)
then GC overhead could be larger, I suppose. Otherwise, the using one section
for one segment type means that the current section with greater size than
segment (2MB) will be resulted in changing the speed of filling sections with
different type of data. As a result, it will change dramatically the distribution
of different type of sections on file system volume. Does it reduce GC overhead?
I am not sure. And if file system's segment should be equal to zone size
(for example, NILFS2 case) then it could mean that you need to prepare the
whole segment before real flush. And if you will need to process O_DIRECT
or synchronous mount case then, most probably, you will need to flush the
segment with huge hole. I suppose that it could significantly decrease file system's
free space, increase GC activity and decrease device lifetime.
>> Another point that QLC device could have more tricky features of erase 
>> blocks management. Also we should apply erase operation on NAND flash 
>> erase block but it is not mandatory for the case of SMR zone.
>
> Incorrect: host-managed devices require a zone "reset" (equivalent to
> discard/trim) to be reused after being written once. So again, the
> "tricky features" you mention will depend on the device "model",
> whatever this ends up to be for an open channel SSD.
OK. But I assume that SMR zone "reset" is significantly cheaper than
NAND flash block erase operation. And you can fill your SMR zone with
data then "reset" it and to fill again with data without significant penalty.
Also, TRIM and zone "reset" are different, I suppose. Because, TRIM looks
like as a hint for SSD controller. If SSD controller receives TRIM for some
erase block then it doesn't mean  that erase operation will be done
immediately. Usually, it should be done in the background because real
erase operation is expensive operation. 
Thanks,
Vyacheslav Dubeyko.
^ permalink raw reply	[flat|nested] 27+ messages in thread 
- * Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-05 22:58             ` Slava Dubeyko
@ 2017-01-06  1:11               ` Theodore Ts'o
  2017-01-06 12:51                 ` Matias Bjørling
  2017-01-09  6:49                 ` Slava Dubeyko
  2017-01-06 13:05               ` Matias Bjørling
  1 sibling, 2 replies; 27+ messages in thread
From: Theodore Ts'o @ 2017-01-06  1:11 UTC (permalink / raw)
  To: Slava Dubeyko
  Cc: Damien Le Moal, Matias Bjørling, Viacheslav Dubeyko,
	lsf-pc@lists.linux-foundation.org, Linux FS Devel,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
On Thu, Jan 05, 2017 at 10:58:57PM +0000, Slava Dubeyko wrote:
> 
> Next point is read disturbance. If BER of physical page/block achieves some threshold then
> we need to move data from one page/block into another one. What subsystem will be
> responsible for this activity? The drive-managed case expects that device's GC will manage
> read disturbance issue. But what's about host-aware or host-managed case? If the host side
> hasn't information about BER then the host's software is unable to manage this issue. Finally,
> it sounds that we will have GC subsystem as on file system side as on device side. As a result,
> it means possible unpredictable performance degradation and decreasing device lifetime.
> Let's imagine that host-aware case could be unaware about read disturbance management.
> But how host-managed case can manage this issue?
One of the ways this could be done in the ZBC specification (assuming
that erase blocks == zones) would be set the "reset" bit in the zone
descriptor which is returned by the REPORT ZONES EXT command.  This is
a hint that the a reset write pointer should be sent to the zone in
question, and it could be set when you start seeing soft ECC errors or
the flash management layer has decided that the zone should be
rewritten in the near future.  A simple way to do this is to ask the
Host OS to copy the data to another zone and then send a reset write
pointer command for the zone.
So I think it very much could be done, and done within the framework
of the ZBC model --- although whether SSD manufactuers will chose to
do this, and/or choose to engage the T10/T13 standards committees to
add the necessary extensions to the ZBC specification is a question
that we probably can't answer in this venue or by the participants on
this thread.
> Wear leveling... Device will be responsible to manage wear-leveling for the case of device-managed
> and host-aware models. It looks like that the host side should be responsible to manage wear-leveling
> for the host-managed case. But it means that the host should manage bad blocks and to have direct
> access to physical pages/blocks. Otherwise, physical erase blocks will be hidden by device's indirection
> layer and wear-leveling management will be unavailable on the host side. As a result, device will have
> internal GC and the traditional issues (possible unpredictable performance degradation and decreasing
> device lifetime).
So I can imagine a setup where the flash translation layer manages the
mapping between zone numbers and the physical erase blocks, such that
when the host OS issues an "reset write pointer", it immediately gets
a new erase block assigned to the specific zone in question.  The
original erase block would then get erased in the background, when the
flash chip in question is available for maintenance activities.
I think you've been thinking about a model where *either* the host as
complete control over all aspects of the flash management, or the FTL
has complete control --- and it may be that there are more clever ways
that the work could be split between flash device and the host OS.
> Another interesting question... Let's imagine that we create file system volume for one device
> geometry. It means that geometry details will be stored in the file system metadata during volume
> creation for the case host-aware or host-managed case. Then we backups this volume and restore
> the volume on device with completely different geometry. So, what will we have for such case?
> Performance degradation? Or will we kill the device?
This is why I suspect that exposing the full details of the details of
the Flash layout via LUNS is a bad, bad, BAD idea.  It's much better
to use an abstraction such as Zones, and then have an abstraction
layer that hides the low-level details of the hardware from the OS.
The trick is picking an abstraction that exposes the _right_ set of
details so that the division of labor betewen the Host OS and the
storage device is at a better place.  Hence my suggestion of perhaps
providing a virtual mapping layer betewen "Zone number" and the
low-level physical erase block.
> I would like to have access channels/LUNs/zones on file system level.
> If, for example, LUN will be associated with partition then it means
> that it will need to aggregate several partitions inside of one volume.
> First of all, not every file system is ready for the aggregation several
> partitions inside of the one volume. Secondly, what's about aggregation
> several physical devices inside of one volume? It looks like as slightly
> tricky to distinguish partitions of the same device and different devices
> on file system level. Isn't it?
Yes, this is why using LUN's are a BAD idea.  There's too much code
--- in file systems, in the block layer in terms of how we expose
block devices, etc., that assumes that different LUN's are used for
different logical containers of storage.  There has been decades of
usage of this concept by enterprise storage arrays.  Trying to
appropriate LUN's for another use case is stupid.  And maybe we can't
stop OCSSD folks if they have gone down that questionable design path,
but there's nothing that says we have to expose it as a SCSI LUN
inside of Linux!
> OK. But I assume that SMR zone "reset" is significantly cheaper than
> NAND flash block erase operation. And you can fill your SMR zone with
> data then "reset" it and to fill again with data without significant penalty.
If you have virtual mapping layer between zones and erase blocks, a
reset write pointer could be fast for SSD's as well.  And that allows
the implementation of your suggestion below:
> Also, TRIM and zone "reset" are different, I suppose. Because, TRIM looks
> like as a hint for SSD controller. If SSD controller receives TRIM for some
> erase block then it doesn't mean  that erase operation will be done
> immediately. Usually, it should be done in the background because real
> erase operation is expensive operation.
Cheers,
					- Ted
^ permalink raw reply	[flat|nested] 27+ messages in thread 
- * Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-06  1:11               ` Theodore Ts'o
@ 2017-01-06 12:51                 ` Matias Bjørling
  2017-01-09  6:49                 ` Slava Dubeyko
  1 sibling, 0 replies; 27+ messages in thread
From: Matias Bjørling @ 2017-01-06 12:51 UTC (permalink / raw)
  To: Theodore Ts'o, Slava Dubeyko
  Cc: Damien Le Moal, Viacheslav Dubeyko,
	lsf-pc@lists.linux-foundation.org, Linux FS Devel,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
On 01/06/2017 02:11 AM, Theodore Ts'o wrote:
> On Thu, Jan 05, 2017 at 10:58:57PM +0000, Slava Dubeyko wrote:
>>
>> Next point is read disturbance. If BER of physical page/block achieves some threshold then
>> we need to move data from one page/block into another one. What subsystem will be
>> responsible for this activity? The drive-managed case expects that device's GC will manage
>> read disturbance issue. But what's about host-aware or host-managed case? If the host side
>> hasn't information about BER then the host's software is unable to manage this issue. Finally,
>> it sounds that we will have GC subsystem as on file system side as on device side. As a result,
>> it means possible unpredictable performance degradation and decreasing device lifetime.
>> Let's imagine that host-aware case could be unaware about read disturbance management.
>> But how host-managed case can manage this issue?
> 
> One of the ways this could be done in the ZBC specification (assuming
> that erase blocks == zones) would be set the "reset" bit in the zone
> descriptor which is returned by the REPORT ZONES EXT command.  This is
> a hint that the a reset write pointer should be sent to the zone in
> question, and it could be set when you start seeing soft ECC errors or
> the flash management layer has decided that the zone should be
> rewritten in the near future.  A simple way to do this is to ask the
> Host OS to copy the data to another zone and then send a reset write
> pointer command for the zone.
This is an interesting approach. Currently, the OCSSD interface uses
both the soft ECC mark to tell the host to rewrite, while the interface
also has an explicit method to make the host rewrite the data. E.g., in
the case where read scrubbing on the device requires the host to move
data due to durability.
Adding the information to the "Report zones" is a good idea. It enables
the device to keep a list of "zones" that should be refreshed by the
host but have yet to have it done. I will add that to the specification.
> 
> So I think it very much could be done, and done within the framework
> of the ZBC model --- although whether SSD manufactuers will chose to
> do this, and/or choose to engage the T10/T13 standards committees to
> add the necessary extensions to the ZBC specification is a question
> that we probably can't answer in this venue or by the participants on
> this thread.
> 
>> Wear leveling... Device will be responsible to manage wear-leveling for the case of device-managed
>> and host-aware models. It looks like that the host side should be responsible to manage wear-leveling
>> for the host-managed case. But it means that the host should manage bad blocks and to have direct
>> access to physical pages/blocks. Otherwise, physical erase blocks will be hidden by device's indirection
>> layer and wear-leveling management will be unavailable on the host side. As a result, device will have
>> internal GC and the traditional issues (possible unpredictable performance degradation and decreasing
>> device lifetime).
> 
> So I can imagine a setup where the flash translation layer manages the
> mapping between zone numbers and the physical erase blocks, such that
> when the host OS issues an "reset write pointer", it immediately gets
> a new erase block assigned to the specific zone in question.  The
> original erase block would then get erased in the background, when the
> flash chip in question is available for maintenance activities.
> 
> I think you've been thinking about a model where *either* the host as
> complete control over all aspects of the flash management, or the FTL
> has complete control --- and it may be that there are more clever ways
> that the work could be split between flash device and the host OS.
> 
>> Another interesting question... Let's imagine that we create file system volume for one device
>> geometry. It means that geometry details will be stored in the file system metadata during volume
>> creation for the case host-aware or host-managed case. Then we backups this volume and restore
>> the volume on device with completely different geometry. So, what will we have for such case?
>> Performance degradation? Or will we kill the device?
> 
> This is why I suspect that exposing the full details of the details of
> the Flash layout via LUNS is a bad, bad, BAD idea.  It's much better
> to use an abstraction such as Zones, and then have an abstraction
> layer that hides the low-level details of the hardware from the OS.
> The trick is picking an abstraction that exposes the _right_ set of
> details so that the division of labor betewen the Host OS and the
> storage device is at a better place.  Hence my suggestion of perhaps
> providing a virtual mapping layer betewen "Zone number" and the
> low-level physical erase block.
Agree. The first approach was taken in the first iteration of the
specification. After release we began to understand the chaos we just
brought onto our self, we moved to the zone/chunk approach in the second
iteration to simplify the interface.
> 
>> I would like to have access channels/LUNs/zones on file system level.
>> If, for example, LUN will be associated with partition then it means
>> that it will need to aggregate several partitions inside of one volume.
>> First of all, not every file system is ready for the aggregation several
>> partitions inside of the one volume. Secondly, what's about aggregation
>> several physical devices inside of one volume? It looks like as slightly
>> tricky to distinguish partitions of the same device and different devices
>> on file system level. Isn't it?
> 
> Yes, this is why using LUN's are a BAD idea.  There's too much code
> --- in file systems, in the block layer in terms of how we expose
> block devices, etc., that assumes that different LUN's are used for
> different logical containers of storage.  There has been decades of
> usage of this concept by enterprise storage arrays.  Trying to
> appropriate LUN's for another use case is stupid.  And maybe we can't
> stop OCSSD folks if they have gone down that questionable design path,
> but there's nothing that says we have to expose it as a SCSI LUN
> inside of Linux!
Heh, yes, really bad idea. The naming of "LUNs" for OCSSDs could have
been chosen better. In the future, it is being renamed to "parallel
unit". For OCSSDs, all the device's parallel units are exposed through
the same block device "LUN", which then has to be managed by the layers
above.
> 
>> OK. But I assume that SMR zone "reset" is significantly cheaper than
>> NAND flash block erase operation. And you can fill your SMR zone with
>> data then "reset" it and to fill again with data without significant penalty.
> 
> If you have virtual mapping layer between zones and erase blocks, a
> reset write pointer could be fast for SSD's as well.  And that allows
> the implementation of your suggestion below:
> 
>> Also, TRIM and zone "reset" are different, I suppose. Because, TRIM looks
>> like as a hint for SSD controller. If SSD controller receives TRIM for some
>> erase block then it doesn't mean  that erase operation will be done
>> immediately. Usually, it should be done in the background because real
>> erase operation is expensive operation.
> 
> Cheers,
> 
> 					- Ted
> 
^ permalink raw reply	[flat|nested] 27+ messages in thread 
- * RE: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-06  1:11               ` Theodore Ts'o
  2017-01-06 12:51                 ` Matias Bjørling
@ 2017-01-09  6:49                 ` Slava Dubeyko
  2017-01-09 14:55                   ` Theodore Ts'o
  1 sibling, 1 reply; 27+ messages in thread
From: Slava Dubeyko @ 2017-01-09  6:49 UTC (permalink / raw)
  To: Theodore Ts'o, Matias Bjørling
  Cc: Damien Le Moal, Viacheslav Dubeyko,
	lsf-pc@lists.linux-foundation.org, Linux FS Devel,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
-----Original Message-----
From: Theodore Ts'o [mailto:tytso@mit.edu] 
Sent: Thursday, January 5, 2017 5:12 PM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: Damien Le Moal <Damien.LeMoal@wdc.com>; Matias Bjørling <m@bjorling.me>; Viacheslav Dubeyko <slava@dubeyko.com>; lsf-pc@lists.linux-foundation.org; Linux FS Devel <linux-fsdevel@vger.kernel.org>; linux-block@vger.kernel.org; linux-nvme@lists.infradead.org
Subject: Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
<skipped>
> I think you've been thinking about a model where *either* the host as complete control
> over all aspects of the flash management, or the FTL has complete control --- and it may
> be that there are more clever ways that the work could be split between
> flash device and the host OS.
Yes, I totally agree that the better way is to split different responsibilities between the flash
device and the host (file system, for example). I would like to consider an SSD device as a set
of FTL primitives. Let's imagine the SSD like an automata that is able to execute FTL primitives
but the file system issues the commands orchestrate the SSD activity. I believe it makes sense
to think about SSD like data processing accelerator engine. It means that we need in good
interface that can be the basis for the offload of data processing operations. And I clearly see
the many cases when a file system would like to say: "Hey, SSD. Please, execute this primitive
for me right now".
Let's consider operations of moving zones (or erase blocks) with high BER.
If we have completely passive SSD then it sounds for me that all operations will look like:
(1) read data on the host side; (2) "reset" zone; (3) write data into the
SSD backwards. But if we talk about some zone (erase block(s)) with high BER is full of valid data
then why does host need to execute the whole operation in such stupid way like "read-write"?
I mean that it completely doesn't make sense to spend the host's resources for such operation.
The responsibility of the host is simply to initiate such operation in the proper time. And responsibility
of the SSD is to execute such operation internally (offload of the operation). So, here we could
have the FTL primitive of moving of zones (erase blocks) for overcoming the read disturbance.
Let's consider GC operations... Right now, we have GC subsystem on the SSD side (device-managed and
host-aware case) and we have GC subsystem on the host side (LFS file systems of host-aware case).
So, it's clear that SSD device is able to provide some primitives of GC operations. Also it's completely
unreasonable to have GC subsystem as on SSD side as on the host side. If we have GC subsystem on
the host only then we need to follow the stupid paradigm "read-modify-write" and to spend the host's
resources for GC operations. Otherwise, if GC subsystem on the SSD side then GC suffers from lack of
knowledge about valid data location (file system keeps this knowledge) and such solution provides
wide range of cases for unexpected performance degradation. So, we need in much smarter solution.
What could it be?
Again, file system (host) has to initiate the GC operation in proper time but the SSD should execute
the requested operation (offload of the operation). So, we will have the GC subsystem on file system
side but the real GC operation under zone (erase block(s)) will be executed by SSD device. The key point
here that: (1) file system choses the good time for GC operation; (2) file system is able to select a zone
(erase block(s)) that provides then cost-efficient way of GC activity from the point of view of valid data
amount in the aged zone; (3) file system shares information about valid pages in the zone (erase block(s));
(4) SSD executes GC operation under the zone internally.
We need to take into account three possible cases: (1) zone is completely invalid; (2) zone is partially
invalid; (3) zone contains valid data only. If file system's GC selects a zone that doesn't contain valid data
("invalid" zone case) then GC simply needs to request "reset" zone or send TRIM command. The rest is
responsibility of SSD device. If zone is completely filled by valid data then file system's GC needs to
request moving operation on the SSD side. If we will use a virtual zones then it means that such moving
operation on the SSD side will change nothing for the file system (logical block numbers will be the same).
So, file system doesn't need to change internal mapping table for such operation.
The case of partially invalid zone (contains some amount of valid data) is more tricky. But let's consider
the situation. If file system has knowledge about position of valid logical blocks or pages inside a zone
then the file system is able to share a zone's bitmap with SSD device. It means that if we have 4 KB
logical block and 256 MB zone then we need in 8 KB bitmap for representing positions of valid
logical blocks inside of the zone. So, file system is able to send such valid pages' bitmap with the
command of GC operation initiation for some zone. The responsibility of SSD side will be: (1) "reset"
zone; (2) move the valid logical blocks from aged zone into new ones with compaction scheme using.
I mean that all valid pages should be written in contiguous manner in the newly allocated zone
(erase blocks). Finally, it means that SSD device can reposition logical blocks inside of the zone
without changing the initial order of logical pages (compaction scheme). Such compaction scheme
can be easily implemented on the SSD side. And if we will not change the order of logical blocks
then we have deterministic case that can be easily processed on file system side. If file system has
initial bitmap then it can easily re-calculate the valid logical blocks' position after compaction scheme
using. For example, F2FS can easily do such re-calculation. Finally, new values of valid logical blocks'
position should be stored into file system's mapping table. NILFS2 is slightly more complex case.
Because, NILFS2 describes logical blocks inside of the log by means of special btree in the log's header.
So, again, compaction scheme is deterministic case that provides opportunity to re-calculate the
logical blocks' position before real GC operation. It means that NILFS2 is able to prepare as valid
logical blocks' bitmap as log's header before GC operation and to share all these stuff with SSD device.
However, every GC operation under partially invalid zone is resulted in creation of zone that will be
partially filled by valid data (the rest of zone will be completely free). What does it need to do in such
case? I can see the four possible approaches:
(1) Re-use the partially filled zone. If file system will track the state of every zone (mapping table,
for example) or it will be possible to extract the state of zone then it means that aged zone will
change the state after GC operation. So, partially filled zone can be used as current zone for
writing a new data.
(2) Add valid data of aged zone into the tail of current zone. Let's imagine that file system is using
some zone as current zone for adding a new data. If we know that an aged zone contains some
number of valid pages then it's possible to reserve the space in the tail of current zone. Finally,
it is possible to initiate combine flush operation (write data from page cache of current zone)
with GC operation under aged zone on the SSD side. 
(3) Re-use aged zone as current zone. Let's imagine that we have some aged zone with small
number of valid pages. It means that we can select this zone as current zone for a new data.
First of all, we need: (1) "reset" zone; (2) initiate GC operation on the SSD device side. We know
how many valid pages we will have in the beginning of the current zone. So, we simply needs
to add a new logical blocks into page cache of current zone after reserved area of data from
aged zone. So, our GC operation will be in the background of a new data preparation in the
page cache of current zone. And, finally, we will have the whole zone is full of data after
flush operation.
(4) Merge several aged zones into new one.
> It's much better to use an abstraction such as Zones, and then have an abstraction layer
> that hides the low-level details of the hardware from the OS.
> The trick is picking an abstraction that exposes the _right_ set of details so that the division
> of labor between the Host OS and the storage device is at a better place.  Hence my suggestion
> of perhaps providing a virtual mapping layer between "Zone number" and
> the low-level physical erase block.
I like the idea of some abstraction that hides the low-level details. But it sounds that we still
will have two mapping tables on SSD side and file system side. Again we needs in distribution
the responsibilities between the file system and SSD device. If file system will manage GC activity
but the real GC operation will be delegated on SSD side (in proper time) then it sounds that
all maintenance operations will be done by SSD itself. It means that SSD device is able to manage
only one mapping table and file system simply needs to have actual copy of the mapping table.
Or, oppositely, file system can manage only one mapping table and to share the actual state
with the SSD device. But one mapping table looks like as really complicated technique. From
another point of view, virtual zone can have the same ID always. So, the responsibility of the
SSD device will be mapping the virtual zone ID with physical erase block IDs. Such mapping
table (virtual zone ID <-> erase block(s)) can be more compact as mapping table (LBA <->
physical page). The responsibility of file system (host) will be the mapping inside of
the virtual zone (LBA <-> logical block inside the virtual zone). If the virtual zone ID will be
always the same then such mapping table could be lesser in size. But I don't see how
such mapping table can be lesser in size for the current implementation of F2FS or NILFS2. 
However, let's imagine that log will be equal to the whole zone then the header of the log
can include likewise mapping table for the log/zone.
Thanks,
Vyacheslav Dubeyko.
^ permalink raw reply	[flat|nested] 27+ messages in thread
- * Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-09  6:49                 ` Slava Dubeyko
@ 2017-01-09 14:55                   ` Theodore Ts'o
  0 siblings, 0 replies; 27+ messages in thread
From: Theodore Ts'o @ 2017-01-09 14:55 UTC (permalink / raw)
  To: Slava Dubeyko
  Cc: Matias Bjørling, Damien Le Moal, Viacheslav Dubeyko,
	lsf-pc@lists.linux-foundation.org, Linux FS Devel,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
So in the model where the Flash-side is tracking logical to physical
zone mapping, and host is merely expecting the ZBC interface, one way
it could work is as follows.
1)  The flash signals that a particular zone should be reset soon.
2)  If the host does not honor the request, eventually the flash will
    have to do a forced copy of the zone to a new erase block.  (This
    is a fail-safe and shouldn't happen under normal circumstances.)
    (By the way, this model can be used for any number of things.  For
    example, for cloud workloads where tail latency is really
    important, it would be really cool if T10/T13 adopted a way that
    the host could be notified about the potential need for ATI
    remediation in a particular disk region, so the host could
    schedule it when it would be least likely to impact high priority,
    low latency workloads.  If the host fails to give permission to
    the firmware to do the ATI remediation before the "gotta go"
    deadline is exceed, the disk would could the ATI remediation at
    that point to assure data integrity as a fail safe.)
3) The host, since it has better knowledge of which blocks belong to
    which inode, and which inodes are more likely to have identical
    object lifetimes (example, all of the .o files in a directory are
    likely to be deleted at the same time when the user runs "make
    clean"; there was a Usenix or FAST paper over a decade ago the
    pointed out that doing hueristics based on file names were likely
    to be helpful), can do a better job of distributing the blocks to
    different partially filled sequential write preferred / sequential
    write required zones.
    The idea here is that you might have multiple zones that are
    partially filled based on expected object lifetime predictions.
    Or the host could move blocks based on the knowledge that a
    particular already has blocks that will share the same fate (e.g.,
    belong to the same inode) --- this is knowledge that the FTL can
    not know, so with a sufficiently smart host file system, it ought
    to be able to do a better job than the FTL.
4) Since we assumed that the Flash is tracking logical to physial zone
   mappings, and the host is responsible for everything else, if the
   host decides to move blocks to different SMR zones, the host file
   system will be responsible for updating its existing (inode,
   logical block) to physical block (SMR zone plus offset) mapping
   tables.
The main advantage of this model is to the extent that there are
cloud/enterprise customers who are already implementing Host Aware SMR
storage solutions, they might be able to reutilize code already
written for SMR HDD's, and reuse it for this model/interface.  Yes,
some tweaks would probably be needed since the design tradeoffs for
disks and flash are very different.  But the point is that the Host
Managed and Host Aware SMR models is one that is well understood by
everyone.
				 ----
There is another model you might consider, and it's one which
Christoph Hillwig suggested at a LSF/MM at least 2-3 years ago, and
this is a model where the flash or the SMR disk could use a division
of labor similar to Object Based Disks (except hopefully with a less
awful interface).  The idea here is that you give up on LBA numbers,
and instead you move the entire responsibility of mapping (inode,
logical block) to (physical location) to the storage device.  The file
system would then be responsibile for managing metadata (mod times,
user/group ownership, permission bits/ACL's, etc) and namespace issues
(e.g., directory pathnames to inode lookups).
So this solves the problem you seem to be concerned about in terms of
keeping mapping information at two layers, and it solves it
completely, since the file system no longer has to do a mapping
between inode+logical offset to LBA number, which it would in the
models you've outlined to date.  It also solves the problem of giving
the storage device more information about which blocks belong to which
inode/object, and it would also make it easier for the OS to pass
object lifetime and shared fate hints to the storage device.  This
should hopefully allow the FTL or STL to do a better job, since it now
has access to low-level hardware inforation (e.g., BER / Soft ECC
failures) as well as higher-level object information to do a better
job making storage layout and garbage collection activities.
				 ----
A fair criticism of any of the models discussed to date (one based on
ZBC, the object-based storage model, or OCSSD) don't have any mature
implementations, either in the open source or the clost source world.
But since it's true for *all* of them, we should be using other
criteria for deciding which model is the best one to choose for the
long term.
The advantage of the ZBC model is that people have had several years
to consider and understand the model, so in terms of mind share it has
an advantage.
The advantage of the object-based model is that it transfers a lot of
the complexity to the storage device, so the job that needs to be done
by the file system is much simpler than either of the other two models.
The advantage of the OCSSD model is that it exposes a lot of the raw
flash complexities to the host.  This can be good in that the host can
now do a really good job optimize for a particular flash technology.
The downside is that by exposing all of that complexity to the host,
it makes file system design very fragile, since as the number of chips
changes, or the size of erase blocks changes, or as flash developes
new capabilities such as erase suspend/resume, *all* of that hair gets
exposed to the file system implementor.
Personally, I think that's why either the ZBC model or the
object-based model makes a lot more sense than something where we
expose all of the vagaries of NAND flash to the file system.
Cheers,
					- Ted
^ permalink raw reply	[flat|nested] 27+ messages in thread
 
 
- * Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-05 22:58             ` Slava Dubeyko
  2017-01-06  1:11               ` Theodore Ts'o
@ 2017-01-06 13:05               ` Matias Bjørling
  1 sibling, 0 replies; 27+ messages in thread
From: Matias Bjørling @ 2017-01-06 13:05 UTC (permalink / raw)
  To: Slava Dubeyko, Damien Le Moal, Viacheslav Dubeyko,
	lsf-pc@lists.linux-foundation.org, Theodore Ts'o
  Cc: Linux FS Devel, linux-block@vger.kernel.org,
	linux-nvme@lists.infradead.org
On 01/05/2017 11:58 PM, Slava Dubeyko wrote:
> Next point is read disturbance. If BER of physical page/block achieves some threshold then
> we need to move data from one page/block into another one. What subsystem will be
> responsible for this activity? The drive-managed case expects that device's GC will manage
> read disturbance issue. But what's about host-aware or host-managed case? If the host side
> hasn't information about BER then the host's software is unable to manage this issue. Finally,
> it sounds that we will have GC subsystem as on file system side as on device side. As a result,
> it means possible unpredictable performance degradation and decreasing device lifetime.
> Let's imagine that host-aware case could be unaware about read disturbance management.
> But how host-managed case can manage this issue?
The OCSSD interface uses a couple of methods:
1) Piggy back soft ECC errors onto the completion entry. Tells the host
that a block properly should be refreshed when appropriate.
2) Use an asynchronous interface, e.g., NVMe get log page. Report blocks
through this interface that has been read disturbed. This may be coupled
with the various processes running on the SSD.
3) (That Ted suggested). Expose a "reset" bit in the Report Zones
command to let the host know which blocks should be reset. If the
plumbing for 2) is not available, or the information has been lost on
the host side, this method can be used to "resync".
> 
> Bad block management... So, drive-managed and host-aware cases should be completely unaware
> about  bad blocks. But what's about host-managed case? If a device will hide bad blocks from
> the host then it means mapping table presence, access to logical pages/blocks and so on. If the host
> hasn't access to the bad block management then it's not host-managed model. And it sounds as
> completely unmanageable situation for the host-managed model. Because if the host has access
> to bad block management (but how?) then we have really simple model. Otherwise, the host
> has access to logical pages/blocks only and device should have internal GC. As a result,
> it means possible unpredictable performance degradation and decreasing device lifetime because
> of competition of GC on device side and GC on the host side.
Agree. depending on the use-case, one may expose a "perfect" interface
to the host, or one may expose an interface where media errors may be
reported to the host. The former case are great for consumer units,
where I/O predictability isn't critical, and similarly if I/O
predictability is critical, the media errors can be reported, and the
host may deal with them appropriately.
^ permalink raw reply	[flat|nested] 27+ messages in thread 
 
- * Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-04  7:24           ` Damien Le Moal
                               ` (2 preceding siblings ...)
  2017-01-05 22:58             ` Slava Dubeyko
@ 2017-01-06  1:09             ` Jaegeuk Kim
  2017-01-06 12:55               ` Matias Bjørling
  3 siblings, 1 reply; 27+ messages in thread
From: Jaegeuk Kim @ 2017-01-06  1:09 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Slava Dubeyko, Matias Bjørling, Viacheslav Dubeyko,
	lsf-pc@lists.linux-foundation.org, Linux FS Devel,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
Hello,
On 01/04, Damien Le Moal wrote:
...
> 
> > Finally, if I really like to develop SMR- or NAND flash oriented file
> > system then I would like to play with peculiarities of concrete
> > technologies. And any unified interface will destroy the opportunity 
> > to create the really efficient solution. Finally, if my software
> > solution is unable to provide some fancy and efficient features then
> > guys will prefer to use the regular stack (ext4, xfs + block layer).
> 
> Not necessarily. Again think in terms of device "model" and associated
> feature set. An FS implementation may decide to support all possible
> models, with likely a resulting incredible complexity. More likely,
> similarly with what is happening with SMR, only models that make sense
> will be supported by FS implementation that can be easily modified.
> Example again here of f2fs: changes to support SMR were rather simple,
> whereas the initial effort to support SMR with ext4 was pretty much
> abandoned as it was too complex to integrate in the existing code while
> keeping the existing on-disk format.
>From the f2fs viewpoint, now we support single host-managed SMR drive having
a portion of conventional zones. In addition, f2fs supports multiple devices
[1], which enables us to use pure host-managed SMR which has no conventional
zone, working with another small conventional partition.
I think current lightNVM with OCSSD aims towards a drive-managed device for
generic filesystems. Depending on FTL, however, OCSSD can report conventional
or sequential zones. 1) If FTL handles random 4K writes pretty well, it would
be better to report converntional zones. Otherwise, 2) if FTL has almost nothing
to map bettwen LBA to PBA, it is able to report sequential zones likewise pure
host-managed SMR.
Interestingly, for 1) host-aware model, there is no need to change f2fs at all.
In order to explore 2) pure host-managed model, I introduced aligned write IO
[2] to make FTL more simple by eliminating partial page write. IMHO, it'd be
funny to evaluate several zoned models of SMR and OCSSD accordingly.
[1] https://lkml.org/lkml/2016/11/9/727
[2] https://lkml.org/lkml/2016/12/30/242
Thanks,
^ permalink raw reply	[flat|nested] 27+ messages in thread
- * Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os
  2017-01-06  1:09             ` Jaegeuk Kim
@ 2017-01-06 12:55               ` Matias Bjørling
  0 siblings, 0 replies; 27+ messages in thread
From: Matias Bjørling @ 2017-01-06 12:55 UTC (permalink / raw)
  To: Jaegeuk Kim, Damien Le Moal
  Cc: Slava Dubeyko, Viacheslav Dubeyko,
	lsf-pc@lists.linux-foundation.org, Linux FS Devel,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
On 01/06/2017 02:09 AM, Jaegeuk Kim wrote:
> Hello,
> 
> On 01/04, Damien Le Moal wrote:
> 
> ...
>>
>>> Finally, if I really like to develop SMR- or NAND flash oriented file
>>> system then I would like to play with peculiarities of concrete
>>> technologies. And any unified interface will destroy the opportunity 
>>> to create the really efficient solution. Finally, if my software
>>> solution is unable to provide some fancy and efficient features then
>>> guys will prefer to use the regular stack (ext4, xfs + block layer).
>>
>> Not necessarily. Again think in terms of device "model" and associated
>> feature set. An FS implementation may decide to support all possible
>> models, with likely a resulting incredible complexity. More likely,
>> similarly with what is happening with SMR, only models that make sense
>> will be supported by FS implementation that can be easily modified.
>> Example again here of f2fs: changes to support SMR were rather simple,
>> whereas the initial effort to support SMR with ext4 was pretty much
>> abandoned as it was too complex to integrate in the existing code while
>> keeping the existing on-disk format.
> 
> From the f2fs viewpoint, now we support single host-managed SMR drive having
> a portion of conventional zones. In addition, f2fs supports multiple devices
> [1], which enables us to use pure host-managed SMR which has no conventional
> zone, working with another small conventional partition.
That is a good approach. SSD controllers may even implement a small FTL
inside the device for the "conventional" zones. The size wouldn't be
that big and may only be used to bootstrap rest of the unit. A zone with
a couple hundred megabytes should do. That'll simplify having pblk on
the side next to f2fs.
^ permalink raw reply	[flat|nested] 27+ messages in thread