[Patch 0/4] RFC : Support for data gradation of a single file.


* [Patch 0/4] RFC : Support for data gradation of a single file.
@ 2018-04-06 11:41 Sayan Ghosh
  2018-04-06 21:31 ` Andreas Dilger
  2018-04-06 22:27 ` Theodore Y. Ts'o
  0 siblings, 2 replies; 10+ messages in thread
From: Sayan Ghosh @ 2018-04-06 11:41 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, Bhattacharya, Suparna, niloy ganguly,
	Madhumita Mallick, Bharde, Madhumita

Hi all,

The following patch series aims to store a file together with grade
information. Consider video indexing for a learning programme, where
some portions of the video are annotated and more important than
others, and hence are accessed more often. We consider a similar
scenario in which a file comes with grade information that states
which blocks are important and which are not. The grades we consider
are binary, with 1 denoting high grade.
The file is stored on an LVM volume that combines storage devices
belonging to different tiers (as ext4 doesn’t support spanning
multiple block devices); one combination could be persistent memory
and hard disk. The target is to store the higher-graded blocks in the
higher-performance tier and the lower-graded blocks in the
lower-performance tier.
Consider C code in which the grades of the file's blocks are set from
user space through extended attributes. The grade structure stores the
high-grade segments of the file, each given by its starting block
number and its span length. We assume the grade of all remaining
blocks to be 0 (low).

---
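#define _GNU_SOURCE             /* for fallocate() */
#include <fcntl.h>
#include <sys/types.h>
#include <sys/xattr.h>

typedef struct _grade {
   unsigned long long block_num;   /* first block of a high-grade segment */
   unsigned long long length;      /* span length of the segment */
} grade_extents;

int fd = open(filename, O_CREAT|O_RDWR, (mode_t)0777);

/* mark the file as graded */
int xattr_value = 1;
int status1 = fsetxattr(fd, "user.is_graded", (const void *)&xattr_value,
                    sizeof(int), 0);

/* two high-grade segments: {start block, span length} */
grade_extents grade_array[] = {{1,2},{50,10}};
size_t count = sizeof(grade_array) / sizeof(grade_array[0]);
int status2 = fsetxattr(fd, "user.grade_array", (const void *)grade_array,
                        count * sizeof(grade_extents), 0);

/* creating a 1 MB file */
int status3 = fallocate(fd, 0, 0, (1024 * 1024));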
----
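
For illustration, the grade array above could be interpreted as in the
sketch below. This is not the code from the patches; the helper name
is_high_grade() is hypothetical, and it assumes that each entry covers
blocks block_num through block_num + length - 1.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the grade structure shown above. */
typedef struct _grade {
    unsigned long long block_num;   /* first block of a high-grade segment */
    unsigned long long length;      /* span length of the segment */
} grade_extents;

/*
 * Hypothetical helper: return 1 if `block` lies inside any high-grade
 * segment, 0 otherwise (all remaining blocks are grade 0).
 */
static int is_high_grade(const grade_extents *arr, size_t count,
                         unsigned long long block)
{
    assert(arr != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        /* subtraction form, so block_num + length cannot overflow */
        if (block >= arr[i].block_num &&
            block - arr[i].block_num < arr[i].length)
            return 1;
    }
    return 0;
}
```

With the array {{1,2},{50,10}}, blocks 1 and 2 and blocks 50 through 59
would be reported as high grade, and every other block as low grade.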

The first two patches of the series read the grades and pre-allocate
space in the respective tiers through fallocate.
The next task is to write data to and read data from these files; the
third patch addresses this.
The final patch in the series provides a reduced view of the file,
i.e. it shows just the high-graded blocks. The motivation is that an
application may need to access only the important portions of the
file, such as only the annotated parts of a learning video.
We made the patches on top of Linux kernel 4.7.2.
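
One way to picture the reduced view is as a mapping from block indices
in the reduced file to block numbers in the full file. The sketch below
only illustrates that idea and is not taken from the patches; the helper
name reduced_to_full() is hypothetical, and it assumes the high-grade
segments are sorted and do not overlap.

```c
#include <assert.h>
#include <stddef.h>

typedef struct _grade {
    unsigned long long block_num;   /* first block of a high-grade segment */
    unsigned long long length;      /* span length of the segment */
} grade_extents;

/*
 * Hypothetical helper: translate block index `rblock` of the reduced
 * view into the corresponding block number of the full file.
 * Returns 0 on success, -1 if rblock is past the end of the reduced view.
 */
static int reduced_to_full(const grade_extents *arr, size_t count,
                           unsigned long long rblock,
                           unsigned long long *full_block)
{
    assert(full_block != NULL);
    for (size_t i = 0; i < count; i++) {
        if (rblock < arr[i].length) {
            *full_block = arr[i].block_num + rblock;
            return 0;
        }
        /* skip this segment and continue in the next one */
        rblock -= arr[i].length;
    }
    return -1;
}
```

For the array {{1,2},{50,10}} the reduced view would be 12 blocks long:
reduced blocks 0 and 1 map to full-file blocks 1 and 2, and reduced
blocks 2 through 11 map to blocks 50 through 59.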

---
 fs/dax.c          | 139 +++++++++++++++++++++++++++++++++
 fs/ext4/ext4.h    |  17 +++++
 fs/ext4/extents.c | 151 +++++++++++++++++++++++++++++++++++-
 fs/ext4/file.c    | 225 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 525 insertions(+), 7 deletions(-)

Regards,
Sayan Ghosh
IIT Kharagpur

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch 0/4] RFC : Support for data gradation of a single file.
  2018-04-06 11:41 [Patch 0/4] RFC : Support for data gradation of a single file Sayan Ghosh
@ 2018-04-06 21:31 ` Andreas Dilger
  2018-04-06 22:27 ` Theodore Y. Ts'o
  1 sibling, 0 replies; 10+ messages in thread
From: Andreas Dilger @ 2018-04-06 21:31 UTC (permalink / raw)
  To: Sayan Ghosh
  Cc: Ext4 Developers List, Linux FS Devel, Bhattacharya, Suparna,
	niloy ganguly, Madhumita Mallick, Linux Kernel Mailing List,
	Bharde, Madhumita, Jens Axboe

On Apr 6, 2018, at 5:41 AM, Sayan Ghosh <sgdgp.2014@gmail.com> wrote:
> 
> Hi all,
> 
> The following series of patches aim to store a file with a graded
> information. Consider a scenario of video indexing for learning
> programme where some of the portions of the video is annotated and
> important than other portions, hence to be accessed more often. We
> consider the similar scenario where we have a file along with a grade
> information that mentions which blocks are important and which are
> not. The grades we consider are binary with 1 denoting high grade.
> Now the file is stored in a LVM which comprises of different set of
> storage devices belong to different tiers (as ext4 doesn’t support
> spanning over multiple block driver), - one combination could be
> persistent memory and hard-disk. The target is to store the higher
> graded blocks in the higher performance tier and the lower graded
> blocks in the lower performance tier.
> Consider a C code where the grade of the file blocks are being set in
> the user space through extended attribute. The grade structure stores
> the span of different high graded segments in the file with starting
> high grade block numbers and the span length of the segments. We
> assume grade of rest of the blocks as 0 (low).

There was a considerable amount of work and discussion on implementing
Stream IDs for the block layer.  This would annotate writes from userspace
and allow the underlying storage (filesystem and block layer) to use the
stream ID for block allocation.  See the following for more details:

    https://lwn.net/Articles/717755/
    https://lwn.net/Articles/726477/
    http://lists.infradead.org/pipermail/linux-nvme/2017-June/011322.html

In the absence of other information, the Stream ID would just mean "group
allocations with the same ID together". After some discussion, it looks
like the latest patch has generic "lifetime" hints rather than "stream IDs",
but the end result is largely the same.

It would make sense for you to spend time testing and fixing that patch
series instead of trying to introduce a new interface.  IMHO, there is
no need to make these hints persistent on disk, since their state could
be inferred by the allocation placement directly.
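
As a rough sketch of what the "lifetime" hint interface looks like from
user space: kernels from 4.13 on accept fcntl(F_SET_RW_HINT) with one of
the RWH_WRITE_LIFE_* values. The mapping from a binary grade to a
lifetime value below is purely an assumption made for illustration (the
hints describe expected data lifetime, not importance), the function
names are hypothetical, and a hint set this way applies to the inode as
a whole.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>

/* Fallbacks for older libc headers; values as in <linux/fcntl.h>. */
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT 1036          /* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT 2
#endif
#ifndef RWH_WRITE_LIFE_LONG
#define RWH_WRITE_LIFE_LONG 4
#endif

/* Assumed mapping, for illustration only: grade 1 -> SHORT, 0 -> LONG. */
static uint64_t grade_to_hint(int grade)
{
    assert(grade == 0 || grade == 1);
    return grade ? RWH_WRITE_LIFE_SHORT : RWH_WRITE_LIFE_LONG;
}

/* Set the write-lifetime hint on the inode behind fd; 0 or -1 (errno). */
static int set_file_grade_hint(int fd, int grade)
{
    uint64_t hint = grade_to_hint(grade);

    return fcntl(fd, F_SET_RW_HINT, &hint);
}
```

Whether and how the filesystem or device acts on the hint depends on
the kernel and the storage underneath.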

Cheers, Andreas

* Re: [Patch 0/4] RFC : Support for data gradation of a single file.
  2018-04-06 11:41 [Patch 0/4] RFC : Support for data gradation of a single file Sayan Ghosh
  2018-04-06 21:31 ` Andreas Dilger
@ 2018-04-06 22:27 ` Theodore Y. Ts'o
  2018-04-09  4:03   ` Andreas Dilger
  2018-04-10  9:52   ` Sayan Ghosh
  1 sibling, 2 replies; 10+ messages in thread
From: Theodore Y. Ts'o @ 2018-04-06 22:27 UTC (permalink / raw)
  To: Sayan Ghosh
  Cc: linux-ext4, linux-fsdevel, Bhattacharya, Suparna, niloy ganguly,
	Madhumita Mallick, Bharde, Madhumita

Hi Sayan,

It wasn't clear what your purpose was in posting these patches.  There
are a large number of ways in which they simply aren't ready for
upstream merging.  As a short list:

1)  They are against an ancient version of the kernel (4.7.2).

2)  There are a large number of TODOs in the code.

3)  The boundary between the two different tiers of storage is
currently hard-coded in a header file using a #define (!).

If the goal is to gather comments about the design, I wish you had
presented the problem statement to the ext4 mailing list much earlier.
It might have saved you time, since we could have given you
feedback before you had done all of this work on this patch set.

Andreas' comments about making the allocation hints persistent not
making any sense are very much on target.  Once the file is written,
the hints won't be needed at all.

In addition, you should strongly think about some way of propagating the
fact that some blocks in a device-mapper device are backed by DAX, and
others are not, as a device-mapper interface.  And there might not
necessarily be a single break point where below a block number is SSD or
HDD storage, and above a block number it's DAX storage.
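
To make the "no single break point" case concrete, the layout could be
described as a table of block ranges rather than one #define. The sketch
below is only an illustration: the types and the function
tier_for_block() are hypothetical, and no such device-mapper interface
is being described here.

```c
#include <assert.h>
#include <stddef.h>

enum tier { TIER_HDD, TIER_SSD, TIER_DAX, TIER_UNKNOWN };

/* Hypothetical description of one contiguous range of the device. */
struct tier_range {
    unsigned long long start;   /* first block of the range */
    unsigned long long len;     /* number of blocks in the range */
    enum tier tier;             /* storage class backing the range */
};

/* Look up the storage class backing `block`; ranges may be interleaved. */
static enum tier tier_for_block(const struct tier_range *map, size_t n,
                                unsigned long long block)
{
    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (block >= map[i].start && block - map[i].start < map[i].len)
            return map[i].tier;
    }
    return TIER_UNKNOWN;
}
```

With a table such as {0, 1000, SSD}, {1000, 500, DAX}, {1500, 1000, HDD},
the DAX blocks sit between the SSD and HDD ranges, which a single
boundary value cannot express.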

The other thing to consider is whether it makes any sense at all to
solve this problem by having a single file system where part of the
storage is DAX, and part is not.  Why not just have two file systems,
one which is 100% DAX, and another which is 100% HDD/SSD, and store
the data in two files in two different file systems?

						- Ted

* Re: [Patch 0/4] RFC : Support for data gradation of a single file.
  2018-04-06 22:27 ` Theodore Y. Ts'o
@ 2018-04-09  4:03   ` Andreas Dilger
  2018-04-10  9:46     ` Sayan Ghosh
  2018-04-10  9:56     ` Sayan Ghosh
  2018-04-10  9:52   ` Sayan Ghosh
  1 sibling, 2 replies; 10+ messages in thread
From: Andreas Dilger @ 2018-04-09  4:03 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Sayan Ghosh, Ext4 Developers List, Linux FS Devel,
	Bhattacharya, Suparna, niloy ganguly, Madhumita Mallick,
	Bharde, Madhumita

On Apr 6, 2018, at 4:27 PM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> The other thing to consider is whether it makes any sense at all to
> solve this problem by haing a single file system where part of the
> storage is DAX, and part is not.  Why not just have two file systems,
> one which is 100% DAX, and another which is 100% HDD/SSD, and store
> the data in two files in two different file systems?

I think there definitely *are* benefits to having both flash and HDDs
(and/or other different storage classes such as RAID-10 and RAID-6) in
the same filesystem namespace.  This is the premise behind bcache,
XFS realtime volumes, Btrfs, etc.

That said, having a hard-coded separation of flash vs. disks does not
make sense, even from an intermediate development point of view.  There
definitely should be a block-device interface for querying what the
actual layout is, perhaps something like the SMR zones?

Alternately, ext4 could add something akin to the realtime volume in
XFS, where it can directly address multiple storage devices to handle
different storage classes, but that would need at least some amount of
development.  It was actually one of the options on the table for the
early ext2resize development, to split the ext4 block groups across
devices and then concatenate them logically at runtime.  That would
allow e.g. some number of DAX block groups, NVMe block groups, and HDD
RAID-6 block groups all in the same filesystem.  Even then, there would
need to be some way for ext4 to query the storage type of the underlying
devices, so that these could be mapped to the lifetime hints.

Cheers, Andreas


* Re: [Patch 0/4] RFC : Support for data gradation of a single file.
  2018-04-09  4:03   ` Andreas Dilger
@ 2018-04-10  9:46     ` Sayan Ghosh
  2018-04-10 18:40       ` Andreas Dilger
  2018-04-10  9:56     ` Sayan Ghosh
  1 sibling, 1 reply; 10+ messages in thread
From: Sayan Ghosh @ 2018-04-10  9:46 UTC (permalink / raw)
  To: Theodore Y. Ts'o, Andreas Dilger
  Cc: Ext4 Developers List, Linux FS Devel, Bhattacharya, Suparna,
	niloy ganguly, Madhumita Mallick, Bharde, Madhumita

Hello,

Thank you, Andreas and Theodore, for taking the time to review the
patchset and for providing comments and suggestions.
I am describing the problem statement in this mail.

The goal of our project is broadly to support data gradation of a
single file. If the contents of the file are graded in terms of their
importance, then an application might need to view/analyse
only the important portions. It also helps if the important portions
can be accessed quickly without having to go through the entire file.
For example, we can think of a learning video with
indexing/annotations, in which the annotations mark the important
parts of the video. A learner may be interested only in those parts,
and it will help him if he can be provided with a reduced view with
just the parts he’s interested in. An example of such videos is ACM
Webinar videos, where a user can navigate using a table of contents or
a phrase cloud.

The link below is one such video:
https://videoken.com/video-detail?videoID=IpGxLWOIZy4&videoDuration=1853&videoName=A%20Friendly%20Introduction%20to%20Machine%20Learning&keyword=A%20Friendly%20Introduction%20to%20Machine%20Learning

There’s a word cluster associated with the video, and upon clicking on
a word, the red-black arrowheads (down) point to the portions where the
word has been used. A more sophisticated version of the same would be
to provide the user with a complete reduced clip of the annotated
portions for the word cluster, rather than the user having to manually
click on the portions he’s interested in.

This kind of video file can serve as an input to our system, where we
know which parts of the file have been marked. Our goal then is to
place the important blocks appropriately and provide a reduced view
of just the important parts of the file. Placing the important blocks
in a faster tier (SSD, PM, etc.) greatly enhances the performance of
reading and writing the file.

As stated above, we are interested in providing a reduced view of a
single file in which important and unimportant portions are
interspersed; hence splitting it across two filesystems, one holding
the important and one the unimportant parts, would not serve our
objective. Say that, in the example, a user wants the full view of
the video. In this case splitting the video across two filesystems
would not be ideal, as the user needs to be provided with both
important and unimportant blocks. Creating a sparse layout to overlay
two files would be unnecessarily complicated. It would hence be ideal
if the file carried the grade information as metadata (extended
attributes in our case), and that information were used to place and
fetch blocks properly when necessary.

Regards,
Sayan Ghosh

* Re: [Patch 0/4] RFC : Support for data gradation of a single file.
  2018-04-10  9:46     ` Sayan Ghosh
@ 2018-04-10 18:40       ` Andreas Dilger
  2018-04-11  9:20         ` Bhattacharya, Suparna
  0 siblings, 1 reply; 10+ messages in thread
From: Andreas Dilger @ 2018-04-10 18:40 UTC (permalink / raw)
  To: Sayan Ghosh
  Cc: Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Bhattacharya, Suparna, niloy ganguly, Madhumita Mallick,
	Bharde, Madhumita

On Apr 10, 2018, at 3:46 AM, Sayan Ghosh <sgdgp.2014@gmail.com> wrote:
> 
> Hello,
> 
> Thank you Andreas and Theodore for taking time in reviewing the
> patchset and also for providing comments and suggestions.
> I am describing the problem statement in this mail.
> 
> 
> The goal of our project is broadly to support data gradation of a
> single file. If the contents of the file is graded in terms of its
> importance then a corresponding application might need to view/analyse
> only the important portions. It also helps if the important portions
> can be accessed quickly without having to go through the entire file.
> For an example, we can think of a leaning video with
> indexing/annotations, in which the annotations contain the important
> parts of the video. A learner can just be interested in those parts,
> and it will help him if he can be provided with a reduced view with
> just the parts he’s interested in. An example of such videos is ACM
> Webinar videos where an user can navigate using table-of-contents or
> phrase cloud.
> 
> The below link is one similar video -
> https://videoken.com/video-detail?videoID=IpGxLWOIZy4&videoDuration=1853&videoName=A%20Friendly%20Introduction%20to%20Machine%20Learning&keyword=A%20Friendly%20Introduction%20to%20Machine%20Learning
> 
> 
> There’s a word-cluster associated with the video, and upon clicking on
> a word the red-black arrowheads (down) point to the portions where the
> word had been used. A more sophisticated version of the same would be
> to provide the user a complete reduced clipping with the annotated
> portions of the word cluster, rather than the user having to manually
> click on the portions he’s interested in.
> 
> These kind of video file can serve as an input to our system where we
> know which parts of the file has been marked. Our goal then is to
> properly place respective important blocks and provide a reduced view
> of just the important parts of the file. Placing the important blocks
> in a faster tier (SSD,PM etc) greatly enhances the performance of
> reading and writing of the file.
> 
> As stated above, we are interested in providing a reduced view of a
> single file where important and unimportant portions are interspersed
> - hence splitting it in two filesystems with important and unimportant
> parts would not serve our objective. Let’s say in the example, an user
> wants the full view of the video. In this case splitting the video in
> two filesystems would not be ideal, as the user needs to be provided
> with both important and unimportant blocks. Creating a sparse layout
> to overlay two files will unnecessarily be complicated. It’ll hence be
> ideal if a file has those graded information as a metadata (extended
> attributes in our case), and use those information to properly place
> and fetch when necessary.

To my thinking, you're always going to have more complex metadata for
the file stored in some kind of external database or a separate index
file.  You're not going to get the filesystem and all filesystem tools
to understand the full "importance of this extent" metrics, as that is
going to be different for each application, so storing a single bit of
"importance" for every block in the filesystem is not very helpful and
you may as well just rely on the external database/index file for this.


What you are really interested in is having the ability to provide hints
for the filesystem block allocator to store in different storage classes
within the same file, and (potentially) some way to retrieve the current
storage class upon request.

That said, the first part (requesting specific storage classes during
write) could be achieved by enhancing the StreamID/Lifetime patches to
allow specifying different hints for each write.  I think this had been
proposed at one time, but there wasn't any proposed use case for having
different storage classes within the same file, but now there is.

As for the interface for determining how the file is currently laid out,
I think that the FIEMAP ioctl could potentially be used for this.  It
will tell you the block number for each extent of the file, which could
be mapped to a different storage class if you are doing the mapping game
with LVM.  It is also possible to have FIEMAP return the device to
the caller (as Lustre does) if the filesystem can manage multiple devices.
I think that would be useful for XFS (realtime volume), BtrFS (can use
multiple devices directly), and potentially ext4 if someone added the
ability to use multiple devices directly.
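
A minimal sketch of the FIEMAP approach, under stated assumptions: the
FS_IOC_FIEMAP ioctl and the struct fiemap fields are the real kernel
interface, but the function names and the single "boundary" parameter
(the physical byte offset where a fast tier is assumed to end in the
LVM layout) are hypothetical. Note that FIEMAP reports physical offsets
in bytes, and this sketch only looks at the first MAX_EXTENTS extents.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>   /* struct fiemap, struct fiemap_extent */

#define MAX_EXTENTS 32

/* Assumed layout: physical offsets below `boundary` are the fast tier. */
static int is_fast_tier(unsigned long long physical,
                        unsigned long long boundary)
{
    assert(boundary > 0);
    return physical < boundary;
}

/* Print the assumed tier of each extent; returns extent count or -1. */
static int print_extent_tiers(int fd, unsigned long long boundary)
{
    size_t sz = sizeof(struct fiemap) +
                MAX_EXTENTS * sizeof(struct fiemap_extent);
    struct fiemap *fm = calloc(1, sz);
    int n;

    if (fm == NULL)
        return -1;
    fm->fm_start = 0;
    fm->fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
    fm->fm_flags = FIEMAP_FLAG_SYNC;     /* flush delayed allocation first */
    fm->fm_extent_count = MAX_EXTENTS;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
        free(fm);
        return -1;
    }
    for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
        const struct fiemap_extent *e = &fm->fm_extents[i];

        printf("logical %llu len %llu -> %s tier\n",
               (unsigned long long)e->fe_logical,
               (unsigned long long)e->fe_length,
               is_fast_tier(e->fe_physical, boundary) ? "fast" : "slow");
    }
    n = (int)fm->fm_mapped_extents;
    free(fm);
    return n;
}
```

A real tool would loop until it sees an extent flagged
FIEMAP_EXTENT_LAST and would take the tier layout from LVM rather than
from a single boundary value.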

Cheers, Andreas


* RE: [Patch 0/4] RFC : Support for data gradation of a single file.
  2018-04-10 18:40       ` Andreas Dilger
@ 2018-04-11  9:20         ` Bhattacharya, Suparna
  0 siblings, 0 replies; 10+ messages in thread
From: Bhattacharya, Suparna @ 2018-04-11  9:20 UTC (permalink / raw)
  To: Andreas Dilger, Sayan Ghosh
  Cc: Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	niloy ganguly, Madhumita Mallick, Bharde, Madhumita

Hi Andreas,

> -----Original Message-----
> From: Andreas Dilger [mailto:adilger@dilger.ca]
> Sent: Wednesday, April 11, 2018 12:10 AM
> To: Sayan Ghosh <sgdgp.2014@gmail.com>
> Cc: Theodore Y. Ts'o <tytso@mit.edu>; Ext4 Developers List <linux-
> ext4@vger.kernel.org>; Linux FS Devel <linux-fsdevel@vger.kernel.org>;
> Bhattacharya, Suparna <suparna.bhattacharya@hpe.com>; niloy ganguly
> <ganguly.niloy@gmail.com>; Madhumita Mallick
> <madhu.cse.ju@gmail.com>; Bharde, Madhumita
> <madhumita.bharde@hpe.com>
> Subject: Re: [Patch 0/4] RFC : Support for data gradation of a single file.
> 
> On Apr 10, 2018, at 3:46 AM, Sayan Ghosh <sgdgp.2014@gmail.com>
> wrote:
> >
> > Hello,
> >
> > Thank you Andreas and Theodore for taking time in reviewing the
> > patchset and also for providing comments and suggestions.
> > I am describing the problem statement in this mail.
> >
> >
> > The goal of our project is broadly to support data gradation of a
> > single file. If the contents of the file is graded in terms of its
> > importance then a corresponding application might need to view/analyse
> > only the important portions. It also helps if the important portions
> > can be accessed quickly without having to go through the entire file.
> > For an example, we can think of a leaning video with
> > indexing/annotations, in which the annotations contain the important
> > parts of the video. A learner can just be interested in those parts,
> > and it will help him if he can be provided with a reduced view with
> > just the parts he’s interested in. An example of such videos is ACM
> > Webinar videos where an user can navigate using table-of-contents or
> > phrase cloud.
> >
> > The below link is one similar video -
> > https://videoken.com/video-detail?videoID=IpGxLWOIZy4&videoDuration=1853&videoName=A%20Friendly%20Introduction%20to%20Machine%20Learning&keyword=A%20Friendly%20Introduction%20to%20Machine%20Learning
> >
> >
> > There’s a word-cluster associated with the video, and upon clicking on
> > a word the red-black arrowheads (down) point to the portions where the
> > word had been used. A more sophisticated version of the same would be
> > to provide the user a complete reduced clipping with the annotated
> > portions of the word cluster, rather than the user having to manually
> > click on the portions he’s interested in.
> >
> > These kind of video file can serve as an input to our system where we
> > know which parts of the file has been marked. Our goal then is to
> > properly place respective important blocks and provide a reduced view
> > of just the important parts of the file. Placing the important blocks
> > in a faster tier (SSD,PM etc) greatly enhances the performance of
> > reading and writing of the file.
> >
> > As stated above, we are interested in providing a reduced view of a
> > single file where important and unimportant portions are interspersed
> > - hence splitting it in two filesystems with important and unimportant
> > parts would not serve our objective. Let’s say in the example, an user
> > wants the full view of the video. In this case splitting the video in
> > two filesystems would not be ideal, as the user needs to be provided
> > with both important and unimportant blocks. Creating a sparse layout
> > to overlay two files will unnecessarily be complicated. It’ll hence be
> > ideal if a file has those graded information as a metadata (extended
> > attributes in our case), and use those information to properly place
> > and fetch when necessary.
> 
> To my thinking, you're always going to have more complex metadata for
> the file stored in some kind of external database or a separate index
> file.  You're not going to get the filesystem and all filesystem tools
> to understand the full "importance of this extent" metrics, as that is
> going to be different for each application, so storing a single bit of
> "importance" for every block in the filesystem is not very helpful and
> you may as well just rely on the external database/index file for this.
> 

You have a point there. We wouldn't want to clutter the fs with all
kinds of complex application-specific metadata interpretation.
However, the simplicity of accessing a reduced view of the file with
existing interfaces is rather appealing. It also provides a natural
way to drive hints that optimize not just layout but other things,
such as readahead decisions, as it is a good clue to what data apps
would access or need. It is even a way to shape what they access,
instead of having them pull in data that won't be useful (while still
preserving the ability to see and retain the full view).

As Sayan observed, layout hints can't guarantee where data will be
placed, so we can't reverse-map the view just from the layout. The
grade attributes are one way to specify this kind of control-plane
information from an application's view, and they also make it easy to
change the view (without having to force a reorganization on disk).
Are there other ways to convey such context (persistently) that would
be more broadly useful?

Another possibility is a snapshot-like approach where a second inode
has the reduced (high-grade) view, but that gets more complex, and it
is trickier to preserve across copies, backups, etc.
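
As a user-space analogue of the readahead idea, an application that
knows the high-grade segments could already ask for them to be
prefetched with posix_fadvise(). This is only a sketch and not what the
patches do: the helper name prefetch_high_grade() is hypothetical, and
block_size is assumed to be the filesystem block size (for example
4096).

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

typedef struct _grade {
    unsigned long long block_num;   /* first block of a high-grade segment */
    unsigned long long length;      /* span length of the segment */
} grade_extents;

/*
 * Hypothetical helper: advise the kernel that every high-grade segment
 * will be needed soon. Returns 0 on success, or the first error number
 * returned by posix_fadvise().
 */
static int prefetch_high_grade(int fd, const grade_extents *arr,
                               size_t count, unsigned long long block_size)
{
    assert(block_size > 0);
    for (size_t i = 0; i < count; i++) {
        off_t off = (off_t)(arr[i].block_num * block_size);
        off_t len = (off_t)(arr[i].length * block_size);
        int err = posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);

        if (err != 0)
            return err;
    }
    return 0;
}
```

Note that posix_fadvise() returns the error number directly instead of
setting errno, which is why the helper passes its return value through.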

> 
> What you are really interested in is having the ability to provide hints
> for the filesystem block allocator to store in different storage classes
> within the same file, and (potentially) some way to retrieve the current
> storage class upon request.
> 
> That said, the first part (requesting specific storage classes during
> write) could be achieved by enhancing the StreamID/Lifetime patches to
> allow specifying different hints for each write.  I think this had been
> proposed at one time, but there wasn't any proposed use case for having
> different storage classes within the same file, but now there is.
> 
> As for the interface for determining how the file is currently laid out,
> I think that the FIEMAP ioctl could potentially be used for this.  It
> will tell you the block number for each extent of the file, which could
> be mapped to a different storage class if you are doing the mapping game
> with LVM.  It is also possible to have FIEMAP also return the device to
> the caller (as Lustre does) if the filesystem can manage multiple devices.
> I think that would be useful for XFS (realtime volume), BtrFS (can use
> multiple devices directly), and potentially ext4 if someone added the
> ability to use multiple devices directly.
> 
> Cheers, Andreas

Regards
Suparna


* Re: [Patch 0/4] RFC : Support for data gradation of a single file.
  2018-04-09  4:03   ` Andreas Dilger
  2018-04-10  9:46     ` Sayan Ghosh
@ 2018-04-10  9:56     ` Sayan Ghosh
  2018-04-10 23:39       ` Dave Chinner
  1 sibling, 1 reply; 10+ messages in thread
From: Sayan Ghosh @ 2018-04-10  9:56 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel,
	Bhattacharya, Suparna, niloy ganguly, Madhumita Mallick,
	Bharde, Madhumita

Hi Andreas,

> In the absence of other information, the Stream ID would just mean "group
> allocations with the same ID together". After some discussion, it looks
> like the latest patch has generic "lifetime" hints rather than "stream IDs",
> but the end result is largely the same.

I looked up the links you provided on Stream IDs, which provide
lifetime hints for a file. In our case, different blocks of a single
file have different importance (grade) levels. I am not sure whether a
different *allocation type hint*, akin to the lifetime hints, can be
expressed through Stream IDs. I have yet to read the details of the
Stream ID concept to see whether we can use it both to allocate
different blocks of a single file to separate tiers and to provide a
reduced view. Any insight on this would be really helpful.

> series instead of trying to introduce a new interface.  IMHO, there is
> no need to make these hints persistent on disk, since their state could
> be inferred by the allocation placement directly

Not making the hints persistent causes two problems. 1) A higher
graded block may have been stored on the HDD, for example because the
higher tier overflowed, yet it is still critical from the
application's point of view (in our code it can still be accessed
from the HDD). 2) The grade information needs to be preserved when the
file is copied. Suppose the higher tier gets full, so we store the
high graded blocks of a file in the lower tier, and after storing we
delete the grade metadata as well. If we now copy this file to some
other mixed block device that has sufficient space in its higher tier,
we would still not be able to store those high graded blocks in the
higher tier there, because a state inferred from the allocation
placement would mark them as low graded.
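
The overflow case can be made concrete with a small sketch. It is only
an illustration under stated assumptions: the tier labels, the
struct block_state type and both helper functions are hypothetical and
do not appear in the posted patches. It shows that a high grade stored
with the file can disagree with the grade inferred from placement, and
that only the stored grade identifies the blocks worth moving to the
faster tier later.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tier labels; the patch series does not define these names. */
enum tier { TIER_SLOW = 0, TIER_FAST = 1 };

/* One file block: the grade stored with the file and the tier it landed in. */
struct block_state {
    int grade;          /* 1 = high grade, 0 = low grade (binary grades) */
    enum tier placed;   /* tier the allocator actually used for the block */
};

/* Grade as it would be inferred from the allocation placement alone. */
static int inferred_grade(const struct block_state *b)
{
    return b->placed == TIER_FAST ? 1 : 0;
}

/*
 * Count blocks whose stored grade is high although they sit in the slow
 * tier, e.g. because the fast tier was full at allocation time. Without a
 * persistent grade these blocks are indistinguishable from low-grade ones.
 */
static size_t count_misplaced_high(const struct block_state *blocks, size_t n)
{
    size_t count = 0;

    assert(blocks != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (blocks[i].grade == 1 && inferred_grade(&blocks[i]) == 0)
            count++;
    return count;
}
```

A copy tool that carries the stored grades along could place exactly
these blocks in the higher tier of the destination device.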

> That said, having a hard-coded separation of flash vs. disks does not
> make sense, even from an intermediate development point of view.  There
> definitely should be a block-device interface for querying what the
> actual layout is, perhaps something like the SMR zones?

Yes, I agree that the ideal situation would be to have a mechanism to
identify the segment boundaries automatically inside the LVM. But we
could not find a system call that exposes the boundaries, or rather
the location of a free block in each segment.
So, just to test the system, we hardcoded the boundaries to match our
simulated LVM. Since this is not practical, we marked those areas
with TODO/FIX IT. We are still looking for a good mechanism, and
would welcome any advice/suggestions.
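
As a rough sketch of what could replace a single hardcoded boundary, a
sorted table of segments can answer "which storage class backs this
block?" for any number of break points. Everything here is an
assumption for illustration: the storage_class values, the
struct tier_segment layout and class_of_block() are hypothetical names,
and the sketch does not say how the table would be filled (that is the
open question in this thread).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical storage classes; not part of the posted patches. */
enum storage_class {
    CLASS_UNKNOWN = -1,
    CLASS_HDD = 0,
    CLASS_SSD = 1,
    CLASS_DAX = 2
};

/* One contiguous segment of the logical volume: [start, start + len). */
struct tier_segment {
    unsigned long long start;   /* first block of the segment */
    unsigned long long len;     /* number of blocks in the segment */
    enum storage_class cls;     /* class of the device backing it */
};

/*
 * Binary search over a table that is sorted by start and has no
 * overlapping segments. Returns CLASS_UNKNOWN for unmapped blocks.
 */
static enum storage_class class_of_block(const struct tier_segment *segs,
                                         size_t n, unsigned long long block)
{
    size_t lo = 0, hi = n;

    assert(segs != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (block < segs[mid].start)
            hi = mid;                       /* block lies before this segment */
        else if (block - segs[mid].start >= segs[mid].len)
            lo = mid + 1;                   /* block lies after this segment */
        else
            return segs[mid].cls;           /* block lies inside this segment */
    }
    return CLASS_UNKNOWN;
}
```

With such a table the allocator would not need a compile-time constant,
and a layout with several interleaved fast and slow segments is handled
the same way as a layout with one break point.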

Also, we chose ext4 since it is generally the most commonly used file
system on Linux-based systems. However, I do not know whether the
problem of getting the boundaries can be solved more simply by using
XFS.

Regards,
Sayan Ghosh

On Mon, Apr 9, 2018 at 9:33 AM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Apr 6, 2018, at 4:27 PM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
>> The other thing to consider is whether it makes any sense at all to
>> solve this problem by having a single file system where part of the
>> storage is DAX, and part is not.  Why not just have two file systems,
>> one which is 100% DAX, and another which is 100% HDD/SSD, and store
>> the data in two files in two different file systems?
>
> I think there definitely *are* benefits to having both flash and HDDs
> (and/or other different storage classes such as RAID-10 and RAID-6) in
> the same filesystem namespace.  This is the premise behind bcache,
> XFS realtime volumes, Btrfs, etc.
>
> That said, having a hard-coded separation of flash vs. disks does not
> make sense, even from an intermediate development point of view.  There
> definitely should be a block-device interface for querying what the
> actual layout is, perhaps something like the SMR zones?
>
> Alternately, ext4 could add something akin to the realtime volume in
> XFS, where it can directly address multiple storage devices to handle
> different storage classes, but that would need at least some amount of
> development.  It was actually one of the options on the table for the
> early ext2resize development, to split the ext4 block groups across
> devices and then concatenate them logically at runtime.  That would
> allow e.g. some number of DAX block groups, NVMe block groups, and HDD
> RAID-6 block groups all in the same filesystem.  Even then, there would
> need to be some way for ext4 to query the storage type of the underlying
> devices, so that these could be mapped to the lifetime hints.
>
> Cheers, Andreas
>
>
>
>
>



* Re: [Patch 0/4] RFC : Support for data gradation of a single file.
  2018-04-10  9:56     ` Sayan Ghosh
@ 2018-04-10 23:39       ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2018-04-10 23:39 UTC (permalink / raw)
  To: Sayan Ghosh
  Cc: Andreas Dilger, Theodore Y. Ts'o, Ext4 Developers List,
	Linux FS Devel, Bhattacharya, Suparna, niloy ganguly,
	Madhumita Mallick, Bharde, Madhumita

On Tue, Apr 10, 2018 at 03:26:11PM +0530, Sayan Ghosh wrote:
> > That said, having a hard-coded separation of flash vs. disks does not
> > make sense, even from an intermediate development point of view.  There
> > definitely should be a block-device interface for querying what the
> > actual layout is, perhaps something like the SMR zones?
> 
> Yes, I agree, that the ideal situation would be to have a mechanism to
> identify the segment boundaries automatically inside the LVM. But we
> were not able to get a method to access the boundaries or rather the
> location of a free block in each segment by such system call.
> So, in order to just test out the system we proceeded by hardcoding
> the boundaries as per our simulated LVM. But since this is not
> practical we provided the TODO/FIX IT in those areas. We are still
> looking for a good mechanism, and would welcome any
> advice/suggestions.
> 
> Also, we chose to use Ext4 since it is generally the most commonly
> used file system in linux based systems. However, I am not aware if
> the problem of getting the boundaries can be solved in a simpler
> manner by using XFS.

With XFS: use the SSD as the data device and the HDD as the realtime
device, and select the device automatically based on the initial
allocation size, using a patchset like this one:

https://marc.info/?l=linux-xfs&m=151190613327238&w=2

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Patch 0/4] RFC : Support for data gradation of a single file.
  2018-04-06 22:27 ` Theodore Y. Ts'o
  2018-04-09  4:03   ` Andreas Dilger
@ 2018-04-10  9:52   ` Sayan Ghosh
  1 sibling, 0 replies; 10+ messages in thread
From: Sayan Ghosh @ 2018-04-10  9:52 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: linux-ext4, linux-fsdevel, Bhattacharya, Suparna, niloy ganguly,
	Madhumita Mallick, Bharde, Madhumita

Hi Theodore,

> It wasn't clear what was your purpose in posting these patches.  There
> are a large number of ways in which they simply aren't ready for
> upstream merging.  As a short list:
>
> 1)  They are against an ancient version of the kernel (4.7.2).
>
> 2)  There are a large number of TODOs in the code
>
> 3) The boundary between the two different tiers of storage is
> currently hardcoded in a header file using a #define (!).
>
>
> If the goal is to gather comments about the design, I wish you had
> presented the problem statement to the ext4 mailing list much earlier.

Yes, we want early feedback on the problem statement as well as on the
patchset in general. The next task is to rebase the code onto the
current kernel version. Also, as mentioned in the TODOs, we are
looking for better ideas on 1) how to find the boundaries of the
storage tiers automatically instead of hard-coding them, and 2) how to
detect automatically which tier is the faster one for block allocation
and view. Of the two, solving the first TODO about the boundaries
matters more for making the system robust.


> The other thing to consider is whether it makes any sense at all to
> solve this problem by having a single file system where part of the
> storage is DAX, and part is not.  Why not just have two file systems,
> one which is 100% DAX, and another which is 100% HDD/SSD, and store
> the data in two files in two different file systems?

As we mentioned in the problem statement, we are interested in
providing a reduced view of a single file in which important and
unimportant portions are interspersed, so splitting it across two
filesystems by importance would not serve our objective. Say, in the
example, a user wants the full view of the video. Splitting the video
across two filesystems would then not be ideal, as the user needs to
be provided with both the important and the unimportant blocks.
Creating a sparse layout to overlay two files would be unnecessarily
complicated. It would hence be ideal if a file carried the grade
information as metadata (extended attributes in our case), and that
information were used to place and fetch blocks properly when
necessary.
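
To illustrate how the grade metadata alone can yield both views of one
file, here is a minimal sketch of the address translation for the
reduced view. It reuses the grade structure from the cover letter; the
two helper functions are hypothetical and assume the extents are sorted
by block_num and do not overlap. It is not the code from the patch
series.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the grade structure in the cover letter. */
typedef struct _grade {
    unsigned long long block_num;   /* first high-grade block of a segment */
    unsigned long long length;      /* number of blocks in the segment */
} grade_extents;

/* Total number of blocks visible in the reduced (high-grade only) view. */
static unsigned long long reduced_view_blocks(const grade_extents *ext,
                                              size_t n)
{
    unsigned long long total = 0;

    for (size_t i = 0; i < n; i++)
        total += ext[i].length;
    return total;
}

/*
 * Translate a block index within the reduced view into the block number
 * in the full file. Returns 0 on success, -1 if idx is past the end of
 * the reduced view.
 */
static int reduced_to_file_block(const grade_extents *ext, size_t n,
                                 unsigned long long idx,
                                 unsigned long long *file_block)
{
    assert(file_block != NULL);
    for (size_t i = 0; i < n; i++) {
        if (idx < ext[i].length) {
            *file_block = ext[i].block_num + idx;
            return 0;
        }
        idx -= ext[i].length;   /* skip this segment and keep searching */
    }
    return -1;
}
```

The full view needs no translation at all, since the file keeps its
normal block numbering; only the reduced view walks the extent list.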


Regards,
Sayan Ghosh

On Sat, Apr 7, 2018 at 3:57 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> Hi Sayan,
>
> It wasn't clear what was your purpose in posting these patches.  There
> are a large number of ways in which they simply aren't ready for
> upstream merging.  As a short list:
>
> 1)  They are against an ancient version of the kernel (4.7.2).
>
> 2)  There are a large number of TODOs in the code
>
> 3) The boundary between the two different tiers of storage is
> currently hardcoded in a header file using a #define (!).
>
>
> If the goal is to gather comments about the design, I wish you had
> presented the problem statement to the ext4 mailing list much earlier.
> It might have saved you time, since we could have given you
> feedback before you had done all of this work on this patch set.
>
> Andreas' comments about making the allocation hints persistent not
> making any sense are very much on target.  Once the file is written,
> the hints won't be needed at all.
>
> In addition, you should strongly think about some way of propagating the
> fact that some blocks in a device-mapper device are backed by DAX, and
> others are not, as a device-mapper interface.  And it might not
> necessarily be a single break point where below a block number is SSD or
> HDD storage, and above a block number it's DAX storage.
>
> The other thing to consider is whether it makes any sense at all to
> solve this problem by having a single file system where part of the
> storage is DAX, and part is not.  Why not just have two file systems,
> one which is 100% DAX, and another which is 100% HDD/SSD, and store
> the data in two files in two different file systems?
>
>                                                 - Ted



end of thread, other threads:[~2018-04-11  9:20 UTC | newest]

Thread overview: 10+ messages
2018-04-06 11:41 [Patch 0/4] RFC : Support for data gradation of a single file Sayan Ghosh
2018-04-06 21:31 ` Andreas Dilger
2018-04-06 22:27 ` Theodore Y. Ts'o
2018-04-09  4:03   ` Andreas Dilger
2018-04-10  9:46     ` Sayan Ghosh
2018-04-10 18:40       ` Andreas Dilger
2018-04-11  9:20         ` Bhattacharya, Suparna
2018-04-10  9:56     ` Sayan Ghosh
2018-04-10 23:39       ` Dave Chinner
2018-04-10  9:52   ` Sayan Ghosh
