* Re: [PATCH 7/9] exofs: mkexofs
[not found] ` <20081229121423.efde9d06.akpm@linux-foundation.org>
@ 2008-12-31 15:19 ` Boaz Harrosh
2008-12-31 15:57 ` James Bottomley
2008-12-31 19:25 ` Andrew Morton
0 siblings, 2 replies; 32+ messages in thread
From: Boaz Harrosh @ 2008-12-31 15:19 UTC (permalink / raw)
To: Andrew Morton, James Bottomley
Cc: avishay, jeff, viro, linux-fsdevel, osd-dev, linux-kernel,
linux-scsi
Andrew Morton wrote:
> On Tue, 16 Dec 2008 17:33:48 +0200
> Boaz Harrosh <bharrosh@panasas.com> wrote:
>
>> We need a mechanism to prepare the file system (mkfs).
>> I chose to implement that by means of a couple of
>> mount-options. Because there is no user-mode API for committing
>> OSD commands. And also, all this stuff is highly internal to
>> the file system itself.
>>
>> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
>> can be executed by kernel code just before mount. An mkexofs utility
>> can now be implemented by means of a script that mounts and unmount the
>> file system with proper options.
>
> Doing mkfs in-kernel is unusual. I don't think the above description
> sufficiently helps the uninitiated understand why mkfs cannot be done
> in userspace as usual. Please flesh it out a bit.
There are a few main reasons.
- There is no user-mode API for initiating OSD commands. Such a subsystem
would be a hundredfold bigger than the mkfs code submitted. I think it would be
hard and stupid to maintain a complex user-mode API just for creating
a couple of objects and writing a couple of on-disk structures.
- I intend to refactor the code further to make use of more super.c services,
which will make this addition even smaller. Also, the future direction of RAID
over multiple objects will need even more kernel infrastructure, which would
mean even more code duplicated in user mode.
- I anticipate problems that are not yet addressed in this body of work
but will be in the future, mainly that a single OSD-target (lun) can
be shared by lots of FSs, and a single FS can span many OSD-targets.
Some central management is much easier to do in the kernel.
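For illustration only, a mkexofs wrapper of the kind described in the quoted
changelog (mount with mkfs=1 and format=capacity_in_meg, then unmount) might
look like the sketch below. The filesystem type name "exofs", the device and
mount point arguments, and the assumption that both options are passed on the
same mount are not taken from the patch; they are placeholders.

```python
import subprocess


def build_mkexofs_commands(device, mountpoint, capacity_meg=None, fstype="exofs"):
    """Return the mount and umount command lines a mkexofs wrapper would run.

    `fstype`, `device` and `mountpoint` are hypothetical placeholders; only
    the option names mkfs= and format= come from the patch description.
    """
    options = ["mkfs=1"]
    if capacity_meg is not None:
        # format= takes the capacity in megabytes, per the changelog.
        options.append(f"format={int(capacity_meg)}")
    mount_cmd = ["mount", "-t", fstype, "-o", ",".join(options), device, mountpoint]
    umount_cmd = ["umount", mountpoint]
    return [mount_cmd, umount_cmd]


def mkexofs(device, mountpoint, capacity_meg=None, dry_run=True):
    """Mount with the mkfs options, then unmount again.

    With dry_run=True the commands are only printed, not executed.
    """
    commands = build_mkexofs_commands(device, mountpoint, capacity_meg)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

Calling mkexofs("/dev/osd0", "/mnt/exofs", capacity_meg=1024) with the default
dry_run=True only prints the two commands, which makes it easy to check the
option string before touching a real device.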
>
> What are the dependencies for this filesystem code? I assume that it
> depends on various block- and scsi-level patches? Which ones, and
> what is their status, and is this code even compileable without them?
>
This OSD-based file system depends on the open-osd initiator library
code that I've submitted for inclusion in 2.6.29. It has been sitting
in linux-next for a while now, and has not received any comments
on the last two updated patchsets I've sent to scsi-misc/lkml. However,
it has not yet been submitted into James's scsi-misc git tree, and James
is the ultimate maintainer who should submit this work. I hope it will
still be submitted into 2.6.29, as this code is totally self-sufficient
and does not endanger or change any other Kernel subsystems.
(All the needed ground work has already been submitted to Linus since 2.6.26.)
So why should it not?
Once the open-osd initiator library is accepted, this file system
could be accepted. I was hoping for the 2.6.30 time frame (one Kernel
after the open-osd library).
> Thanks.
Thank you dear Andrew for your most valuable input.
I will constify all the code that needs it, fix the global namespace
litter, turn the macros into inlines and lowercase the inlines, and remove
the typedefs.
I will reply to the individual patches, as I have a couple of questions. But
all your comments are right and I will take care of them.
When (and if) all is fixed, through which tree/maintainer can exofs be submitted?
Thanks
Boaz
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 7/9] exofs: mkexofs
2008-12-31 15:19 ` [PATCH 7/9] exofs: mkexofs Boaz Harrosh
@ 2008-12-31 15:57 ` James Bottomley
2009-01-01 9:22 ` [osd-dev] " Benny Halevy
` (2 more replies)
2008-12-31 19:25 ` Andrew Morton
1 sibling, 3 replies; 32+ messages in thread
From: James Bottomley @ 2008-12-31 15:57 UTC (permalink / raw)
To: Boaz Harrosh
Cc: Andrew Morton, avishay, jeff, viro, linux-fsdevel, osd-dev,
linux-kernel, linux-scsi
On Wed, 2008-12-31 at 17:19 +0200, Boaz Harrosh wrote:
> Andrew Morton wrote:
> > On Tue, 16 Dec 2008 17:33:48 +0200
> > Boaz Harrosh <bharrosh@panasas.com> wrote:
> >
> >> We need a mechanism to prepare the file system (mkfs).
> >> I chose to implement that by means of a couple of
> >> mount-options. Because there is no user-mode API for committing
> >> OSD commands. And also, all this stuff is highly internal to
> >> the file system itself.
> >>
> >> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
> >> can be executed by kernel code just before mount. An mkexofs utility
> >> can now be implemented by means of a script that mounts and unmount the
> >> file system with proper options.
> >
> > Doing mkfs in-kernel is unusual. I don't think the above description
> > sufficiently helps the uninitiated understand why mkfs cannot be done
> > in userspace as usual. Please flesh it out a bit.
>
> There are a few main reasons.
> - There is no user-mode API for initiating OSD commands. Such a subsystem
> would be hundredfold bigger then the mkfs code submitted. I think it would be
> hard and stupid to maintain a complex user-mode API just for creating
> a couple of objects and writing a couple of on disk structures.
This is really a reflection of the whole problem with the OSD paradigm.
In theory, a filesystem on OSD is a thin layer of metadata mapping
objects to files. Get this right and the storage will manage things,
like security and access and attributes (there's even a natural mapping
to the VFS concept of extended attributes). Plus, the storage has
enough information to manage persistence, backups and replication.
The real problem is that no-one has actually managed to come up with a
useful VFS<->OSD mapping layer (even by extending or altering the VFS).
Every filesystem that currently uses OSD has a separate direct OSD
speaking interface (i.e. it slices out the block layer to do this and
talks directly to the storage).
I suppose this could be taken to show that such a layer is impossibly
complex, as you assert, but its lack is reflected in strange looking
design decisions like in-kernel mkfs. It would also mean that there
would be very little layered code sharing between OSD-based filesystems.
> - I intend to refactor the code further to make use of more super.c services,
> so to make this addition even smaller. Also future direction of raid over
> multiple objects will make even more kernel infrastructure needed which
> will need even more user-mode code duplication.
> - I anticipate problems that are not yet addressed in this body of work
> but will be in the future, mainly that a single OSD-target (lun) can
> be shared by lots of FSs, and a single FS can span many OSD-targets.
> Some central management is much easier to do in Kernel.
>
> >
> > What are the dependencies for this filesystem code? I assume that it
> > depends on various block- and scsi-level patches? Which ones, and
> > what is their status, and is this code even compileable without them?
> >
>
> This OSD-based file system is dependent on the open-osd initiator library
> code that I've submitted for inclusion for 2.6.29. It has been sitting
> in linux-next for a while now, and has not been receiving any comments
> for the last two updated patchsets I've sent to scsi-misc/lkml. However
> it has not yet been submitted into Jame's scsi-misc git tree, and James
> is the ultimate maintainer that should submit this work. I hope it will
> still be submitted into 2.6.29, as this code is totally self sufficient
> and does not endangers or changes any other Kernel subsystems.
> (All the needed ground work was already submitted to Linus since 2.6.26)
> So why should it not?
I don't like it mainly because it's not truly a useful general framework
for others to build on. However, as argued above, there might not
actually be such a useful framework, so as long as the only two
consumers (you and Lustre) want an interface like this, I'll put it in.
James
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 7/9] exofs: mkexofs
2008-12-31 15:19 ` [PATCH 7/9] exofs: mkexofs Boaz Harrosh
2008-12-31 15:57 ` James Bottomley
@ 2008-12-31 19:25 ` Andrew Morton
2009-01-01 13:33 ` Boaz Harrosh
1 sibling, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2008-12-31 19:25 UTC (permalink / raw)
To: Boaz Harrosh
Cc: James Bottomley, avishay, jeff, viro, linux-fsdevel, osd-dev,
linux-kernel, linux-scsi
On Wed, 31 Dec 2008 17:19:44 +0200 Boaz Harrosh <bharrosh@panasas.com> wrote:
> Andrew Morton wrote:
> > On Tue, 16 Dec 2008 17:33:48 +0200
> > Boaz Harrosh <bharrosh@panasas.com> wrote:
> >
> >> We need a mechanism to prepare the file system (mkfs).
> >> I chose to implement that by means of a couple of
> >> mount-options. Because there is no user-mode API for committing
> >> OSD commands. And also, all this stuff is highly internal to
> >> the file system itself.
> >>
> >> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
> >> can be executed by kernel code just before mount. An mkexofs utility
> >> can now be implemented by means of a script that mounts and unmount the
> >> file system with proper options.
> >
> > Doing mkfs in-kernel is unusual. I don't think the above description
> > sufficiently helps the uninitiated understand why mkfs cannot be done
> > in userspace as usual. Please flesh it out a bit.
>
> There are a few main reasons.
> - There is no user-mode API for initiating OSD commands. Such a subsystem
> would be hundredfold bigger then the mkfs code submitted. I think it would be
> hard and stupid to maintain a complex user-mode API just for creating
> a couple of objects and writing a couple of on disk structures.
> - I intend to refactor the code further to make use of more super.c services,
> so to make this addition even smaller. Also future direction of raid over
> multiple objects will make even more kernel infrastructure needed which
> will need even more user-mode code duplication.
> - I anticipate problems that are not yet addressed in this body of work
> but will be in the future, mainly that a single OSD-target (lun) can
> be shared by lots of FSs, and a single FS can span many OSD-targets.
> Some central management is much easier to do in Kernel.
OK. Please add the above info to the changelog for that patch.
> >
> > What are the dependencies for this filesystem code? I assume that it
> > depends on various block- and scsi-level patches? Which ones, and
> > what is their status, and is this code even compileable without them?
> >
>
> This OSD-based file system is dependent on the open-osd initiator library
> code that I've submitted for inclusion for 2.6.29. It has been sitting
> in linux-next for a while now, and has not been receiving any comments
> for the last two updated patchsets I've sent to scsi-misc/lkml. However
> it has not yet been submitted into Jame's scsi-misc git tree, and James
> is the ultimate maintainer that should submit this work. I hope it will
> still be submitted into 2.6.29, as this code is totally self sufficient
> and does not endangers or changes any other Kernel subsystems.
> (All the needed ground work was already submitted to Linus since 2.6.26)
> So why should it not?
>
> Once the open-osd initiator library is accepted this file system
> could be accepted. I was hoping as a 2.6.30 time frame. (One Kernel
> after the open-osd library)
>
> > Thanks.
>
> Thank you dear Andrew for your most valuable input.
>
> I will constify all the const needed code. will fix the global name space
> litter, will inline the macros and lower case the inlines. Will remove
> the typedefs.
>
> I will reply to individual patches, I have a couple of questions. But
> all your comments are right and I will take care of them.
>
> When, if, all is fixed, through which tree/maintainer can exofs be submitted?
I can merge them. Or you can run a git tree of your own, add it to
linux-next and ask Linus to pull it at the appropriate time.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [osd-dev] [PATCH 7/9] exofs: mkexofs
2008-12-31 15:57 ` James Bottomley
@ 2009-01-01 9:22 ` Benny Halevy
2009-01-01 9:54 ` Jeff Garzik
2009-01-01 23:26 ` J. Bruce Fields
2009-01-04 15:20 ` Boaz Harrosh
2009-01-06 8:40 ` Andreas Dilger
2 siblings, 2 replies; 32+ messages in thread
From: Benny Halevy @ 2009-01-01 9:22 UTC (permalink / raw)
To: James Bottomley
Cc: open-osd development, Boaz Harrosh, linux-scsi, jeff,
linux-kernel, avishay, viro, linux-fsdevel, Andrew Morton
On Dec. 31, 2008, 17:57 +0200, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> On Wed, 2008-12-31 at 17:19 +0200, Boaz Harrosh wrote:
>> Andrew Morton wrote:
>>> On Tue, 16 Dec 2008 17:33:48 +0200
>>> Boaz Harrosh <bharrosh@panasas.com> wrote:
>>>
>>>> We need a mechanism to prepare the file system (mkfs).
>>>> I chose to implement that by means of a couple of
>>>> mount-options. Because there is no user-mode API for committing
>>>> OSD commands. And also, all this stuff is highly internal to
>>>> the file system itself.
>>>>
>>>> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
>>>> can be executed by kernel code just before mount. An mkexofs utility
>>>> can now be implemented by means of a script that mounts and unmount the
>>>> file system with proper options.
>>> Doing mkfs in-kernel is unusual. I don't think the above description
>>> sufficiently helps the uninitiated understand why mkfs cannot be done
>>> in userspace as usual. Please flesh it out a bit.
>> There are a few main reasons.
>> - There is no user-mode API for initiating OSD commands. Such a subsystem
>> would be hundredfold bigger then the mkfs code submitted. I think it would be
>> hard and stupid to maintain a complex user-mode API just for creating
>> a couple of objects and writing a couple of on disk structures.
>
> This is really a reflection of the whole problem with the OSD paradigm.
>
> In theory, a filesystem on OSD is a thin layer of metadata mapping
> objects to files. Get this right and the storage will manage things,
> like security and access and attributes (there's even a natural mapping
> to the VFS concept of extended attributes). Plus, the storage has
> enough information to manage persistence, backups and replication.
>
> The real problem is that no-one has actually managed to come up with a
> useful VFS<->OSD mapping layer (even by extending or altering the VFS).
> Every filesystem that currently uses OSD has a separate direct OSD
> speaking interface (i.e. it slices out the block layer to do this and
> talks directly to the storage).
>
> I suppose this could be taken to show that such a layer is impossibly
> complex, as you assert, but its lack is reflected in strange looking
> design decisions like in-kernel mkfs. It would also mean that there
> would be very little layered code sharing between ODS based filesystems.
I think that we may need to gain some more experience to extract the
commonalities of such file systems. So far we have come up with the
lowest common denominator: the osd initiator library, which deals
with command formatting and execution, including attrs, sense status,
and security.
To provide a higher-level abstraction that would help with "administrative"
tasks like mkfs and the like, we already tossed around an idea in the past:
a file system that will represent the contents of an OSD in a namespace,
for example: partition_id / object_id / {data, attrs / ..., ctl / ...}.
Such a file system could provide a generic mapping which one could
use to easily develop management applications for the OSD. That said,
it's out of the scope of exofs which focuses mostly on the filesystem
data and metadata paths.
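As a rough sketch of that namespace idea, the mapping from an object to a
path could be as simple as the helper below. Only the layout
partition_id / object_id / {data, attrs, ctl} comes from the text above; the
function name, the hexadecimal formatting of the ids and the attribute name
used in the example are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Entry kinds under each object directory, as listed in the layout above.
VALID_KINDS = ("data", "attrs", "ctl")


def osd_object_path(partition_id, object_id, kind="data", name=None):
    """Map an OSD object (and optionally one attr/ctl entry) to a path.

    Hypothetical helper: ids are rendered in hex purely as a convention
    chosen for this sketch.
    """
    if kind not in VALID_KINDS:
        raise ValueError(f"unknown entry kind: {kind!r}")
    if kind == "data" and name is not None:
        raise ValueError("'data' is a leaf and takes no entry name")
    path = PurePosixPath(f"{partition_id:#x}") / f"{object_id:#x}" / kind
    if name is not None:
        path = path / name
    return path
```

For example, osd_object_path(0x10000, 0x10001, "attrs", "size") yields the
relative path 0x10000/0x10001/attrs/size, which a management tool could then
open under wherever such a file system was mounted.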
>
>> - I intend to refactor the code further to make use of more super.c services,
>> so to make this addition even smaller. Also future direction of raid over
>> multiple objects will make even more kernel infrastructure needed which
>> will need even more user-mode code duplication.
>> - I anticipate problems that are not yet addressed in this body of work
>> but will be in the future, mainly that a single OSD-target (lun) can
>> be shared by lots of FSs, and a single FS can span many OSD-targets.
>> Some central management is much easier to do in Kernel.
>>
>>> What are the dependencies for this filesystem code? I assume that it
>>> depends on various block- and scsi-level patches? Which ones, and
>>> what is their status, and is this code even compileable without them?
>>>
>> This OSD-based file system is dependent on the open-osd initiator library
>> code that I've submitted for inclusion for 2.6.29. It has been sitting
>> in linux-next for a while now, and has not been receiving any comments
>> for the last two updated patchsets I've sent to scsi-misc/lkml. However
>> it has not yet been submitted into Jame's scsi-misc git tree, and James
>> is the ultimate maintainer that should submit this work. I hope it will
>> still be submitted into 2.6.29, as this code is totally self sufficient
>> and does not endangers or changes any other Kernel subsystems.
>> (All the needed ground work was already submitted to Linus since 2.6.26)
>> So why should it not?
>
> I don't like it mainly because it's not truly a useful general framework
> for others to build on. However, as argued above, there might not
> actually be such a useful framework, so as long as the only two
> consumers (you and Lustre) want an interface like this, I'll put it in.
Not to mention pnfs over objects, which is just around the corner.
The pnfs-obj layout driver will use the osd initiator library as well
for distributed data I/O access (while the metadata server, to be based
on exofs, accesses the OSD for metadata and security ops too).
Benny
>
> James
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [osd-dev] [PATCH 7/9] exofs: mkexofs
2009-01-01 9:22 ` [osd-dev] " Benny Halevy
@ 2009-01-01 9:54 ` Jeff Garzik
2009-01-01 14:23 ` Benny Halevy
2009-01-01 23:26 ` J. Bruce Fields
1 sibling, 1 reply; 32+ messages in thread
From: Jeff Garzik @ 2009-01-01 9:54 UTC (permalink / raw)
To: Benny Halevy
Cc: James Bottomley, open-osd development, Boaz Harrosh, linux-scsi,
linux-kernel, avishay, viro, linux-fsdevel, Andrew Morton
Benny Halevy wrote:
> On Dec. 31, 2008, 17:57 +0200, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>> On Wed, 2008-12-31 at 17:19 +0200, Boaz Harrosh wrote:
>>> Andrew Morton wrote:
>>>> On Tue, 16 Dec 2008 17:33:48 +0200
>>>> Boaz Harrosh <bharrosh@panasas.com> wrote:
>>>>
>>>>> We need a mechanism to prepare the file system (mkfs).
>>>>> I chose to implement that by means of a couple of
>>>>> mount-options. Because there is no user-mode API for committing
>>>>> OSD commands. And also, all this stuff is highly internal to
>>>>> the file system itself.
>>>>>
>>>>> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
>>>>> can be executed by kernel code just before mount. An mkexofs utility
>>>>> can now be implemented by means of a script that mounts and unmount the
>>>>> file system with proper options.
>>>> Doing mkfs in-kernel is unusual. I don't think the above description
>>>> sufficiently helps the uninitiated understand why mkfs cannot be done
>>>> in userspace as usual. Please flesh it out a bit.
>>> There are a few main reasons.
>>> - There is no user-mode API for initiating OSD commands. Such a subsystem
>>> would be hundredfold bigger then the mkfs code submitted. I think it would be
>>> hard and stupid to maintain a complex user-mode API just for creating
>>> a couple of objects and writing a couple of on disk structures.
>> This is really a reflection of the whole problem with the OSD paradigm.
>>
>> In theory, a filesystem on OSD is a thin layer of metadata mapping
>> objects to files. Get this right and the storage will manage things,
>> like security and access and attributes (there's even a natural mapping
>> to the VFS concept of extended attributes). Plus, the storage has
>> enough information to manage persistence, backups and replication.
>>
>> The real problem is that no-one has actually managed to come up with a
>> useful VFS<->OSD mapping layer (even by extending or altering the VFS).
>> Every filesystem that currently uses OSD has a separate direct OSD
>> speaking interface (i.e. it slices out the block layer to do this and
>> talks directly to the storage).
>>
>> I suppose this could be taken to show that such a layer is impossibly
>> complex, as you assert, but its lack is reflected in strange looking
>> design decisions like in-kernel mkfs. It would also mean that there
>> would be very little layered code sharing between ODS based filesystems.
>
> I think that we may need to gain some more experience to extract the
> commonalities of such file systems. Currently we came up with the
> lowest possible denominator the osd initiator library that deals
> with command formatting and execution, including attrs, sense status,
> and security.
Not putting words in James' mouth, but I definitely agree that the
in-kernel mkfs raises a red flag or two. mkfs.ext3 for block-based
filesystems has direct and intimate knowledge of ext3 filesystem
structure, and it writes that information from userland directly to the
block(s) necessary.
Similarly, mkfs for an object-based filesystem should be issuing SCSI
commands to the OSD device from userland, AFAICS.
> To provide a higher level abstraction that would help with "administrative"
> tasks like mkfs and the like we already tossed an idea in the past -
> a file system that will represent the contents of an OSD in a namespace,
> for example: partition_id / object_id / {data, attrs / ..., ctl / ...}.
> Such a file system could provide a generic mapping which one could
> use to easily develop management applications for the OSD. That said,
> it's out of the scope of exofs which focuses mostly on the filesystem
> data and metadata paths.
That's far too complex for what is necessary. Just issue SCSI commands
from userland. We don't need an abstract interface specifically for
low-level details. The VFS is that abstract interface; anything else
should be low-level and purpose-built.
Jeff
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 7/9] exofs: mkexofs
2008-12-31 19:25 ` Andrew Morton
@ 2009-01-01 13:33 ` Boaz Harrosh
2009-01-02 22:46 ` James Bottomley
0 siblings, 1 reply; 32+ messages in thread
From: Boaz Harrosh @ 2009-01-01 13:33 UTC (permalink / raw)
To: Andrew Morton, James Bottomley
Cc: avishay, jeff, viro, linux-fsdevel, osd-dev, linux-kernel,
linux-scsi, Linus Torvalds
Andrew Morton wrote:
>>> Boaz Harrosh <bharrosh@panasas.com> wrote:
>> When, if, all is fixed, through which tree/maintainer can exofs be submitted?
>
> I can merge them. Or you can run a git tree of your own, add it to
> linux-next and ask Linus to pull it at the appropriate time.
>
Hi James
Andrew suggested that maybe I should push the exofs file system directly to
Linus, as it is pretty orthogonal to any other work. Sitting in linux-next
will quickly expose any advancements in VFS and will force me to keep
the tree up to date.
If that is so, and is accepted by Linus, would you rather that the
open-osd initiator library also be submitted through the same tree?
The conflicts with scsi are very, very narrow. The only real dependency
is the ULD being a SCSI ULD. I will routinely ask for your ACK on any scsi-
or ULD-related patches, which are very few. This way it will be easier
to manage the dependencies between the OSD work, the OSD pNFS-Objects
trees at the pNFS project, and the pNFSD+EXOFS export. One less dependency.
[I have had such a public tree at git.open-osd.org for a while now]
Thanks
Boaz
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [osd-dev] [PATCH 7/9] exofs: mkexofs
2009-01-01 9:54 ` Jeff Garzik
@ 2009-01-01 14:23 ` Benny Halevy
2009-01-01 14:28 ` Matthew Wilcox
2009-01-01 18:12 ` Jörn Engel
0 siblings, 2 replies; 32+ messages in thread
From: Benny Halevy @ 2009-01-01 14:23 UTC (permalink / raw)
To: Jeff Garzik
Cc: James Bottomley, open-osd development, Boaz Harrosh, linux-scsi,
linux-kernel, avishay, viro, linux-fsdevel, Andrew Morton
On Jan. 01, 2009, 11:54 +0200, Jeff Garzik <jeff@garzik.org> wrote:
> Benny Halevy wrote:
>> On Dec. 31, 2008, 17:57 +0200, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>>> On Wed, 2008-12-31 at 17:19 +0200, Boaz Harrosh wrote:
>>>> Andrew Morton wrote:
>>>>> On Tue, 16 Dec 2008 17:33:48 +0200
>>>>> Boaz Harrosh <bharrosh@panasas.com> wrote:
>>>>>
>>>>>> We need a mechanism to prepare the file system (mkfs).
>>>>>> I chose to implement that by means of a couple of
>>>>>> mount-options. Because there is no user-mode API for committing
>>>>>> OSD commands. And also, all this stuff is highly internal to
>>>>>> the file system itself.
>>>>>>
>>>>>> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
>>>>>> can be executed by kernel code just before mount. An mkexofs utility
>>>>>> can now be implemented by means of a script that mounts and unmount the
>>>>>> file system with proper options.
>>>>> Doing mkfs in-kernel is unusual. I don't think the above description
>>>>> sufficiently helps the uninitiated understand why mkfs cannot be done
>>>>> in userspace as usual. Please flesh it out a bit.
>>>> There are a few main reasons.
>>>> - There is no user-mode API for initiating OSD commands. Such a subsystem
>>>> would be hundredfold bigger then the mkfs code submitted. I think it would be
>>>> hard and stupid to maintain a complex user-mode API just for creating
>>>> a couple of objects and writing a couple of on disk structures.
>>> This is really a reflection of the whole problem with the OSD paradigm.
>>>
>>> In theory, a filesystem on OSD is a thin layer of metadata mapping
>>> objects to files. Get this right and the storage will manage things,
>>> like security and access and attributes (there's even a natural mapping
>>> to the VFS concept of extended attributes). Plus, the storage has
>>> enough information to manage persistence, backups and replication.
>>>
>>> The real problem is that no-one has actually managed to come up with a
>>> useful VFS<->OSD mapping layer (even by extending or altering the VFS).
>>> Every filesystem that currently uses OSD has a separate direct OSD
>>> speaking interface (i.e. it slices out the block layer to do this and
>>> talks directly to the storage).
>>>
>>> I suppose this could be taken to show that such a layer is impossibly
>>> complex, as you assert, but its lack is reflected in strange looking
>>> design decisions like in-kernel mkfs. It would also mean that there
>>> would be very little layered code sharing between ODS based filesystems.
>> I think that we may need to gain some more experience to extract the
>> commonalities of such file systems. Currently we came up with the
>> lowest possible denominator the osd initiator library that deals
>> with command formatting and execution, including attrs, sense status,
>> and security.
>
> Not putting words in James' mouth, but I definitely agree that the
> in-kernel mkfs raises a red flag or two. mkfs.ext3 for block-based
> filesystems has direct and intimate knowledge of ext3 filesystem
> structure, and it writes that information from userland directly to the
> block(s) necessary.
Personally, I'm not sure that maintaining that intimate knowledge in a
user space program is an ideal model with respect to keeping both
in sync, avoiding code duplication, and dealing with upgrade issues
(e.g. upgrading the kernel and not the user space utils).
The main advantage I can see in doing that is keeping the kernel
code small without bloating it with rarely-used logic. However,
the mkfs logic for exofs is so small that it adds little to the
module footprint, so justifying a user space util on those grounds
is questionable IMO.
>
> Similarly, mkfs for an object-based filesystem should be issuing SCSI
> commands to the OSD device from userland, AFAICS.
That's possible...
Benny
>
>
>> To provide a higher level abstraction that would help with "administrative"
>> tasks like mkfs and the like we already tossed an idea in the past -
>> a file system that will represent the contents of an OSD in a namespace,
>> for example: partition_id / object_id / {data, attrs / ..., ctl / ...}.
>> Such a file system could provide a generic mapping which one could
>> use to easily develop management applications for the OSD. That said,
>> it's out of the scope of exofs which focuses mostly on the filesystem
>> data and metadata paths.
>
> That's far too complex for what is necessary. Just issue SCSI commands
> from userland. We don't need an abstract interface specifically for
> low-level details. The VFS is that abstract interface; anything else
> should be low-level and purpose-built.
>
> Jeff
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [osd-dev] [PATCH 7/9] exofs: mkexofs
2009-01-01 14:23 ` Benny Halevy
@ 2009-01-01 14:28 ` Matthew Wilcox
2009-01-01 18:12 ` Jörn Engel
1 sibling, 0 replies; 32+ messages in thread
From: Matthew Wilcox @ 2009-01-01 14:28 UTC (permalink / raw)
To: Benny Halevy
Cc: Jeff Garzik, James Bottomley, open-osd development, Boaz Harrosh,
linux-scsi, linux-kernel, avishay, viro, linux-fsdevel,
Andrew Morton
On Thu, Jan 01, 2009 at 04:23:00PM +0200, Benny Halevy wrote:
> Personally, I'm not sure if maintaining that intimate knowledge in a
> user space program is an ideal model with respect to keeping both
> in sync, avoiding code duplication, and dealing with upgrade issues
> (e.g. upgrading the kernel and not the user space utils)
The other 30-40 filesystems that Linux supports manage to do it this
way. I'm not sure why osdfs is different in this regard.
You need to be careful with the filesystem layout anyway -- when you
upgrade the kernel, it still needs to be able to access all the files
contained in existing filesystems. And it needs to create new files
which are still readable by older kernels (users have this pesky habit
of downgrading).
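One common way to handle both directions is a set of feature bits in the
superblock, as the ext2 family does; nothing in this thread says exofs or
osdfs does this, so the sketch below is only an illustration of the general
technique. The class, field and constant names are made up for the example.

```python
from dataclasses import dataclass

# Incompatible on-disk features this (hypothetical) implementation understands.
SUPPORTED_INCOMPAT = 0x1


@dataclass(frozen=True)
class Superblock:
    # Bitmask of features an implementation must understand to mount safely.
    incompat_features: int


def can_mount(sb, supported_incompat=SUPPORTED_INCOMPAT):
    """Return True if every incompat feature bit set on disk is understood."""
    unknown = sb.incompat_features & ~supported_incompat
    return unknown == 0
```

An older kernel that only knows bit 0x1 would refuse a filesystem carrying
bit 0x2, while a newer kernel keeps old filesystems mountable simply by never
dropping bits from its supported mask.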
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [osd-dev] [PATCH 7/9] exofs: mkexofs
2009-01-01 14:23 ` Benny Halevy
2009-01-01 14:28 ` Matthew Wilcox
@ 2009-01-01 18:12 ` Jörn Engel
1 sibling, 0 replies; 32+ messages in thread
From: Jörn Engel @ 2009-01-01 18:12 UTC (permalink / raw)
To: Benny Halevy
Cc: Jeff Garzik, James Bottomley, open-osd development, Boaz Harrosh,
linux-scsi, linux-kernel, avishay, viro, linux-fsdevel,
Andrew Morton
On Thu, 1 January 2009 16:23:00 +0200, Benny Halevy wrote:
>
> Personally, I'm not sure if maintaining that intimate knowledge in a
> user space program is an ideal model with respect to keeping both
> in sync, avoiding code duplication, and dealing with upgrade issues
> (e.g. upgrading the kernel and not the user space utils)
None of those problems actually matter, because you will have them
anyway. If your filesystem is any good, someone will reimplement it for
Windows, Grub, UBoot, Solaris or some other system. And even if it
isn't any good, you still need to stay compatible with your own
implementation from last year.
Ok, maybe code duplication is a valid concern. But that will hardly
outweigh the arguments in favor of a userland mkfs. The only exception
I am aware of is jffs2, where a newly erased flash happens to be a valid
(empty) filesystem. And even there you can view flash_eraseall as a
trivial mkfs program. ;)
Jörn
--
It's just what we asked for, but not what we want!
-- anonymous
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [osd-dev] [PATCH 7/9] exofs: mkexofs
2009-01-01 9:22 ` [osd-dev] " Benny Halevy
2009-01-01 9:54 ` Jeff Garzik
@ 2009-01-01 23:26 ` J. Bruce Fields
2009-01-02 7:14 ` Benny Halevy
1 sibling, 1 reply; 32+ messages in thread
From: J. Bruce Fields @ 2009-01-01 23:26 UTC (permalink / raw)
To: Benny Halevy
Cc: James Bottomley, open-osd development, Boaz Harrosh, linux-scsi,
jeff, linux-kernel, avishay, viro, linux-fsdevel, Andrew Morton
On Thu, Jan 01, 2009 at 11:22:45AM +0200, Benny Halevy wrote:
> On Dec. 31, 2008, 17:57 +0200, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > I don't like it mainly because it's not truly a useful general framework
> > for others to build on. However, as argued above, there might not
> > actually be such a useful framework, so as long as the only two
> > consumers (you and Lustre) want an interface like this, I'll put it in.
>
> Not to mention pnfs over objects which is coming up around the corner.
> The pnfs-obj layout driver will use the osd initiator library as well
> for distributed data I/O access (while the metadata server, to be based
> on exofs accesses the OSD for metadata and security ops too)
What state is that project in right now?
--b.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [osd-dev] [PATCH 7/9] exofs: mkexofs
2009-01-01 23:26 ` J. Bruce Fields
@ 2009-01-02 7:14 ` Benny Halevy
0 siblings, 0 replies; 32+ messages in thread
From: Benny Halevy @ 2009-01-02 7:14 UTC (permalink / raw)
To: J. Bruce Fields
Cc: James Bottomley, open-osd development, Boaz Harrosh, linux-scsi,
jeff, linux-kernel, avishay, viro, linux-fsdevel, Andrew Morton
On Jan. 02, 2009, 1:26 +0200, "J. Bruce Fields" <bfields@fieldses.org> wrote:
> On Thu, Jan 01, 2009 at 11:22:45AM +0200, Benny Halevy wrote:
>> On Dec. 31, 2008, 17:57 +0200, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>>> I don't like it mainly because it's not truly a useful general framework
>>> for others to build on. However, as argued above, there might not
>>> actually be such a useful framework, so as long as the only two
>>> consumers (you and Lustre) want an interface like this, I'll put it in.
>> Not to mention pnfs over objects which is coming up around the corner.
>> The pnfs-obj layout driver will use the osd initiator library as well
>> for distributed data I/O access (while the metadata server, to be based
>> on exofs accesses the OSD for metadata and security ops too)
>
> What state is that project in right now?
I hope to release the pnfs-obj layout driver in a few weeks,
after finishing with cleaning up the nfs41 and pnfs patch sets.
Still, there's more work to be done on the back end side, exporting
exofs over (p)NFS, and then we'd be able to provide full pnfs
over objects functionality.
Benny
>
> --b.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-01 13:33 ` Boaz Harrosh
@ 2009-01-02 22:46 ` James Bottomley
2009-01-04 8:59 ` Boaz Harrosh
0 siblings, 1 reply; 32+ messages in thread
From: James Bottomley @ 2009-01-02 22:46 UTC (permalink / raw)
To: Boaz Harrosh
Cc: Andrew Morton, avishay, jeff, viro, linux-fsdevel, osd-dev,
linux-kernel, linux-scsi, Linus Torvalds
On Thu, 2009-01-01 at 15:33 +0200, Boaz Harrosh wrote:
> Andrew Morton wrote:
> >>> Boaz Harrosh <bharrosh@panasas.com> wrote:
> >> When, if, all is fixed, through which tree/maintainer can exofs be submitted?
> >
> > I can merge them. Or you can run a git tree of your own, add it to
> > linux-next and ask Linus to pull it at the appropriate time.
> >
>
> Hi James
>
> Andrew suggested that maybe I should push exofs file system directly to
> Linus as it is pretty orthogonal to any other work. Sitting in linux-next
> will quickly expose any advancements in VFS and will force me to keep
> the tree uptodate.
>
> If that is so, and is accepted by Linus, would you rather that also the
> open-osd initiator library will be submitted through the same tree?
> The conflicts with scsi are very very narrow. The only real dependency
> is the ULD being a SCSI ULD. I will routinely ask your ACK on any scsi
> or ULD related patches. Which are very few. This way it will be easier
> to manage the dependencies between the OSD work, the OSD pNFS-Objects
> trees at pNFS project, and the pNFSD+EXOFS export. One less dependency.
>
> [I already have such a public tree at git.open-osd.org for a while now]
Since it's sitting in SCSI, at least the libosd piece belongs over the
SCSI mailing list, so I think it makes sense to continue updating it via
the SCSI tree.
What's the status of the major number request from LANANA? That's patch
number one, and I haven't heard that they've confirmed the selection of
260 yet; or is LANANA now dead, and it's whoever gets the major into the
tree first?
James
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-02 22:46 ` James Bottomley
@ 2009-01-04 8:59 ` Boaz Harrosh
0 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2009-01-04 8:59 UTC (permalink / raw)
To: James Bottomley, Andrew Morton
Cc: avishay, jeff, viro, linux-fsdevel, osd-dev, linux-kernel,
linux-scsi, Linus Torvalds
James Bottomley wrote:
> On Thu, 2009-01-01 at 15:33 +0200, Boaz Harrosh wrote:
>> Andrew Morton wrote:
>>>>> Boaz Harrosh <bharrosh@panasas.com> wrote:
>>>> When, if, all is fixed, through which tree/maintainer can exofs be submitted?
>>> I can merge them. Or you can run a git tree of your own, add it to
>>> linux-next and ask Linus to pull it at the appropriate time.
>>>
>> Hi James
>>
>> Andrew suggested that maybe I should push exofs file system directly to
>> Linus as it is pretty orthogonal to any other work. Sitting in linux-next
>> will quickly expose any advancements in VFS and will force me to keep
>> the tree uptodate.
>>
>> If that is so, and is accepted by Linus, would you rather that also the
>> open-osd initiator library will be submitted through the same tree?
>> The conflicts with scsi are very very narrow. The only real dependency
>> is the ULD being a SCSI ULD. I will routinely ask your ACK on any scsi
>> or ULD related patches. Which are very few. This way it will be easier
>> to manage the dependencies between the OSD work, the OSD pNFS-Objects
>> trees at pNFS project, and the pNFSD+EXOFS export. One less dependency.
>>
>> [I already have such a public tree at git.open-osd.org for a while now]
>
> Since it's sitting in SCSI, at least the libosd piece belongs over the
> SCSI mailing list, so I think it makes sense to continue updating it via
> the SCSI tree.
>
> What's the status of the major number request from LANANA. That's patch
> number one, and I haven't heard that they've confirmed the selection of
> 260 yet; or is LANANA now dead and it's who gets the major into the tree
> first?
>
> James
>
LANANA seems dead. I was unable to get a response to any of my e-mails.
Andrew?
Thanks, James. I would personally prefer that these patches carry
your sign-off, thus gaining from your long-acquired instincts.
That would be really great.
I will send a new batch tomorrow morning, as Andrew had concerns about
some member names. If you prefer a git tree, drop me a note and
I'll send you a URL instead.
Thanks
Boaz
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 7/9] exofs: mkexofs
2008-12-31 15:57 ` James Bottomley
2009-01-01 9:22 ` [osd-dev] " Benny Halevy
@ 2009-01-04 15:20 ` Boaz Harrosh
2009-01-04 15:38 ` Christoph Hellwig
2009-01-12 18:12 ` James Bottomley
2009-01-06 8:40 ` Andreas Dilger
2 siblings, 2 replies; 32+ messages in thread
From: Boaz Harrosh @ 2009-01-04 15:20 UTC (permalink / raw)
To: James Bottomley, Matthew Wilcox, Benny Halevy, Jeff Garzik
Cc: Andrew Morton, Al Viro, Avishay Traeger, open-osd development,
linux-scsi, linux-kernel, linux-fsdevel
James Bottomley wrote:
> On Wed, 2008-12-31 at 17:19 +0200, Boaz Harrosh wrote:
>> Andrew Morton wrote:
>>> On Tue, 16 Dec 2008 17:33:48 +0200
>>> Boaz Harrosh <bharrosh@panasas.com> wrote:
>>>
>>>> We need a mechanism to prepare the file system (mkfs).
>>>> I chose to implement that by means of a couple of
>>>> mount-options. Because there is no user-mode API for committing
>>>> OSD commands. And also, all this stuff is highly internal to
>>>> the file system itself.
>>>>
>>>> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
>>>> can be executed by kernel code just before mount. An mkexofs utility
>>>> can now be implemented by means of a script that mounts and unmount the
>>>> file system with proper options.
>>> Doing mkfs in-kernel is unusual. I don't think the above description
>>> sufficiently helps the uninitiated understand why mkfs cannot be done
>>> in userspace as usual. Please flesh it out a bit.
>> There are a few main reasons.
>> - There is no user-mode API for initiating OSD commands. Such a subsystem
>> would be a hundredfold bigger than the mkfs code submitted. I think it would be
>> hard and stupid to maintain a complex user-mode API just for creating
>> a couple of objects and writing a couple of on disk structures.
>
> This is really a reflection of the whole problem with the OSD paradigm.
Certainly not a problem of the OSD paradigm, just maybe a problem
of the current code boundaries laid out by years of block-devices.
> In theory, a filesystem on OSD is a thin layer of metadata mapping
> objects to files. Get this right and the storage will manage things,
- objects to files. Get this right and the storage will manage things,
+ files to objects. Get this right and the storage will manage things,
[objects to files is what some of the osd-targets do.]
> like security and access and attributes (there's even a natural mapping
> to the VFS concept of extended attributes). Plus, the storage has
> enough information to manage persistence, backups and replication.
>
Sounds perfect to me.
> The real problem is that no-one has actually managed to come up with a
> useful VFS<->OSD mapping layer (even by extending or altering the VFS).
> Every filesystem that currently uses OSD has a separate direct OSD
> speaking interface (i.e. it slices out the block layer to do this and
> talks directly to the storage).
I'm not sure what you mean.
Let's take VFS<->BLOCKS mapping, for example. Each FS has its own
interpretation of what that means; is btrfs less perfect than xfs,
or vice versa?
I guess you did not mean "mapping" but meant "Interface" or API.
(or more likely I misunderstand the meaning of "mapping" ;)
Well, that is exactly what I was attempting to submit: a general-purpose,
low-level but easy-to-use objects API for kernel clients, be it a
dead-simple exofs or a complex multi-head beast like a pNFS-Objects
file system. The same library/API/Interface will be used for NFS clients,
NFSD servers, reconstruction, security, whatever.
The block layer is not sliced out; only the elevator function is, since
BIO merging, if any, is not device-global but per-object/file, and the
elevator does not currently support that. (Profiling shows that it will
be needed.)
BTW, block-based filesystems are only a large minority in the kernel. The
majority do not use the block layer either.
>
> I suppose this could be taken to show that such a layer is impossibly
> complex, as you assert, but its lack is reflected in strange looking
> design decisions like in-kernel mkfs. It would also mean that there
> would be very little layered code sharing between ODS based filesystems.
- would be very little layered code sharing between ODS based filesystems.
+ would be very little layered code sharing between OSD based filesystems.
I disagree.
All the OSD-Based file systems (In Linux) should absolutely only use the
open-osd library submitted. I myself will work on a couple. If anything is
missing that could not be added later, I would like to know about it.
The user-mode interface is another matter. There are some ideas, and some
are already implemented.
[Hosted on open-osd.org
see: http://git.open-osd.org/gitweb.cgi?p=osc-osd/.git;a=summary
look inside the osd-initiator directory]
And I have a toy interface that adds no new entries into the kernel, in
the form of an OSDVFS module that will let you access the raw OSD device
through the VFS name-space.
The lack of any user-mode API reflects only the lack of any current
need/priority, or the fact that I'm the only one working on OSD. It is
nothing that could not be solved in two weeks of pragmatic work. Surely
it's not a paradigm problem.
>
>> - I intend to refactor the code further to make use of more super.c services,
>> so to make this addition even smaller. Also future direction of raid over
>> multiple objects will make even more kernel infrastructure needed which
>> will need even more user-mode code duplication.
>> - I anticipate problems that are not yet addressed in this body of work
>> but will be in the future, mainly that a single OSD-target (lun) can
>> be shared by lots of FSs, and a single FS can span many OSD-targets.
>> Some central management is much easier to do in Kernel.
>>
>>> What are the dependencies for this filesystem code? I assume that it
>>> depends on various block- and scsi-level patches? Which ones, and
>>> what is their status, and is this code even compileable without them?
>>>
>> This OSD-based file system is dependent on the open-osd initiator library
>> code that I've submitted for inclusion for 2.6.29. It has been sitting
>> in linux-next for a while now, and has not been receiving any comments
>> for the last two updated patchsets I've sent to scsi-misc/lkml. However
>> it has not yet been submitted into James's scsi-misc git tree, and James
>> is the ultimate maintainer that should submit this work. I hope it will
>> still be submitted into 2.6.29, as this code is totally self sufficient
>> and does not endanger or change any other Kernel subsystems.
>> (All the needed ground work was already submitted to Linus since 2.6.26)
>> So why should it not?
>
> I don't like it mainly because it's not truly a useful general framework
> for others to build on. However, as argued above, there might not
> actually be such a useful framework, so as long as the only two
> consumers (you and Lustre) want an interface like this, I'll put it in.
>
Time will tell, but I believe the exact opposite. I believe in, and strive
for, this OSD body of work being useful to anybody who needs to talk
T10-OSD in Linux, for any purpose. Anything missing should be
easily added.
> James
>
>
To summarize the way I see it:
- James is right in that we cannot currently see the full OSD picture, since
we do not have a user-mode API, so the usefulness of it all is unclear.
[I will send an RFD soon, and hope all interested will chime in on the
discussion]
- That said, all the submitted code is still relevant and useful,
though in a few places it takes the pragmatic, easy route over
long-term correctness. [Which can be fixed]
- exofs/OSD is not the first FS that depends on a non-block-dev stack of
its own. The lower level (OSD) is represented to the kernel as a char-dev
plus an additional API, as is common in other FS/stack models. The lower OSD
level nevertheless has the potential to be a generic layer that can be used
by lots of users and use cases, not only filesystems.
Thank you James for your consideration
Boaz
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-04 15:20 ` Boaz Harrosh
@ 2009-01-04 15:38 ` Christoph Hellwig
2009-01-12 18:12 ` James Bottomley
1 sibling, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2009-01-04 15:38 UTC (permalink / raw)
To: Boaz Harrosh
Cc: James Bottomley, Matthew Wilcox, Benny Halevy, Jeff Garzik,
Andrew Morton, Al Viro, Avishay Traeger, open-osd development,
linux-scsi, linux-kernel, linux-fsdevel
On Sun, Jan 04, 2009 at 05:20:42PM +0200, Boaz Harrosh wrote:
>
> User-mode Interface is another matter. There are some ideas and some already
> implemented.
> [Hosted on open-osd.org
> see: http://git.open-osd.org/gitweb.cgi?p=osc-osd/.git;a=summary
> look inside the osd-initiator directory]
> And I have a toy interface that adds no new entries into the Kernel in
> the form of an OSDVFS module, that will let you access the raw OSD device
> through the VFS name-space.
>
> The lack of any user-mode API is just the lack of any current need/priority,
> or that I'm the only one working on OSD. But nothing that could not be solved
> in two weeks of pragmatic work. Surely it's not a paradigm problem.
For mkfs/repair, direct use by databases, etc., you want a userspace
library, too. The easiest way to get started would be to simply take the
kernel libosd and make it work on top of SG_IO.
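For anyone who has not used SG_IO from userspace, here is a rough sketch of what the transport side looks like. It only builds and submits a plain 6-byte INQUIRY as a stand-in; no OSD command is encoded, and how libosd itself would be ported is not shown. The struct layout mirrors struct sg_io_hdr from <scsi/sg.h>; the function names and the device path passed to inquiry() are hypothetical.

```python
import ctypes
import fcntl
import os

SG_IO = 0x2285            # ioctl request number from <scsi/sg.h>
SG_DXFER_FROM_DEV = -3    # data flows from the device to the host


class SgIoHdr(ctypes.Structure):
    """Mirror of struct sg_io_hdr (sg interface version 3)."""
    _fields_ = [
        ("interface_id", ctypes.c_int),
        ("dxfer_direction", ctypes.c_int),
        ("cmd_len", ctypes.c_ubyte),
        ("mx_sb_len", ctypes.c_ubyte),
        ("iovec_count", ctypes.c_ushort),
        ("dxfer_len", ctypes.c_uint),
        ("dxferp", ctypes.c_void_p),
        ("cmdp", ctypes.c_void_p),
        ("sbp", ctypes.c_void_p),
        ("timeout", ctypes.c_uint),
        ("flags", ctypes.c_uint),
        ("pack_id", ctypes.c_int),
        ("usr_ptr", ctypes.c_void_p),
        ("status", ctypes.c_ubyte),
        ("masked_status", ctypes.c_ubyte),
        ("msg_status", ctypes.c_ubyte),
        ("sb_len_wr", ctypes.c_ubyte),
        ("host_status", ctypes.c_ushort),
        ("driver_status", ctypes.c_ushort),
        ("resid", ctypes.c_int),
        ("duration", ctypes.c_uint),
        ("info", ctypes.c_uint),
    ]


def build_inquiry(alloc_len=96, timeout_ms=5000):
    """Build an SG_IO request for a standard INQUIRY (opcode 0x12).

    Returns (hdr, cdb, data, sense). The caller must keep all four alive,
    because hdr only stores raw pointers to the other three buffers.
    """
    cdb = (ctypes.c_ubyte * 6)(
        0x12, 0, 0, (alloc_len >> 8) & 0xFF, alloc_len & 0xFF, 0)
    data = ctypes.create_string_buffer(alloc_len)
    sense = ctypes.create_string_buffer(32)

    hdr = SgIoHdr()
    hdr.interface_id = ord("S")            # required marker for this interface
    hdr.dxfer_direction = SG_DXFER_FROM_DEV
    hdr.cmd_len = len(cdb)
    hdr.mx_sb_len = ctypes.sizeof(sense)
    hdr.dxfer_len = alloc_len
    hdr.dxferp = ctypes.addressof(data)
    hdr.cmdp = ctypes.addressof(cdb)
    hdr.sbp = ctypes.addressof(sense)
    hdr.timeout = timeout_ms               # milliseconds
    return hdr, cdb, data, sense


def inquiry(dev_path):
    """Send the INQUIRY to dev_path and return (scsi_status, response_bytes)."""
    hdr, cdb, data, sense = build_inquiry()
    fd = os.open(dev_path, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, SG_IO, hdr)        # the kernel fills in the output fields
    finally:
        os.close(fd)
    return hdr.status, data.raw[: hdr.dxfer_len - hdr.resid]
```

A userspace mkfs built this way would still need the command encoding and security handling that the in-kernel library already has, which is presumably why reusing the kernel code is suggested as the starting point.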
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 7/9] exofs: mkexofs
2008-12-31 15:57 ` James Bottomley
2009-01-01 9:22 ` [osd-dev] " Benny Halevy
2009-01-04 15:20 ` Boaz Harrosh
@ 2009-01-06 8:40 ` Andreas Dilger
2 siblings, 0 replies; 32+ messages in thread
From: Andreas Dilger @ 2009-01-06 8:40 UTC (permalink / raw)
To: James Bottomley
Cc: Boaz Harrosh, Andrew Morton, avishay, jeff, viro, linux-fsdevel,
osd-dev, linux-kernel, linux-scsi
On Dec 31, 2008 15:57 +0000, James Bottomley wrote:
> I don't like it mainly because it's not truly a useful general framework
> for others to build on. However, as argued above, there might not
> actually be such a useful framework, so as long as the only two
> consumers (you and Lustre) want an interface like this, I'll put it in.
To be clear - Lustre has nothing to do with T10-OSD interfaces.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-04 15:20 ` Boaz Harrosh
2009-01-04 15:38 ` Christoph Hellwig
@ 2009-01-12 18:12 ` James Bottomley
2009-01-12 19:23 ` Jeff Garzik
2009-01-12 22:48 ` Jamie Lokier
1 sibling, 2 replies; 32+ messages in thread
From: James Bottomley @ 2009-01-12 18:12 UTC (permalink / raw)
To: Boaz Harrosh
Cc: Matthew Wilcox, Benny Halevy, Jeff Garzik, Andrew Morton, Al Viro,
Avishay Traeger, open-osd development, linux-scsi, linux-kernel,
linux-fsdevel
On Sun, 2009-01-04 at 17:20 +0200, Boaz Harrosh wrote:
> James Bottomley wrote:
> > On Wed, 2008-12-31 at 17:19 +0200, Boaz Harrosh wrote:
> >> Andrew Morton wrote:
> >>> On Tue, 16 Dec 2008 17:33:48 +0200
> >>> Boaz Harrosh <bharrosh@panasas.com> wrote:
> >>>
> >>>> We need a mechanism to prepare the file system (mkfs).
> >>>> I chose to implement that by means of a couple of
> >>>> mount-options. Because there is no user-mode API for committing
> >>>> OSD commands. And also, all this stuff is highly internal to
> >>>> the file system itself.
> >>>>
> >>>> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
> >>>> can be executed by kernel code just before mount. An mkexofs utility
> >>>> can now be implemented by means of a script that mounts and unmount the
> >>>> file system with proper options.
> >>> Doing mkfs in-kernel is unusual. I don't think the above description
> >>> sufficiently helps the uninitiated understand why mkfs cannot be done
> >>> in userspace as usual. Please flesh it out a bit.
> >> There are a few main reasons.
> >> - There is no user-mode API for initiating OSD commands. Such a subsystem
> >> would be a hundredfold bigger than the mkfs code submitted. I think it would be
> >> hard and stupid to maintain a complex user-mode API just for creating
> >> a couple of objects and writing a couple of on disk structures.
> >
> > This is really a reflection of the whole problem with the OSD paradigm.
>
> Certainly not a problem of the OSD paradigm, just maybe a problem
> of the current code boundaries laid out by years of block-devices.
Not having a suggestion for redrawing the boundaries is a problem of the
paradigm. Right now, using OSD is all or nothing: there's
no migration path for block-based filesystems, or even a good idea of how
they'd take advantage of OSD. Most OSD-based filesystems are for
special-purpose things (mainly cluster FS).
> > In theory, a filesystem on OSD is a thin layer of metadata mapping
> > objects to files. Get this right and the storage will manage things,
> - objects to files. Get this right and the storage will manage things,
> + files to objects. Get this right and the storage will manage things,
> [objects to files is what some of the osd-targets do.]
> > like security and access and attributes (there's even a natural mapping
> > to the VFS concept of extended attributes). Plus, the storage has
> > enough information to manage persistence, backups and replication.
> >
>
> Sounds perfect to me.
>
> > The real problem is that no-one has actually managed to come up with a
> > useful VFS<->OSD mapping layer (even by extending or altering the VFS).
> > Every filesystem that currently uses OSD has a separate direct OSD
> > speaking interface (i.e. it slices out the block layer to do this and
> > talks directly to the storage).
>
> I'm not sure what you mean.
> Let's take VFS<->BLOCKS mapping, for example. Each FS has its own
> interpretation of what that means; is btrfs less perfect than xfs,
> or vice versa?
> I guess you did not mean "mapping" but meant "Interface" or API.
> (or more likely I misunderstand the meaning of "mapping" ;)
No ... by mapping I mean mapping of VFS functions.
For example, an OSD filesystem should be user mountable: if the user has
the security key (could possibly do this in userspace). Additionally,
an OSD with attributes should be pluggable into the VFS layer
sufficiently to allow attribute search, even if the VFS has no idea of
the metadata layout, we can still get objects back. We'd also better be
able to do backup and restore of object based devices.
The basic problem for OSD, at least as I see it, is that unless it can
provide some compelling relevance to current filesystem problems (like
attribute search is 10x faster over OSD vs block or X filesystem gets a
2x performance improvement using OSD vs block ...) it's doomed forever
to be a niche player: nice idea but no relevance to the real world.
> Well that is exactly what I was attempting to submit. A general-purpose
> low-level but easy-to-use, objects API for kernel clients. be it a
> dead-simple exofs, or a complex multi-head beast like a pNFS-Objects
> file system. The same library/API/Interface will be used for NFS-Clients
> NFSD-Servers, reconstruction, security what ever.
OK ... perhaps I missed the description of how a general purpose
filesystem might use this then?
> The block-layer is not sliced out, Only the elevator function is, since
> BIO merging, if any, are not device global but per-object/file, and the
> elevator does not currently support that. (Profiling shows that it will
> be needed)
Um, your submission path is character. You pick up block again because
SCSI uses it for queues, but it's not really part of your paradigm.
> BTW. The block-based filesystems are just a big minority in Kernel. The
> majority does not use block-layer either.
>
> >
> > I suppose this could be taken to show that such a layer is impossibly
> > complex, as you assert, but its lack is reflected in strange looking
> > design decisions like in-kernel mkfs. It would also mean that there
> > would be very little layered code sharing between ODS based filesystems.
> - would be very little layered code sharing between ODS based filesystems.
> + would be very little layered code sharing between OSD based filesystems.
>
> I disagree.
> All the OSD-Based file systems (In Linux) should absolutely only use the
> open-osd library submitted. I myself will work on a couple. If anything is
> missing that could not be added later, I would like to know about it.
But that's precisely the problem: "OSD based filesystems" implies that
if you want to use OSD, you write a new filesystem.
> User-mode Interface is another matter. There are some ideas and some already
> implemented.
> [Hosted on open-osd.org
> see: http://git.open-osd.org/gitweb.cgi?p=osc-osd/.git;a=summary
> look inside the osd-initiator directory]
> And I have a toy interface that adds no new entries into the Kernel in
> the form of an OSDVFS module, that will let you access the raw OSD device
> through the VFS name-space.
OK, so this is moving it more towards general usability.
> The lack of any user-mode API is just the lack of any current need/priority,
> or that I'm the only one working on OSD. But nothing that could not be solved
> in two weeks of pragmatic work. Surely it's not a paradigm problem.
It's an indicator of one. If you buy my premise that OSD cannot be
relevant without compelling user cases, then the lack of a user API can
be viewed as a symptom of this.
> >
> >> - I intend to refactor the code further to make use of more super.c services,
> >> so to make this addition even smaller. Also future direction of raid over
> >> multiple objects will make even more kernel infrastructure needed which
> >> will need even more user-mode code duplication.
> >> - I anticipate problems that are not yet addressed in this body of work
> >> but will be in the future, mainly that a single OSD-target (lun) can
> >> be shared by lots of FSs, and a single FS can span many OSD-targets.
> >> Some central management is much easier to do in Kernel.
> >>
> >>> What are the dependencies for this filesystem code? I assume that it
> >>> depends on various block- and scsi-level patches? Which ones, and
> >>> what is their status, and is this code even compileable without them?
> >>>
> >> This OSD-based file system is dependent on the open-osd initiator library
> >> code that I've submitted for inclusion for 2.6.29. It has been sitting
> >> in linux-next for a while now, and has not been receiving any comments
> >> for the last two updated patchsets I've sent to scsi-misc/lkml. However
> >> it has not yet been submitted into James's scsi-misc git tree, and James
> >> is the ultimate maintainer that should submit this work. I hope it will
> >> still be submitted into 2.6.29, as this code is totally self sufficient
> >> and does not endanger or change any other Kernel subsystems.
> >> (All the needed ground work was already submitted to Linus since 2.6.26)
> >> So why should it not?
> >
> > I don't like it mainly because it's not truly a useful general framework
> > for others to build on. However, as argued above, there might not
> > actually be such a useful framework, so as long as the only two
> > consumers (you and Lustre) want an interface like this, I'll put it in.
> >
>
> Time will tell, but I believe the exact opposite. I believe and strive
> for this OSD body of work to be useful for anybody that needs to talk
> T10-OSD in Linux, be it for any-purpose. Any thing missing should be
> easily added.
>
> > James
> >
> >
>
> To summarize the way I see it:
> - James is right in that we can not currently see the full OSD picture since
> we do not have a user-mode API, so the usefulness of it all is unclear.
> [I will send an RFD soon, and hope all interested will chime in on the
> discussion]
> - That said, all the submitted code is still relevant and useful,
> though at few places it takes the route of pragmatic-easy vs
> long-term-correctness. [Which can be fixed]
> - exofs/OSD is not the first FS that depends on a none-block-dev/its-own
> stack. The lower level (OSD) is represented to kernel as a char-dev +
> Additional API, common to other FS/stack models. Though the lower OSD
> level has the potential to be a generic layer that can be used by lots
> of users and use cases, not only FS type.
Right, so I'm reasonably happy to accept libosd for what it is: an
enabler for a few specialised applications.
I think your choice of using a character device will turn out to be a
design mistake because the migration path of existing filesystems is
bound to be a block device with extra features (which they may or may
not make use of) but only if there's a way to make OSD relevant to
users.
James
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-12 18:12 ` James Bottomley
@ 2009-01-12 19:23 ` Jeff Garzik
2009-01-12 19:56 ` James Bottomley
2009-01-12 22:48 ` Jamie Lokier
1 sibling, 1 reply; 32+ messages in thread
From: Jeff Garzik @ 2009-01-12 19:23 UTC (permalink / raw)
To: James Bottomley
Cc: Boaz Harrosh, Matthew Wilcox, Benny Halevy, Andrew Morton,
Al Viro, Avishay Traeger, open-osd development, linux-scsi,
linux-kernel, linux-fsdevel
James Bottomley wrote:
> On Sun, 2009-01-04 at 17:20 +0200, Boaz Harrosh wrote:
>> James Bottomley wrote:
>>> On Wed, 2008-12-31 at 17:19 +0200, Boaz Harrosh wrote:
>>>> Andrew Morton wrote:
>>>>> On Tue, 16 Dec 2008 17:33:48 +0200
>>>>> Boaz Harrosh <bharrosh@panasas.com> wrote:
>>>>> Doing mkfs in-kernel is unusual. I don't think the above description
>>>>> sufficiently helps the uninitiated understand why mkfs cannot be done
>>>>> in userspace as usual. Please flesh it out a bit.
>>>> There are a few main reasons.
>>>> - There is no user-mode API for initiating OSD commands. Such a subsystem
>>>> would be a hundredfold bigger than the mkfs code submitted. I think it would be
>>>> hard and stupid to maintain a complex user-mode API just for creating
>>>> a couple of objects and writing a couple of on disk structures.
>>> This is really a reflection of the whole problem with the OSD paradigm.
>> Certainly not a problem of the OSD paradigm, just maybe a problem
>> of the current code boundaries laid out by years of block-devices.
>
> Not having a suggestion for redrawing the boundaries is a problem of the
> paradigm. Right at the moment using OSD is an all or nothing, there's
> no migration path for block based filesystems, or even a good idea how
> they'd take advantage of OSD. Most OSD based filesystems are for
> special purpose things (mainly cluster FS).
I think you both are talking past each other a bit.
There is no inherent "problem with the paradigm" with regards to
creating a userspace mkfs and userspace filesystem access library.
Yes, it's annoying to maintain two parallel codebases, but from
experience we have found that that is what is best. A userspace library
is used by a wide variety of users: specialized filesystem tools,
filesystem repair tools, filesystem creation and optimization tools,
FUSE implementations, the list goes on.
It has nothing to do with "block-based code boundaries".
History and experience have shown that we want a minimal, purpose-built
filesystem in the kernel, with all the other filesystem tools external
to the kernel. That has proven the most robust over time, IMO (although
noises about in-kernel fsck are beginning to appear)
>>> In theory, a filesystem on OSD is a thin layer of metadata mapping
>>> objects to files. Get this right and the storage will manage things,
>> - objects to files. Get this right and the storage will manage things,
>> + files to objects. Get this right and the storage will manage things,
>> [objects to files is what some of the osd-targets do.]
>>> like security and access and attributes (there's even a natural mapping
>>> to the VFS concept of extended attributes). Plus, the storage has
>>> enough information to manage persistence, backups and replication.
I'm a bit lost in the quoting, but to respond...
One should not assume that an in-kernel OSD filesystem will
simply turn all the "inode-ish" (object manipulation) duties over wholesale
to the OSD storage device(s). That is an implementation detail.
To conjure an example, an OSD filesystem designer may wish to store
collections of VFS extended attributes as a single OSD object, for
performance or caching reasons.
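A minimal sketch of that idea, modelled in user-space Python rather than kernel code: a collection of extended attributes is serialized into one buffer that could then be written as the data of a single object. The on-disk layout (a count followed by length-prefixed name/value pairs) and the names pack_xattrs/unpack_xattrs are assumptions made for illustration, not anything exofs defines.

```python
import struct

def pack_xattrs(xattrs):
    """Serialize {name: value_bytes} into one buffer.

    Hypothetical layout: u32 count, then for each attribute
    u16 name length, u32 value length, name bytes, value bytes.
    """
    out = [struct.pack("<I", len(xattrs))]
    for name in sorted(xattrs):
        key = name.encode("utf-8")
        val = xattrs[name]
        out.append(struct.pack("<HI", len(key), len(val)))
        out.append(key)
        out.append(val)
    return b"".join(out)

def unpack_xattrs(buf):
    """Inverse of pack_xattrs: rebuild the {name: value_bytes} dict."""
    (count,) = struct.unpack_from("<I", buf, 0)
    pos = 4
    result = {}
    for _ in range(count):
        klen, vlen = struct.unpack_from("<HI", buf, pos)
        pos += 6  # size of the "<HI" header (2 + 4 bytes, no padding)
        name = buf[pos:pos + klen].decode("utf-8")
        pos += klen
        result[name] = buf[pos:pos + vlen]
        pos += vlen
    return result
```

With a layout like this, reading all attributes of a file costs one object read instead of one request per attribute, which is the caching argument made above.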
Or, as discussed at the filesystem/storage summit I attended, a separate
layer handles replication and OSD device aggregation (read: RAID) just
like MD manages RAID[0156] now.
>>> The real problem is that no-one has actually managed to come up with a
>>> useful VFS<->OSD mapping layer (even by extending or altering the VFS).
>>> Every filesystem that currently uses OSD has a separate direct OSD
>>> speaking interface (i.e. it slices out the block layer to do this and
>>> talks directly to the storage).
>> I'm not sure what you mean.
>> Lets take VFS<->BLOCKS mapping for example. Each FS has it's own
>> interpretation of what that means, brtfs is less perfect then xfs
>> or vice versa?
>> I guess you did not mean "mapping" but meant "Interface" or API.
>> (or more likely I misunderstand the meaning of "mapping" ;)
>
> No ... by mapping I mean mapping of VFS functions.
>
> For example, an OSD filesystem should be user mountable: if the user has
I think that's setting the bar too high. It would be nice if an OSD
filesystem were user-mountable, but that obviously is less compatible
with existing tools, admin knowledge, and site policies.
> the security key (could possibly do this in userspace). Additionally,
> an OSD with attributes should be pluggable into the VFS layer
> sufficiently to allow attribute search, even if the VFS has no idea of
> the metadata layout, we can still get objects back. We'd also better be
> able to do backup and restore of object based devices.
Sure. tar/cpio/pax at the userspace level, or exofs-specific
dump+restore tools running in userspace. Just like with other
filesystems :)
> The basic problem for OSD, at least as I see it is that unless it can
> provide some compelling relevance to current filesystem problems (like
> attribute search is 10x faster over OSD vs block or X filesystem gets a
> 2x performance improvement using OSD vs block ...) it's doomed forever
> to be a niche player: nice idea but no relevance to the real world.
Let's get exofs into the kernel, and prove you wrong (or right).
I know you have wonderful anecdotes about how OSD has been around
forever and you consider it a failed paradigm; but new work is occurring,
and people are talking about how this might be the successor to
sector-based devices.
Let's not be closed-minded and close doors before they can be opened.
At this point, OSD is a fun and interesting research experiment that
might have promise for the future.
That's Linux's bread-n-butter: be on the cutting edge, experimenting
with new technologies. Some pan out, others don't.
But I don't see any compelling reason for an overall pushback _against_
OSD devices and filesystems.
>> Well that is exactly what I was attempting to submit. A general-purpose
>> low-level but easy-to-use, objects API for kernel clients. be it a
>> dead-simple exofs, or a complex multi-head beast like a pNFS-Objects
>> file system. The same library/API/Interface will be used for NFS-Clients
>> NFSD-Servers, reconstruction, security what ever.
>
> OK ... perhaps I missed the description of how a general purpose
> filesystem might use this then?
>
>> The block-layer is not sliced out, Only the elevator function is, since
>> BIO merging, if any, are not device global but per-object/file, and the
>> elevator does not currently support that. (Profiling shows that it will
>> be needed)
>
> Um, your submission path is character. You pick up block again because
> SCSI uses it for queues, but it's not really part of your paradigm.
>
>> BTW. The block-based filesystems are just a big minority in Kernel. The
>> majority does not use block-layer either.
>>
>>> I suppose this could be taken to show that such a layer is impossibly
>>> complex, as you assert, but its lack is reflected in strange looking
>>> design decisions like in-kernel mkfs. It would also mean that there
>>> would be very little layered code sharing between ODS based filesystems.
>> - would be very little layered code sharing between ODS based filesystems.
>> + would be very little layered code sharing between OSD based filesystems.
>>
>> I disagree.
>> All the OSD-Based file systems (In Linux) should absolutely only use the
>> open-osd library submitted. I myself will work on a couple. If anything is
>> missing that could not be added later, I would like to know about it.
>
> But that's precisely the problem: "OSD based filesystems" implying that
> if you want to use OSD you write a new filesystem.
Are you somehow assuming that existing block-based filesystems will take
advantage of OSD? I hope not; that would be silly.
_Of course_ using OSD implies a new filesystem. You are using a wholly
different method of interacting with storage.
Just like NFS implies a new filesystem, because networked RPC is wholly
different from sector-based storage as well.
> It's an indicator of one. If you buy my premise that OSD cannot be
> relevant without compelling user cases, then the lack of a user API can
> be viewed as a symptom of this.
If having a compelling user case was a prereq for kernel inclusion, well
over half the code would be gone.
> I think your choice of using a character device will turn out to be a
> design mistake because the migration path of existing filesystems is
> bound to be a block device with extra features (which they may or may
> not make use of) but only if there's a way to make ODS relevant to
> users.
It is fantasy to think we will be migrating ext4 to OSD. That fantasy
is not a compelling reason to block OSD development.
To sum,
* exofs needs a userspace library, around which the standard filesystem
tools will be built, most notably mkfs, dump, restore, fsck
* talk of migrating existing filesystems is wildly premature (and a bit
of a silly argument, since you are also arguing that OSD lacks
compelling use cases)
* an in-kernel OSD-based filesystem needs some sort of generic in-kernel
libosd API, so that multiple OSD filesystems do not reinvent the wheel
each time.
* OSD was bound to be annoying, because it forces the kernel filesystem
to either (a) talk SCSI or (b) use messages that can be converted to
SCSI OSD commands, like existing drivers convert the block layer's READ
and WRITE to device-specific commands.
* Trying to force OSD to export a block device is pushing a square peg
through a round hole. Thus, the best (and only) alternative is a
character device. What you really want is a Third Way(tm): a mmap'able
message device, since you really want to export an API to userspace.
Jeff
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-12 19:23 ` Jeff Garzik
@ 2009-01-12 19:56 ` James Bottomley
2009-01-12 20:22 ` Jeff Garzik
0 siblings, 1 reply; 32+ messages in thread
From: James Bottomley @ 2009-01-12 19:56 UTC (permalink / raw)
To: Jeff Garzik
Cc: Boaz Harrosh, Matthew Wilcox, Benny Halevy, Andrew Morton,
Al Viro, Avishay Traeger, open-osd development, linux-scsi,
linux-kernel, linux-fsdevel
On Mon, 2009-01-12 at 14:23 -0500, Jeff Garzik wrote:
> > It's an indicator of one. If you buy my premise that OSD cannot be
> > relevant without compelling user cases, then the lack of a user API can
> > be viewed as a symptom of this.
>
> If having a compelling user case was a prereq for kernel inclusion, well
> over half the code would be gone.
I'm not holding this against inclusion ... I'm saying it's a symptom of
the generic relevance to user issues problem that OSD has.
> > I think your choice of using a character device will turn out to be a
> > design mistake because the migration path of existing filesystems is
> > bound to be a block device with extra features (which they may or may
> > not make use of) but only if there's a way to make ODS relevant to
> > users.
>
> It is fantasy to think we will be migrating ext4 to OSD. That fantasy
> is not a compelling reason to block OSD development.
OK, so your quote managed to miss this bit:
"Right, so I'm reasonably happy to accept libosd for what it is: an
enabler for a few specialised applications. "
I can't see how that can be construed as "blocking OSD development".
The word "accept" is conventionally used in Linux parlance to mean "will
send upstream".
> To sum,
>
> * exofs needs a userspace library, around which the standard filesystem
> tools will be built, most notably mkfs, dump, restore, fsck
>
> * talk of migrating existing filesystems is wildly premature (and a bit
> of a silly argument, since you are also arguing that OSD lacks
> compelling use cases)
So criticising lacking compelling use cases while at the same time
suggesting how to find them is wrong?
Actually, if the only use case OSD can bring to the table is requiring
new filesystems, then there's nothing of general user relevance for it
on the horizon ... anywhere. There's never going to be a compelling
reason to move the consumer OSDs in the various development labs to
production because nothing would be able to use them on a mass scale.
If we could derive a benefit from OSD in existing filesystems, then they
do have user relevance, and Seagate and the others might just consider
releasing the devices.
Note that "providing benefit to" does not equate to "rewriting the
filesystem for" ... and it shouldn't; the benefit really should be
incremental. And that's the crux of my criticism. While OSD are
separate things that we have to rewrite whole filesystems for, they're
never going to set the world on fire. If they could be used with only
incremental effort, they might. The bridge for the incremental effort
will come from a properly designed kernel API.
> * an in-kernel OSD-based filesystem needs some sort of generic in-kernel
> libosd API, so that multiple OSD filesystems do not reinvent the wheel
> each time.
>
> * OSD was bound to be annoying, because it forces the kernel filesystem
> to either (a) talk SCSI or (b) use messages that can be converted to
> SCSI OSD commands, like existing drivers convert the block layer's READ
> and WRITE to device-specific commands.
OK, so what you're arguing is that unlike block devices where we can
produce a useful generic abstraction that is protocol agnostic, for OSD
we can't? As I've said before, I think this might be true, but fear it
dooms OSD to being too difficult to use.
> * Trying to force OSD to export a block device is pushing a square peg
> through a round hole. Thus, the best (and only) alternative is
> character device. What you really want is a Third Way(tm): a mmap'able
> message device, since you really want to export an API to userspace.
Only allowing a character tap raises the effort bar on getting other
filesystems to use it, because they're all block based ... that's what I
think is the mistake.
James
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-12 19:56 ` James Bottomley
@ 2009-01-12 20:22 ` Jeff Garzik
2009-01-12 23:25 ` James Bottomley
0 siblings, 1 reply; 32+ messages in thread
From: Jeff Garzik @ 2009-01-12 20:22 UTC (permalink / raw)
To: James Bottomley
Cc: Boaz Harrosh, Matthew Wilcox, Benny Halevy, Andrew Morton,
Al Viro, Avishay Traeger, open-osd development, linux-scsi,
linux-kernel, linux-fsdevel
James Bottomley wrote:
> On Mon, 2009-01-12 at 14:23 -0500, Jeff Garzik wrote:
>>> It's an indicator of one. If you buy my premise that OSD cannot be
>>> relevant without compelling user cases, then the lack of a user API can
>>> be viewed as a symptom of this.
>> If having a compelling user case was a prereq for kernel inclusion, well
>> over half the code would be gone.
>
> I'm not holding this against inclusion ... I'm saying it's a symptom of
> the generic relevance to user issues problem that OSD has.
>
>>> I think your choice of using a character device will turn out to be a
>>> design mistake because the migration path of existing filesystems is
>>> bound to be a block device with extra features (which they may or may
>>> not make use of) but only if there's a way to make ODS relevant to
>>> users.
>> It is fantasy to think we will be migrating ext4 to OSD. That fantasy
>> is not a compelling reason to block OSD development.
>
> OK, so your quote managed to miss this bit:
>
> "Right, so I'm reasonably happy to accept libosd for what it is: an
> enabler for a few specialised applications. "
>
> I can't see how that can be construed as "blocking OSD development".
> The word "accept" is conventionally used in Linux parlance to mean "will
> send upstream".
Yet you continue to expend energy complaining about migrating
block-based filesystems to OSD, a complex, overhead-laden undertaking
_no one_ has proposed or entertained.
>> To sum,
>>
>> * exofs needs a userspace library, around which the standard filesystem
>> tools will be built, most notably mkfs, dump, restore, fsck
>>
>> * talk of migrating existing filesystems is wildly premature (and a bit
>> of a silly argument, since you are also arguing that OSD lacks
>> compelling use cases)
>
> So criticising lacking compelling use cases while at the same time
> suggesting how to find them is wrong?
>
> Actually, If the only use case OSD can bring to the table is requiring
> new filesystems, then there's nothing of general user relevance for it
> on the horizon ... anywhere. There's never going to be a compelling
> reason to move the consumer OSDs in the various development labs to
> production because nothing would be able to use them on a mass scale.
> If we could derive a benefit from OSD in existing filesystems, then they
> do have user relevance, and Seagate and the others might just consider
> releasing the devices.
If Seagate were to release a production OSD device, do you really think
they would prefer a block-based filesystem hacked to work with OSDs? I
don't think so.
Existing block filesystems are very much purpose built for sector-based
storage as implemented on modern storage devices. No kernel API can
hand-wave that away.
The whole point of OSDs is to move some of the overhead to the storage
device, not _add_ to the overhead.
> Note that "providing benefit to" does not equate to "rewriting the
> filesystem for" ... and it shouldn't; the benefit really should be
> incremental. And that's the crux of my criticism. While OSD are
> separate things that we have to rewrite whole filesystems for, they're
> never going to set the world on fire. If they could be used with only
> incremental effort, they might. The bridge for the incremental effort
> will come from a properly designed kernel API.
Well, hey, if you wanna expend energy creating a kernel API that
presents a complex OSD as simple block-based storage, go for it. AFAICS
it's just extra overhead and complexity when a new filesystem could do
the job much better.
And I seriously doubt Linus or anyone else will want to hack up a
block-based filesystem in this manner. Better to create a silly "for
argument's sake" OSD block device, upon which any block-based filesystem
can be mounted. (Note I said block device, _not_ filesystem)
>> * an in-kernel OSD-based filesystem needs some sort of generic in-kernel
>> libosd API, so that multiple OSD filesystems do not reinvent the wheel
>> each time.
>>
>> * OSD was bound to be annoying, because it forces the kernel filesystem
>> to either (a) talk SCSI or (b) use messages that can be converted to
>> SCSI OSD commands, like existing drivers convert the block layer's READ
>> and WRITE to device-specific commands.
>
> OK, so what you're arguing is that unlike block devices where we can
> produce a useful generic abstraction that is protocol agnostic, for OSD
> we can't? As I've said before, I think this might be true, but fear it
> dooms OSD to being too difficult to use.
No, a generic abstraction is "(b)" in my quoted paragraph.
But it's certainly easy to create an OSD block device client, that
simulates sector-based storage, if you are motivated in that direction.
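A rough sketch of what such a client would have to do, as a user-space Python model: translate a sector range into pieces addressed by (object index, byte offset, length). The fixed 1 MiB object size, the 512-byte sector size, and the function names are assumptions for illustration only; nothing here is taken from the submitted patches.

```python
SECTOR_SIZE = 512          # bytes per emulated sector (assumed)
OBJECT_SIZE = 1 << 20      # hypothetical: each backing object holds 1 MiB

def sector_to_object(sector):
    """Return (object index, byte offset inside that object) for a sector."""
    byte_off = sector * SECTOR_SIZE
    return byte_off // OBJECT_SIZE, byte_off % OBJECT_SIZE

def split_request(start_sector, nr_sectors):
    """Split a sector range into per-object (index, offset, length) pieces."""
    pieces = []
    pos = start_sector * SECTOR_SIZE
    end = pos + nr_sectors * SECTOR_SIZE
    while pos < end:
        idx, off = divmod(pos, OBJECT_SIZE)
        # Never cross an object boundary within one piece.
        length = min(OBJECT_SIZE - off, end - pos)
        pieces.append((idx, off, length))
        pos += length
    return pieces
```

Every emulated sector access pays for this translation on top of whatever mapping the OSD target performs internally, which is the extra overhead referred to here.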
But that only makes sense if you want the extra overhead (square peg,
round hole), which no sane person will want. Face it, only screwballs
want to mount ext4 on an OSD.
>> * Trying to force OSD to export a block device is pushing a square peg
>> through a round hole. Thus, the best (and only) alternative is
>> character device. What you really want is a Third Way(tm): a mmap'able
>> message device, since you really want to export an API to userspace.
>
> only allowing a character tap raises the effort bar on getting other
> filesystems to use it, because they're all block based ...
That's irrelevant, since no one is calling for block-based filesystems
to be converted to use OSD.
And I can only imagine the push-back, should someone actually propose
doing so. Filesystems are very much purpose-built for their storage
paradigm.
Jeff
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-12 18:12 ` James Bottomley
2009-01-12 19:23 ` Jeff Garzik
@ 2009-01-12 22:48 ` Jamie Lokier
1 sibling, 0 replies; 32+ messages in thread
From: Jamie Lokier @ 2009-01-12 22:48 UTC (permalink / raw)
To: James Bottomley
Cc: Boaz Harrosh, Matthew Wilcox, Benny Halevy, Jeff Garzik,
Andrew Morton, Al Viro, Avishay Traeger, open-osd development,
linux-scsi, linux-kernel, linux-fsdevel
James Bottomley wrote:
> Um, your submission path is character. You pick up block again because
> SCSI uses it for queues, but it's not really part of your paradigm.
> I think your choice of using a character device will turn out to be a
> design mistake because the migration path of existing filesystems is
> bound to be a block device with extra features (which they may or may
> not make use of) but only if there's a way to make ODS relevant to
> users.
We mount character devices already when it's appropriate.
Look at JFFS, JFFS2, UBIFS and LOGFS. All of them operate on MTD
devices, which are character device interfaces to flash storage, using
the common MTD interface instead of the block layer.
This is quite correct, because block devices have specific
characteristics (generic block caching and ability to read/write each
block independently) which neither flash nor OSDs have.
Imho, OSDs are similar to flash in this respect. There is no
fixed-size block/sector indexed storage device, therefore a block
device would be wrong.
Admittedly lumping everything else under "character" is daft, when you
can't read and write character streams to the device, but that's unix
for you. Character device used to mean serial ports etc. until it
became "any old crap that's not a block device". :-)
-- Jamie
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-12 20:22 ` Jeff Garzik
@ 2009-01-12 23:25 ` James Bottomley
2009-01-13 13:03 ` [osd-dev] " Benny Halevy
2009-01-13 13:44 ` Jeff Garzik
0 siblings, 2 replies; 32+ messages in thread
From: James Bottomley @ 2009-01-12 23:25 UTC (permalink / raw)
To: Jeff Garzik
Cc: Boaz Harrosh, Matthew Wilcox, Benny Halevy, Andrew Morton,
Al Viro, Avishay Traeger, open-osd development, linux-scsi,
linux-kernel, linux-fsdevel
On Mon, 2009-01-12 at 15:22 -0500, Jeff Garzik wrote:
> James Bottomley wrote:
> > On Mon, 2009-01-12 at 14:23 -0500, Jeff Garzik wrote:
> >>> It's an indicator of one. If you buy my premise that OSD cannot be
> >>> relevant without compelling user cases, then the lack of a user API can
> >>> be viewed as a symptom of this.
> >> If having a compelling user case was a prereq for kernel inclusion, well
> >> over half the code would be gone.
> >
> > I'm not holding this against inclusion ... I'm saying it's a symptom of
> > the generic relevance to user issues problem that OSD has.
> >
> >>> I think your choice of using a character device will turn out to be a
> >>> design mistake because the migration path of existing filesystems is
> >>> bound to be a block device with extra features (which they may or may
> >>> not make use of) but only if there's a way to make ODS relevant to
> >>> users.
> >> It is fantasy to think we will be migrating ext4 to OSD. That fantasy
> >> is not a compelling reason to block OSD development.
> >
> > OK, so your quote managed to miss this bit:
> >
> > "Right, so I'm reasonably happy to accept libosd for what it is: an
> > enabler for a few specialised applications. "
> >
> > I can't see how that can be construed as "blocking OSD development".
> > The word "accept" is conventionally used in Linux parlance to mean "will
> > send upstream".
>
> Yet you continue to expend energy complaining about migrating
> block-based filesystems to OSD, a complex, overhead-laden undertaking
> _no one_ has proposed or entertained.
You're the one who keeps suggesting migration, not me. I keep
suggesting ways to make OSD more relevant to current user problems.
A maintainer doesn't have to like everything they merge.
> >> To sum,
> >>
> >> * exofs needs a userspace library, around which the standard filesystem
> >> tools will be built, most notably mkfs, dump, restore, fsck
> >>
> >> * talk of migrating existing filesystems is wildly premature (and a bit
> >> of a silly argument, since you are also arguing that OSD lacks
> >> compelling use cases)
> >
> > So criticising lacking compelling use cases while at the same time
> > suggesting how to find them is wrong?
> >
> > Actually, If the only use case OSD can bring to the table is requiring
> > new filesystems, then there's nothing of general user relevance for it
> > on the horizon ... anywhere. There's never going to be a compelling
> > reason to move the consumer OSDs in the various development labs to
> > production because nothing would be able to use them on a mass scale.
>
> > If we could derive a benefit from OSD in existing filesystems, then they
> > do have user relevance, and Seagate and the others might just consider
> > releasing the devices.
>
> If Seagate were to release a production OSD device, do you really think
> they would prefer a block-based filesystem hacked to work with OSDs? I
> don't think so.
Um, speaking with my business hat on, I'd really beg to differ ... you
don't release a product into an empty market. You pick an existing one,
or fill a fundamental need that a market nucleates around. If that
means block based filesystems hacked to work with OSDs, I think they'd
take it, yes.
> Existing block filesystems are very much purpose built for sector-based
> storage as implemented on modern storage devices. No kernel API can
> hand-wave that away.
>
> The whole point of OSDs is to move some of the overhead to the storage
> device, not _add_ to the overhead.
Well, that was the idea, with OSD version 1. The problem is that the
benchmarks didn't confirm that letting the disk take care of object
placement was a win over block based filesystems. If you want to
migrate objects across disks (i.e. cfs paradigm), then it is a win, but
not really for performance. That's why OSDv2 has been beefing up
attributes and security.
The interesting question is what does it take to allow arbitrary
filesystems to benefit from this.
> > Note that "providing benefit to" does not equate to "rewriting the
> > filesystem for" ... and it shouldn't; the benefit really should be
> > incremental. And that's the crux of my criticism. While OSD are
> > separate things that we have to rewrite whole filesystems for, they're
> > never going to set the world on fire. If they could be used with only
> > incremental effort, they might. The bridge for the incremental effort
> > will come from a properly designed kernel API.
>
> Well, hey, if you wanna expend energy creating a kernel API that
> presents a complex OSD as simple block-based storage, go for it. AFAICS
> it's just extra overhead and complexity when a new filesystem could do
> the job much better.
Because writing a new filesystem is so much easier?
> And I seriously doubt Linus or anyone else will want to hack up a
> block-based filesystem in this manner. Better to create a silly "for
> argument's sake" OSD block device, upon which any block-based filesystem
> can be mounted. (Note I said block device, _not_ filesystem)
That's a possibility ... as I said before: a block device with extra
features that allows incremental use in the filesystem.
> >> * an in-kernel OSD-based filesystem needs some sort of generic in-kernel
> >> libosd API, so that multiple OSD filesystems do not reinvent the wheel
> >> each time.
> >>
> >> * OSD was bound to be annoying, because it forces the kernel filesystem
> >> to either (a) talk SCSI or (b) use messages that can be converted to
> >> SCSI OSD commands, like existing drivers convert the block layer's READ
> >> and WRITE to device-specific commands.
> >
> > OK, so what you're arguing is that unlike block devices where we can
> > produce a useful generic abstraction that is protocol agnostic, for OSD
> > we can't? As I've said before, I think this might be true, but fear it
> > dooms OSD to being too difficult to use.
>
> No, a generic abstraction is "(b)" in my quoted paragraph.
>
> But it's certainly easy to create an OSD block device client, that
> simulates sector-based storage, if you are motivated in that direction.
>
> But that only makes sense if you want the extra overhead (square peg,
> round hole), which no sane person will want. Face it, only screwballs
> want to mount ext4 on an OSD.
So what's your proposal for lowering the barrier to adoption then?
> >> * Trying to force OSD to export a block device is pushing a square peg
> >> through a round hole. Thus, the best (and only) alternative is
> >> character device. What you really want is a Third Way(tm): a mmap'able
> >> message device, since you really want to export an API to userspace.
> >
> > only allowing a character tap raises the effort bar on getting other
> > filesystems to use it, because they're all block based ...
>
> That's irrelevant, since no one is calling for block-based filesystems
> to be converted to use OSD.
It's relevant to lowering the barrier to adoption, unless there's some
other means I haven't seen.
> And I can only imagine the push-back, should someone actually propose
> doing so. Filesystems are very much purpose-built for their storage
> paradigm.
Filesystems are complex and difficult beasts to get right. Btrfs took a
year to get to the point of kernel inclusion and will take some little
time longer to get enterprises to the point of trusting data to it. So
if we say a two year lead time, that would mean that even if someone
started a general purpose OSD based filesystem today, it wouldn't be
ready for the consumer market until 2011. That's not really going to
convince the disk vendors that OSD based devices should be marketed
today.
James
* Re: [osd-dev] [PATCH 7/9] exofs: mkexofs
2009-01-12 23:25 ` James Bottomley
@ 2009-01-13 13:03 ` Benny Halevy
2009-01-13 13:24 ` Jeff Garzik
2009-01-13 13:44 ` Jeff Garzik
1 sibling, 1 reply; 32+ messages in thread
From: Benny Halevy @ 2009-01-13 13:03 UTC (permalink / raw)
To: James Bottomley
Cc: open-osd development, Jeff Garzik, linux-scsi, Matthew Wilcox,
linux-kernel, Avishay Traeger, Al Viro, linux-fsdevel,
Andrew Morton
On Jan. 13, 2009, 1:25 +0200, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> On Mon, 2009-01-12 at 15:22 -0500, Jeff Garzik wrote:
>> James Bottomley wrote:
>>> On Mon, 2009-01-12 at 14:23 -0500, Jeff Garzik wrote:
>>>>> It's an indicator of one. If you buy my premise that OSD cannot be
>>>>> relevant without compelling user cases, then the lack of a user API can
>>>>> be viewed as a symptom of this.
>>>> If having a compelling user case was a prereq for kernel inclusion, well
>>>> over half the code would be gone.
>>> I'm not holding this against inclusion ... I'm saying it's a symptom of
>>> the generic relevance to user issues problem that OSD has.
>>>
>>>>> I think your choice of using a character device will turn out to be a
>>>>> design mistake because the migration path of existing filesystems is
>>>>> bound to be a block device with extra features (which they may or may
>>>>> not make use of) but only if there's a way to make ODS relevant to
>>>>> users.
>>>> It is fantasy to think we will be migrating ext4 to OSD. That fantasy
>>>> is not a compelling reason to block OSD development.
>>> OK, so your quote managed to miss this bit:
>>>
>>> "Right, so I'm reasonably happy to accept libosd for what it is: an
>>> enabler for a few specialised applications. "
>>>
>>> I can't see how that can be construed as "blocking OSD development".
>>> The word "accept" is conventionally used in Linux parlance to mean "will
>>> send upstream".
>> Yet you continue to expend energy complaining about migrating
>> block-based filesystems to OSD, a complex, overhead-laden undertaking
>> _no one_ has proposed or entertained.
>
> You're the one who keeps suggesting migration, not me. I keep
> suggesting ways to make OSD more relevant to current user problems.
>
> A maintainer doesn't have to like everything they merge.
>
>>>> To sum,
>>>>
>>>> * exofs needs a userspace library, around which the standard filesystem
>>>> tools will be built, most notably mkfs, dump, restore, fsck
>>>>
>>>> * talk of migrating existing filesystems is wildly premature (and a bit
>>>> of a silly argument, since you are also arguing that OSD lacks
>>>> compelling use cases)
>>> So criticising lacking compelling use cases while at the same time
>>> suggesting how to find them is wrong?
>>>
>>> Actually, If the only use case OSD can bring to the table is requiring
>>> new filesystems, then there's nothing of general user relevance for it
>>> on the horizon ... anywhere. There's never going to be a compelling
>>> reason to move the consumer OSDs in the various development labs to
>>> production because nothing would be able to use them on a mass scale.
>>> If we could derive a benefit from OSD in existing filesystems, then they
>>> do have user relevance, and Seagate and the others might just consider
>>> releasing the devices.
>> If Seagate were to release a production OSD device, do you really think
>> they would prefer a block-based filesystem hacked to work with OSDs? I
>> don't think so.
>
> Um, speaking with my business hat on, I'd really beg to differ ... you
> don't release a product into an empty market. you pick an existing one,
> or fill a fundamental need that a market nucleates around. If that
> means block based filesystems hacked to work with OSDs, I think they'd
> take it, yes.
>
>> Existing block filesystems are very much purpose built for sector-based
>> storage as implemented on modern storage devices. No kernel API can
>> hand-wave that away.
>>
>> The whole point of OSDs is to move some of the overhead to the storage
>> device, not _add_ to the overhead.
>
> Well, that was the idea, with OSD version 1. The problem is that the
> benchmarks didn't confirm that letting the disk take care of object
> placement was a win over block based filesystems. If you want to
> migrate objects across disks (i.e. cfs paradigm), then it is a win, but
> not really for performance. That's why OSDv2 has been beefing up
> attributes and security.
IMO the main advantage of moving block allocation down to the OSD target
is more apparent with distributed file systems a-la pNFS over objects,
where parallelizing that task is key to scalable performance.
The thing is that the target needs to implement its own mapping from
object logical offsets into disk blocks and this is usually done
using some kind of a (possibly trimmed down) local file system.
Therefore the I/O performance of a single OSD is likely to be similar
to a single file server's. I'm not sure what will be the case when comparing
an OSD with a local file system mounted over a block device over
a storage network, e.g. FC or iSCSI - that could be an interesting
research topic. I guess that the main issue there is to cache enough
metadata on the host to minimize transfer latencies (assuming
latency of a directly attached device is always better than
a fabric-attached one).
Anyhow, capacity management via partitions and object allocation,
plus quotas, and the fine grain OSD security model is a big one
that's worth investigating, to say the least.
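
As an illustration of the mapping mentioned above (object logical offsets to disk blocks), here is a minimal sketch in Python. It is not taken from any real OSD target: the `ExtentMap` class, its method names, and the 4096-byte block size are all assumptions made for the example. It only shows the kind of per-object bookkeeping a target has to keep, which is why a trimmed-down local file system is a natural way to implement it.

```python
from bisect import bisect_right

BLOCK_SIZE = 4096  # assumed block size of the hypothetical target


class ExtentMap:
    """Hypothetical per-object map from logical byte offsets to disk blocks."""

    def __init__(self):
        # Sorted list of (logical_start_block, disk_start_block, block_count).
        self.extents = []

    def add_extent(self, logical_block, disk_block, count):
        self.extents.append((logical_block, disk_block, count))
        self.extents.sort()

    def lookup(self, offset):
        """Return (disk_block, byte_within_block), or None for a hole."""
        logical_block, within = divmod(offset, BLOCK_SIZE)
        starts = [e[0] for e in self.extents]
        # Find the last extent that starts at or before the requested block.
        i = bisect_right(starts, logical_block) - 1
        if i < 0:
            return None
        start, disk, count = self.extents[i]
        if logical_block >= start + count:
            return None  # falls in a gap between extents
        return disk + (logical_block - start), within
```

An initiator never sees this table; it only names an object and a byte range, and the target resolves the range to blocks on its own.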
>
> The interesting question is what does it take to allow arbitrary
> filesystems to benefit from this.
One direction is to mount the file system over an object or a set
of objects exported via exofs using mount -o loop.
The user doesn't necessarily have to be a filesystem. It could
also be a database... or anything that typically works
over a block device.
Benny
>
>>> Note that "providing benefit to" does not equate to "rewriting the
>>> filesystem for" ... and it shouldn't; the benefit really should be
>>> incremental. And that's the crux of my criticism. While OSD are
>>> separate things that we have to rewrite whole filesystems for, they're
>>> never going to set the world on fire. If they could be used with only
>>> incremental effort, they might. The bridge for the incremental effort
>>> will come from a properly designed kernel API.
>> Well, hey, if you wanna expend energy creating a kernel API that
>> presents a complex OSD as simple block-based storage, go for it. AFAICS
>> it's just extra overhead and complexity when a new filesystem could do
>> the job much better.
>
> Because writing a new filesystem is so much easier?
>
>> And I seriously doubt Linus or anyone else will want to hack up a
>> block-based filesystem in this manner. Better to create a silly "for
>> argument's sake" OSD block device, upon which any block-based filesystem
>> can be mounted. (Note I said block device, _not_ filesystem)
>
> That's a possibility ... as I said before: a block device with extra
> features that allows incremental use in the filesystem.
I can understand representing a single object as a block device (although I
think that using a file for that should be good enough and easier) but
why represent the whole OSD as a block device? The OSD holds partitions
and objects, each with attributes and OSD security related support. Hence
representing that in a namespace using a filesystem seems straightforward.
Benny
>
>>>> * an in-kernel OSD-based filesystem needs some sort of generic in-kernel
>>>> libosd API, so that multiple OSD filesystems do not reinvent the wheel
>>>> each time.
>>>>
>>>> * OSD was bound to be annoying, because it forces the kernel filesystem
>>>> to either (a) talk SCSI or (b) use messages that can be converted to
>>>> SCSI OSD commands, like existing drivers convert the block layer's READ
>>>> and WRITE to device-specific commands.
>>> OK, so what you're arguing is that unlike block devices where we can
>>> produce a useful generic abstraction that is protocol agnostic, for OSD
>>> we can't? As I've said before, I think this might be true, but fear it
>>> dooms OSD to being too difficult to use.
>> No, a generic abstraction is "(b)" in my quoted paragraph.
>>
>> But it's certainly easy to create an OSD block device client, that
>> simulates sector-based storage, if you are motivated in that direction.
>>
>> But that only makes sense if you want the extra overhead (square peg,
>> round hole), which no sane person will want. Face it, only screwballs
>> want to mount ext4 on an OSD.
>
> So what's your proposal for lowering the barrier to adoption then?
>
>>>> * Trying to force OSD to export a block device is pushing a square peg
>>>> through a round hole. Thus, the best (and only) alternative is
>>>> character device. What you really want is a Third Way(tm): a mmap'able
>>>> message device, since you really want to export an API to userspace.
>>> only allowing a character tap raises the effort bar on getting other
>>> filesystems to use it, because they're all block based ...
>> That's irrelevant, since no one is calling for block-based filesystems
>> to be converted to use OSD.
>
> It's relevant to lowering the barrier to adoption, unless there's some
> other means I haven't seen.
>
>> And I can only imagine the push-back, should someone actually propose
>> doing so. Filesystems are very much purpose-built for their storage
>> paradigm.
>
> Filesystems are complex and difficult beasts to get right. Btrfs took a
> year to get to the point of kernel inclusion and will take some little
> time longer to get enterprises to the point of trusting data to it. So
> if we say a two year lead time, that would mean that even if someone
> started a general purpose OSD based filesystem today, it wouldn't be
> ready for the consumer market until 2011. That's not really going to
> convince the disk vendors that OSD based devices should be marketed
> today.
>
> James
>
>
* Re: [osd-dev] [PATCH 7/9] exofs: mkexofs
2009-01-13 13:03 ` [osd-dev] " Benny Halevy
@ 2009-01-13 13:24 ` Jeff Garzik
2009-01-13 13:32 ` Benny Halevy
0 siblings, 1 reply; 32+ messages in thread
From: Jeff Garzik @ 2009-01-13 13:24 UTC (permalink / raw)
To: Benny Halevy
Cc: James Bottomley, open-osd development, linux-scsi, Matthew Wilcox,
linux-kernel, Avishay Traeger, Al Viro, linux-fsdevel,
Andrew Morton
Benny Halevy wrote:
> IMO the main advantage of moving block allocation down to the OSD target
> is more apparent with distributed file systems a-la pNFS over objects
> where paralleling that task is a key for scalable performance.
>
> The thing is that the target needs to implement its own mapping from
> object logical offsets into disk blocks and this is usually done
> using some kind of a (possibly trimmed down) local file system.
> Therefore the I/O performance of a single OSD is likely to be similar
> to a single file server's.
Well, modern SATA devices are already mini-filesystems internally, when
you consider logical block remapping etc.
And the claim by drive research guys at the filesystem/storage summit
was that OSD offered the potential to better optimize storage based on
access/usage patterns.
(of course, whether or not reality bears out this guess is another question)
> I can understand representing a single object as a block device (although I
> think that using a file for that should be good enough and easier) but
> why representing the whole OSD as a block device? The OSD holds partitions
> and objects each with attributes and OSD security related support. Hence
> representing that in a namespace using a filesystem seems straight forward.
I am actually considering writing a simple "osdblk" driver, that would
represent a single object as a block device.
This would NOT replace exofs or other OSD filesystems, but it would be
nice to have, and it will give me more experience with OSDs.
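
To make the "single object as a block device" idea concrete, here is a minimal sketch in Python. It is not the osdblk driver and does not use libosd: `ObjectStore` is a hypothetical in-memory stand-in for one OSD object, and `ObjectBlockDevice` is an assumed name. The only point it shows is that sector-addressed I/O becomes a byte-range read or write on the object at offset `sector * 512`.

```python
SECTOR_SIZE = 512  # classic block-layer sector size


class ObjectStore:
    """Hypothetical in-memory stand-in for one OSD object (not libosd)."""

    def __init__(self):
        self.data = bytearray()

    def write(self, offset, payload):
        end = offset + len(payload)
        if end > len(self.data):
            # Grow the object with zeros so the write fits.
            self.data.extend(bytes(end - len(self.data)))
        self.data[offset:end] = payload

    def read(self, offset, length):
        chunk = bytes(self.data[offset:offset + length])
        # Ranges that were never written read back as zeros.
        return chunk + bytes(length - len(chunk))


class ObjectBlockDevice:
    """Presents one object as a linear array of fixed-size sectors."""

    def __init__(self, obj):
        self.obj = obj

    def write_sectors(self, sector, payload):
        if len(payload) % SECTOR_SIZE:
            raise ValueError("payload must be a whole number of sectors")
        self.obj.write(sector * SECTOR_SIZE, payload)

    def read_sectors(self, sector, count):
        return self.obj.read(sector * SECTOR_SIZE, count * SECTOR_SIZE)
```

A real driver would issue OSD READ and WRITE commands for the object instead of touching a bytearray, but the offset arithmetic would presumably look much the same.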
Jeff
* Re: [osd-dev] [PATCH 7/9] exofs: mkexofs
2009-01-13 13:24 ` Jeff Garzik
@ 2009-01-13 13:32 ` Benny Halevy
0 siblings, 0 replies; 32+ messages in thread
From: Benny Halevy @ 2009-01-13 13:32 UTC (permalink / raw)
To: Jeff Garzik
Cc: linux-scsi, Matthew Wilcox, linux-kernel, James Bottomley,
Avishay Traeger, open-osd development, linux-fsdevel,
Andrew Morton, Al Viro, Boaz Harrosh
On Jan. 13, 2009, 15:24 +0200, Jeff Garzik <jeff@garzik.org> wrote:
> Benny Halevy wrote:
>> IMO the main advantage of moving block allocation down to the OSD target
>> is more apparent with distributed file systems a-la pNFS over objects
>> where paralleling that task is a key for scalable performance.
>>
>> The thing is that the target needs to implement its own mapping from
>> object logical offsets into disk blocks and this is usually done
>> using some kind of a (possibly trimmed down) local file system.
>> Therefore the I/O performance of a single OSD is likely to be similar
>> to a single file server's.
>
> Well, modern SATA devices are already mini-filesystems internally, when
> you consider logical block remapping etc.
>
> And the claim by drive research guys at the filesystem/storage summit
> was that OSD offered the potential to better optimize storage based on
> access/usage patterns.
>
> (of course, whether or not reality bears out this guess is another question)
That's true for multi-user access, where knowing the context for each I/O
request - i.e. the object that holds it - provides a crucial hint for
read-ahead and write allocation, whereas for a dumb device that doesn't
know anything about the filesystem's internals, it's much harder to
associate different blocks with their respective containers, or "streams"
(in case the container is typically accessed in a sequential pattern).
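
The read-ahead hint can be illustrated with a small Python sketch. Everything here is hypothetical (the `StreamTracker` name, the `window` parameter, the prefetch policy); it only shows that when each request carries an object ID, a target can detect sequential streams per object even when requests from several users are interleaved.

```python
class StreamTracker:
    """Hypothetical per-object sequential-access detector for read-ahead."""

    def __init__(self, window=4):
        self.window = window      # assumed tunable: prefetch this many request sizes
        self.next_expected = {}   # object_id -> offset right after the last read

    def on_read(self, object_id, offset, length):
        """Return the (offset, length) range to prefetch, or None."""
        # A read is sequential if it starts where the previous read of the
        # same object ended, regardless of what other objects did in between.
        sequential = self.next_expected.get(object_id) == offset
        self.next_expected[object_id] = offset + length
        if sequential:
            return (offset + length, length * self.window)
        return None
```

A block device seeing the same interleaved requests would only see jumps between unrelated sector ranges, and would have to guess which ones belong together.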
>
>
>> I can understand representing a single object as a block device (although I
>> think that using a file for that should be good enough and easier) but
>> why representing the whole OSD as a block device? The OSD holds partitions
>> and objects each with attributes and OSD security related support. Hence
>> representing that in a namespace using a filesystem seems straight forward.
>
> I am actually considering writing a simple "osdblk" driver, that would
> represent a single object as a block device.
>
> This would NOT replace exofs or other OSD filesystems, but it would be
> nice to have, and it will give me more experience with OSDs.
That's awesome!
It would be really interesting to benchmark one against the other.
Benny
>
> Jeff
>
>
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-12 23:25 ` James Bottomley
2009-01-13 13:03 ` [osd-dev] " Benny Halevy
@ 2009-01-13 13:44 ` Jeff Garzik
2009-01-13 14:03 ` Alan Cox
1 sibling, 1 reply; 32+ messages in thread
From: Jeff Garzik @ 2009-01-13 13:44 UTC (permalink / raw)
To: James Bottomley
Cc: Boaz Harrosh, Matthew Wilcox, Benny Halevy, Andrew Morton,
Al Viro, Avishay Traeger, open-osd development, linux-scsi,
linux-kernel, linux-fsdevel
James Bottomley wrote:
> On Mon, 2009-01-12 at 15:22 -0500, Jeff Garzik wrote:
>> If Seagate were to release a production OSD device, do you really think
>> they would prefer a block-based filesystem hacked to work with OSDs? I
>> don't think so.
>
> Um, speaking with my business hat on, I'd really beg to differ ... you
> don't release a product into an empty market. you pick an existing one,
> or fill a fundamental need that a market nucleates around. If that
> means block based filesystems hacked to work with OSDs, I think they'd
> take it, yes.
It seems unlikely drive manufacturers would get excited about a
sub-optimal solution that does not even approach using the full
potential of the product.
Plus, given the existence of an OSD-specific filesystem (exofs, at the
very least), it seems unlikely that end users who own OSDs would choose
the sub-optimal solution when an OSD-specific filesystem exists.
>>> Note that "providing benefit to" does not equate to "rewriting the
>>> filesystem for" ... and it shouldn't; the benefit really should be
>>> incremental. And that's the crux of my criticism. While OSD are
>>> separate things that we have to rewrite whole filesystems for, they're
>>> never going to set the world on fire. If they could be used with only
>>> incremental effort, they might. The bridge for the incremental effort
>>> will come from a properly designed kernel API.
>> Well, hey, if you wanna expend energy creating a kernel API that
>> presents a complex OSD as simple block-based storage, go for it. AFAICS
>> it's just extra overhead and complexity when a new filesystem could do
>> the job much better.
>
> Because writing a new filesystem is so much easier?
Yes, easier -- both technically and politically -- than hacking XFS or
ext4 to support two vastly different storage APIs (linear sector or
object-based).
It might be a tad easier to hack btrfs to do objects.
>>>> * an in-kernel OSD-based filesystem needs some sort of generic in-kernel
>>>> libosd API, so that multiple OSD filesystems do not reinvent the wheel
>>>> each time.
>>>>
>>>> * OSD was bound to be annoying, because it forces the kernel filesystem
>>>> to either (a) talk SCSI or (b) use messages that can be converted to
>>>> SCSI OSD commands, like existing drivers convert the block layer's READ
>>>> and WRITE to device-specific commands.
>>> OK, so what you're arguing is that unlike block devices where we can
>>> produce a useful generic abstraction that is protocol agnostic, for OSD
>>> we can't? As I've said before, I think this might be true, but fear it
>>> dooms OSD to being too difficult to use.
>> No, a generic abstraction is "(b)" in my quoted paragraph.
>>
>> But it's certainly easy to create an OSD block device client, that
>> simulates sector-based storage, if you are motivated in that direction.
>>
>> But that only makes sense if you want the extra overhead (square peg,
>> round hole), which no sane person will want. Face it, only screwballs
>> want to mount ext4 on an OSD.
>
> So what's your proposal for lowering the barrier to adoption then?
Once exofs is in upstream, installers can easily choose that when an OSD
device is detected.
> Filesystems are complex and difficult beasts to get right. Btrfs took a
> year to get to the point of kernel inclusion and will take some little
> time longer to get enterprises to the point of trusting data to it. So
> if we say a two year lead time, that would mean that even if someone
> started a general purpose OSD based filesystem today, it wouldn't be
> ready for the consumer market until 2011. That's not really going to
> convince the disk vendors that OSD based devices should be marketed
> today.
And you have a similar sales job and lag time, when hacking -- read
destabilizing -- a filesystem to work with OSDs as well as sector-based
devices.
Jeff
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-13 13:44 ` Jeff Garzik
@ 2009-01-13 14:03 ` Alan Cox
2009-01-13 14:17 ` Jeff Garzik
0 siblings, 1 reply; 32+ messages in thread
From: Alan Cox @ 2009-01-13 14:03 UTC (permalink / raw)
To: Jeff Garzik
Cc: James Bottomley, Boaz Harrosh, Matthew Wilcox, Benny Halevy,
Andrew Morton, Al Viro, Avishay Traeger, open-osd development,
linux-scsi, linux-kernel, linux-fsdevel
> It seems unlikely drive manufacturers would get excited about a
> sub-optimal solution that does not even approach using the full
> potential of the product.
You forgot the more important people:
Mr Customer, would you like your data centre to use a new magic OSD fs or
the existing one you trust?
Now in my experience that is a *dumb* question because the answer is
obvious...
> Plus, given the existence of an OSD-specific filesystem (exofs, at the
> very least), it seems unlikely that end users who own OSDs would choose
> the sub-optimal solution when an OSD-specific filesystem exists.
Actually until you can show zillions of users stably using them the
people with the money won't buy them in the first place 8)
> > ready for the consumer market until 2011. That's not really going to
> > convince the disk vendors that OSD based devices should be marketed
> > today.
>
> And you have a similar sales job and lag time, when hacking -- read
> destabilizing -- a filesystem to work with OSDs as well as sector-based
> devices.
2011 sounds optimistic for major OSD adoption in any space except for
flash storage where OSD type knowledge means you can do much better jobs
on erase management.
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-13 14:03 ` Alan Cox
@ 2009-01-13 14:17 ` Jeff Garzik
2009-01-13 16:14 ` Alan Cox
0 siblings, 1 reply; 32+ messages in thread
From: Jeff Garzik @ 2009-01-13 14:17 UTC (permalink / raw)
To: Alan Cox
Cc: James Bottomley, Boaz Harrosh, Matthew Wilcox, Benny Halevy,
Andrew Morton, Al Viro, Avishay Traeger, open-osd development,
linux-scsi, linux-kernel, linux-fsdevel
Alan Cox wrote:
>> It seems unlikely drive manufacturers would get excited about a
>> sub-optimal solution that does not even approach using the full
>> potential of the product.
>
> You forgot the more important people
>
> Mr Customer, would you like your data centre to use a new magic OSD fs or
> the existing one you trust.
>
> Now in my experience that is a *dumb* question because the answer is
> obvious...
The choice is between "new magic OSD fs" and "new fs that used to be
ext4, before we hacked it up".
"existing one you trust" is not an option...
>> Plus, given the existence of an OSD-specific filesystem (exofs, at the
>> very least), it seems unlikely that end users who own OSDs would choose
>> the sub-optimal solution when an OSD-specific filesystem exists.
>
> Actually until you can show zillions of users stably using them the
> people with the money won't buy them in the first place 8)
Yeah, at this point the discussion devolves into talk of carts, horses,
chickens and eggs... :)
>>> ready for the consumer market until 2011. That's not really going to
>>> convince the disk vendors that OSD based devices should be marketed
>>> today.
>> And you have a similar sales job and lag time, when hacking -- read
>> destabilizing -- a filesystem to work with OSDs as well as sector-based
>> devices.
>
> 2011 sounds optimistic for major OSD adoption in any space except for
> flash storage where OSD type knowledge means you can do much better jobs
> on erase management.
His number, not mine...
At this point OSD is a fun and interesting research project.
Overall, I think Linux should have OSD support so that we are ready for
whatever the future brings. Even if OSD goes nowhere, it will still
have more users than many of the existing Linux drivers and architectures :)
Jeff
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-13 14:17 ` Jeff Garzik
@ 2009-01-13 16:14 ` Alan Cox
2009-01-13 17:21 ` Boaz Harrosh
0 siblings, 1 reply; 32+ messages in thread
From: Alan Cox @ 2009-01-13 16:14 UTC (permalink / raw)
To: Jeff Garzik
Cc: James Bottomley, Boaz Harrosh, Matthew Wilcox, Benny Halevy,
Andrew Morton, Al Viro, Avishay Traeger, open-osd development,
linux-scsi, linux-kernel, linux-fsdevel
> > Now in my experience that is a *dumb* question because the answer is
> > obvious...
>
> The choice is between "new magic OSD fs" and "new fs that used to be
> ext4, before we hacked it up".
>
> "existing one you trust" is not an option...
No it isn't. The choice is existing technology followed by a "thank you
goodbye Mr OSD salesman".
I'm not saying we shouldn't work on an OSD file system, and I'm glad IBM
folks are, but it can be done slowly. Also, for most fs folks an OSD
emulator for testing might not be a bad idea - say one stacked on ext3 8)
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-13 16:14 ` Alan Cox
@ 2009-01-13 17:21 ` Boaz Harrosh
2009-01-21 18:13 ` Jeff Garzik
0 siblings, 1 reply; 32+ messages in thread
From: Boaz Harrosh @ 2009-01-13 17:21 UTC (permalink / raw)
To: Alan Cox
Cc: Jeff Garzik, James Bottomley, Matthew Wilcox, Benny Halevy,
Andrew Morton, Al Viro, Avishay Traeger, open-osd development,
linux-scsi, linux-kernel, linux-fsdevel
Alan Cox wrote:
>>> Now in my experience that is a *dumb* question because the answer is
>>> obvious...
>> The choice is between "new magic OSD fs" and "new fs that used to be
>> ext4, before we hacked it up".
>>
>> "existing one you trust" is not an option...
>
> No it isn't. The choice is existing technology followed by a "thank you
> goodbye Mr OSD salesman".
>
> I'm not saying we shouldn't work on an OSD file system and I'm glad IBM
> folks are but that it can be done slowly.
IBM has not been working on OSD for a long time now. We at open-osd are.
That is me and Benny (a bit) and other people who hang out on the mailing list.
So it is mostly Panasas these days.
On git.open-osd.org we are hosting various OSD projects, mainly the submitted
work plus code inherited from OSC, which is no longer active as of Q3 2008.
> Also for most fs folks an OSD
> emulator testing might not be a bad idea - say one stacked on ext3 8)
>
One of the projects on open-osd.org is OSC's osd-target, which is based
on the SCSI tgt framework and implements an OSD in user mode over any local
filesystem. It supports any SCSI transport supported by tgt, that is: iSCSI,
FCoE, iSER, kernel-tgt. This is what we test against. I have just been porting
that project to FreeBSD. It has a very small footprint compared to, let's say, NFS.
Thanks
Boaz
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-13 17:21 ` Boaz Harrosh
@ 2009-01-21 18:13 ` Jeff Garzik
2009-01-21 18:44 ` Boaz Harrosh
0 siblings, 1 reply; 32+ messages in thread
From: Jeff Garzik @ 2009-01-21 18:13 UTC (permalink / raw)
To: Boaz Harrosh
Cc: Alan Cox, James Bottomley, Matthew Wilcox, Benny Halevy,
Andrew Morton, Al Viro, Avishay Traeger, open-osd development,
linux-scsi, linux-kernel, linux-fsdevel
BTW, where can the latest libosd be found?
I'll want to use that for osdblk (export a single OSD object as a Linux
block device).
Jeff
* Re: [PATCH 7/9] exofs: mkexofs
2009-01-21 18:13 ` Jeff Garzik
@ 2009-01-21 18:44 ` Boaz Harrosh
0 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2009-01-21 18:44 UTC (permalink / raw)
To: Jeff Garzik
Cc: Alan Cox, James Bottomley, Matthew Wilcox, Benny Halevy,
Andrew Morton, Al Viro, Avishay Traeger, open-osd development,
linux-scsi, linux-kernel, linux-fsdevel
Jeff Garzik wrote:
> BTW, where can the latest libosd be found?
>
> I'll want to use that for osdblk (export a single OSD object as a Linux
> block device).
>
> Jeff
>
You are most welcome, thank you.
The in-kernel patches are at:
git clone git://git.open-osd.org/linux-open-osd.git linux-next
or on the web at:
http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/linux-next
But you might find the out-of-tree project more complete:
git clone git://git.open-osd.org/open-osd.git
or on the web at:
http://git.open-osd.org/gitweb.cgi?p=open-osd.git;a=shortlog;h=refs/heads/exofs
To set up an osd-target and all that, this is also hosted on open-osd.org.
Please start reading at http://open-osd.org and the links from that page.
(Being first, you get to debug my documentation.)
I'm patiently awaiting patches ;)
sincerely yours
Boaz
end of thread, other threads:[~2009-01-21 18:44 UTC | newest]
Thread overview: 32+ messages
[not found] <4947BFAA.4030208@panasas.com>
[not found] ` <4947CA5C.50104@panasas.com>
[not found] ` <20081229121423.efde9d06.akpm@linux-foundation.org>
2008-12-31 15:19 ` [PATCH 7/9] exofs: mkexofs Boaz Harrosh
2008-12-31 15:57 ` James Bottomley
2009-01-01 9:22 ` [osd-dev] " Benny Halevy
2009-01-01 9:54 ` Jeff Garzik
2009-01-01 14:23 ` Benny Halevy
2009-01-01 14:28 ` Matthew Wilcox
2009-01-01 18:12 ` Jörn Engel
2009-01-01 23:26 ` J. Bruce Fields
2009-01-02 7:14 ` Benny Halevy
2009-01-04 15:20 ` Boaz Harrosh
2009-01-04 15:38 ` Christoph Hellwig
2009-01-12 18:12 ` James Bottomley
2009-01-12 19:23 ` Jeff Garzik
2009-01-12 19:56 ` James Bottomley
2009-01-12 20:22 ` Jeff Garzik
2009-01-12 23:25 ` James Bottomley
2009-01-13 13:03 ` [osd-dev] " Benny Halevy
2009-01-13 13:24 ` Jeff Garzik
2009-01-13 13:32 ` Benny Halevy
2009-01-13 13:44 ` Jeff Garzik
2009-01-13 14:03 ` Alan Cox
2009-01-13 14:17 ` Jeff Garzik
2009-01-13 16:14 ` Alan Cox
2009-01-13 17:21 ` Boaz Harrosh
2009-01-21 18:13 ` Jeff Garzik
2009-01-21 18:44 ` Boaz Harrosh
2009-01-12 22:48 ` Jamie Lokier
2009-01-06 8:40 ` Andreas Dilger
2008-12-31 19:25 ` Andrew Morton
2009-01-01 13:33 ` Boaz Harrosh
2009-01-02 22:46 ` James Bottomley
2009-01-04 8:59 ` Boaz Harrosh