Future Linux filesystems

* Future Linux filesystems
@ 2008-06-02 21:46 Thomas King
       [not found] ` <20080603065205.GA19533@infradead.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas King @ 2008-06-02 21:46 UTC
  To: linux-btrfs

Folks,

I am writing an article for Linux.com to answer Henry Newman's article at
http://www.enterprisestorageforum.com/sans/features/article.php/3749926
concerning Linux and massive filesystems. Is there someone here that can field
some questions about BTRFS?

Thanks!
Tom King

* Re: Future Linux filesystems
       [not found] ` <20080603065205.GA19533@infradead.org>
@ 2008-06-03 14:37   ` Thomas King
  2008-06-03 15:02     ` Joe Peterson
                       ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Thomas King @ 2008-06-03 14:37 UTC
  To: linux-btrfs

> All the issues he complains about actually are solved by XFS, and XFS actually
> does better in exactly these environments than either zfs on Solaris or JFS2
> on AIX.

I asked the author that question and he states XFS is actually a pretty good
answer to most of those issues but believes it still falls short where "the
metadata areas are not aligned with RAID strips and allocation units are FAR too
small but better than ext." Another detail he brought out was sending data and
metadata to different devices in those environments and referenced RT XFS.
Otherwise having them on the same device increases the possibility of corruption
and/or a longer filesystem check/repair. Will btrfs offer something like this in
the future?

Do y'all foresee btrfs being used in exabyte installations?
Does/Will btrfs have RAID awareness in that it will align "the
superblock and metadata to the RAID stripe"?
What is the largest block allocation available?
Will btrfs be T10 DIF/block protect aware?
I remember reading that CRFS relies on btrfs, but will btrfs support NFS,
specifically version 4.1?

Thanks!
Tom King

* Re: Future Linux filesystems
  2008-06-03 14:37   ` Thomas King
@ 2008-06-03 15:02     ` Joe Peterson
  2008-06-03 16:06       ` Martin K. Petersen
  2008-06-03 15:52     ` Evgeniy Polyakov
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Joe Peterson @ 2008-06-03 15:02 UTC
  To: Thomas King; +Cc: linux-btrfs

Thomas King wrote:
>> All the issues he complains about actually are solved by XFS, and XFS actually
>> does better in exactly these environments than either zfs on Solaris or JFS2
>> on AIX.
> 
> I asked the author that question and he states XFS is actually a pretty good
> answer to most of those issues but believes it still falls short where "the
> metadata areas are not aligned with RAID strips and allocation units are FAR too
> small but better than ext." Another detail he brought out was sending data and
> metadata to different devices in those environments and referenced RT XFS.
> Otherwise having them on the same device increases the possibility of corruption
> and/or a longer filesystem check/repair. Will btrfs offer something like this in
> the future?
> 
> Do y'all foresee btrfs being used in exabyte installations?
> Does/Will btrfs have RAID awareness in that it will align "the
> superblock and metadata to the RAID stripe"?
> What is the largest block allocation available?
> Will btrfs be T10 DIF/block protect aware?
> I remember reading that CRFS relies on btrfs, but will btrfs support NFS,
> specifically version 4.1?

You don't mention what I believe is the *key* issue (and I don't think
the author did either, but I skimmed his article): data integrity.  I'm
not talking about blatant failures or known need for an fsck, but rather
silent corruption.

Where I work, we are considering multi-petabyte scenarios, and with the
specs of current drives, we are talking hundreds of silent errors per
read of the volume of data - unacceptable.  With large filesystems (and
he's talking 100 PB, etc.), this is the #1 issue for me.
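
A back-of-the-envelope sketch of where numbers like that can come from.
The error rates used here (one bad bit per 1e14 or 1e15 bits read) are an
assumption chosen for illustration; the thread does not say which drive
specs were used, and the rate of truly silent errors may differ from
either figure.

```python
# Rough estimate of bad bits encountered in one full read of a volume.
# ASSUMPTION: the error rates below are illustrative, not from the thread.

def expected_errors(volume_bytes: int, bits_per_error: float) -> float:
    """Expected number of bad bits when every bit of the volume is read once."""
    return volume_bytes * 8 / bits_per_error

PB = 10**15  # decimal petabyte, in bytes

for volume_pb in (1, 10, 100):
    for exponent in (14, 15):
        n = expected_errors(volume_pb * PB, 10.0**exponent)
        print(f"{volume_pb:>3} PB at 1 error per 1e{exponent} bits: "
              f"~{n:,.0f} errors per full read")
```

Under these assumed rates the count grows linearly with volume size, so
anything in the multi-petabyte range already lands in the tens to
hundreds per pass.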

						-Joe

* Re: Future Linux filesystems
  2008-06-03 14:37   ` Thomas King
  2008-06-03 15:02     ` Joe Peterson
@ 2008-06-03 15:52     ` Evgeniy Polyakov
  2008-06-03 16:17       ` Miguel Sousa Filipe
  2008-06-04  2:14     ` Chris Mason
  2008-06-04  2:34     ` Dongjun Shin
  3 siblings, 1 reply; 12+ messages in thread
From: Evgeniy Polyakov @ 2008-06-03 15:52 UTC
  To: Thomas King; +Cc: linux-btrfs

Hi.

On Tue, Jun 03, 2008 at 09:37:27AM -0500, Thomas King (kingttx@tomslinux.homelinux.org) wrote:
> I asked the author that question and he states XFS is actually a pretty good
> answer to most of those issues but believes it still falls short where "the
> metadata areas are not aligned with RAID strips and allocation units are FAR too
> small but better than ext." Another detail he brought out was sending data and
> metadata to different devices in those environments and referenced RT XFS.
> Otherwise having them on the same device increases the possibility of corruption
> and/or a longer filesystem check/repair. Will btrfs offer something like this in
> the future?

Right now btrfs can be created on top of multiple devices.
AFAIK, there are no policies on how to distribute data and metadata among them.

> Do y'all foresee btrfs being used in exabyte installations?
> Does/Will btrfs have RAID awareness in that it will align "the
> superblock and metadata to the RAID stripe"?
> What is the largest block allocation available?
> Will btrfs be T10 DIF/block protect aware?
> I remember reading that CRFS relies on btrfs, but will btrfs support NFS,
> specifically version 4.1?

The original author does not believe in networked filesystems as a key
method of organizing large storage :)
The changes needed for a filesystem to be exported via NFS are quite
simple, so that should not be a problem.

-- 
	Evgeniy Polyakov

* Re: Future Linux filesystems
  2008-06-03 15:02     ` Joe Peterson
@ 2008-06-03 16:06       ` Martin K. Petersen
  2008-06-03 16:46         ` Joe Peterson
  0 siblings, 1 reply; 12+ messages in thread
From: Martin K. Petersen @ 2008-06-03 16:06 UTC
  To: Joe Peterson; +Cc: Thomas King, linux-btrfs

>>>>> "Joe" == Joe Peterson <lavajoe@gentoo.org> writes:

Joe> You don't mention what I believe is the *key* issue (and I don't
Joe> think the author did either, but I skimmed his article): data
Joe> integrity.  I'm not talking about blatant failures or known need
Joe> for an fsck, but rather silent corruption.

We're very concerned about data integrity.  With btrfs everything is
checksummed at the logical level.  This allows you to detect data
corruption, repair bad blocks using redundant, good copies, perform
data scrubbing, etc.
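
As a toy illustration of the detect-and-repair idea (not btrfs code): a
two-copy block store that keeps a checksum per logical block, verifies
it on every read, and rewrites a bad copy from a good one.  The class
and method names are made up for this sketch, and zlib.crc32 stands in
for whatever checksum the filesystem actually uses.

```python
import zlib

class MirroredStore:
    """Toy two-copy block store with one checksum per logical block."""

    def __init__(self):
        self.copies = [{}, {}]   # two "devices": block number -> bytes
        self.checksums = {}      # block number -> checksum of intended data

    def write(self, blockno, data):
        self.checksums[blockno] = zlib.crc32(data)
        for copy in self.copies:
            copy[blockno] = data

    def read(self, blockno):
        """Return verified data, repairing any copy that fails its checksum."""
        expected = self.checksums[blockno]
        good = None
        bad = []
        for copy in self.copies:
            data = copy[blockno]
            if zlib.crc32(data) == expected:
                if good is None:
                    good = data
            else:
                bad.append(copy)
        if good is None:
            raise IOError(f"block {blockno}: no copy matches its checksum")
        for copy in bad:
            copy[blockno] = good     # repair from the redundant good copy
        return good

    def scrub(self):
        """Read every block once so latent corruption is found and fixed."""
        for blockno in list(self.checksums):
            self.read(blockno)

store = MirroredStore()
store.write(7, b"hello")
store.copies[0][7] = b"hellp"        # simulate silent corruption on one device
print(store.read(7))                 # verified data from the other copy
print(store.copies[0][7])            # the bad copy has been rewritten
```

The point of the sketch is that the checksum is stored apart from the
data copy it protects, so a device returning wrong data without an
error is still caught.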

A related, but orthogonal data integrity measure is the T10 DIF
infrastructure that I am working on.  DIF enables protection at the
sector level and includes stuff like a data checksum and a locality
check which ensures that the sector ends up in the right place on disk.

If there is a mismatch the I/O will be rejected by either the HBA or the
storage device.  That allows us to catch a lot of the corruption
scenarios where we accidentally write bad stuff to disk.
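
A minimal sketch of the per-sector check, assuming 512-byte sectors and
the 8-byte protection tuple defined by the T10 specification (2-byte
guard CRC, 2-byte application tag, 4-byte reference tag carrying the low
32 bits of the target LBA).  The function names are hypothetical and
this is not the kernel's implementation; it only shows why both a
corrupted payload and a misdirected write are caught.

```python
import struct

def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def protect(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Append guard, application and reference tags to a 512-byte sector."""
    if len(sector) != 512:
        raise ValueError("sketch assumes 512-byte sectors")
    tags = struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)
    return sector + tags

def verify(protected: bytes, lba: int) -> bool:
    """True only if the data checksum and the locality check both pass."""
    sector, tags = protected[:512], protected[512:]
    guard, _app_tag, ref = struct.unpack(">HHI", tags)
    return guard == crc16_t10dif(sector) and ref == (lba & 0xFFFFFFFF)

payload = bytes(range(256)) * 2
wire = protect(payload, lba=1234)
print(verify(wire, lba=1234))                     # intact, right place
print(verify(wire, lba=1235))                     # would land on the wrong sector
print(verify(b"\xff" + wire[1:], lba=1234))       # payload damaged in flight
```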

Right now the DIF checksum is added at the block layer level.  Work is
in progress to move it up into the filesystems and from there into
user space.  Eventually we'd like to be able to generate the checksum
in the application and pass it along the I/O path all the way out to
the physical disk.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: Future Linux filesystems
  2008-06-03 15:52     ` Evgeniy Polyakov
@ 2008-06-03 16:17       ` Miguel Sousa Filipe
  0 siblings, 0 replies; 12+ messages in thread
From: Miguel Sousa Filipe @ 2008-06-03 16:17 UTC
  To: Evgeniy Polyakov; +Cc: Thomas King, linux-btrfs

Hi,

On Tue, Jun 3, 2008 at 4:52 PM, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> Hi.
>
> On Tue, Jun 03, 2008 at 09:37:27AM -0500, Thomas King (kingttx@tomslinux.homelinux.org) wrote:
>> I asked the author that question and he states XFS is actually a pretty good
>> answer to most of those issues but believes it still falls short where "the
>> metadata areas are not aligned with RAID strips and allocation units are FAR too
>> small but better than ext." Another detail he brought out was sending data and
>> metadata to different devices in those environments and referenced RT XFS.
>> Otherwise having them on the same device increases the possibility of corruption
>> and/or a longer filesystem check/repair. Will btrfs offer something like this in
>> the future?
>
> Right now btrfs can be created on top of multiple devices.
> AFAIK, there are no policies on how to distribute data and metadata among them.
>

But it does allow you to specify different replication/striping
policies for metadata and data.
For example: configure a raid0 with N drives, but mirror the metadata
across all of them.

>> Do y'all foresee btrfs being used in exabyte installations?
>> Does/Will btrfs have RAID awareness in that it will align "the
>> superblock and metadata to the RAID stripe"?

This is a feature that is intended to be provided in the future; it was
discussed in the #btrfs@freenode.org IRC channel.
There isn't code for this currently.



-- 
Miguel Sousa Filipe

* Re: Future Linux filesystems
  2008-06-03 16:06       ` Martin K. Petersen
@ 2008-06-03 16:46         ` Joe Peterson
  0 siblings, 0 replies; 12+ messages in thread
From: Joe Peterson @ 2008-06-03 16:46 UTC
  To: Martin K. Petersen; +Cc: Thomas King, linux-btrfs

Martin K. Petersen wrote:
> We're very concerned about data integrity.  With btrfs everything is
> checksummed at the logical level.  This allows you to detect data
> corruption, repair bad blocks using redundant, good copies, perform
> data scrubbing, etc.

That's the main reason I am interested in btrfs, actually.  :)

> A related, but orthogonal data integrity measure is the T10 DIF
> infrastructure that I am working on.  DIF enables protection at the
> sector level and includes stuff like a data checksum and a locality
> check which ensures that the sector ends up the right place on disk.

Great!  Really great to hear that this issue is being actively worked on.

> Right now the DIF checksum is added at the block layer level.  Work is
> in progress to move it up into the filesystems and from there into
> user space.  Eventually we'd like to be able to generate the checksum
> in the application and pass it along the I/O path all the way out to
> the physical disk.

Yep, end-to-end is a great idea.  Kudos to this and to btrfs!

					-Joe

* Re: Future Linux filesystems
  2008-06-03 14:37   ` Thomas King
  2008-06-03 15:02     ` Joe Peterson
  2008-06-03 15:52     ` Evgeniy Polyakov
@ 2008-06-04  2:14     ` Chris Mason
  2008-06-04 14:00       ` Thomas King
  2008-06-04  2:34     ` Dongjun Shin
  3 siblings, 1 reply; 12+ messages in thread
From: Chris Mason @ 2008-06-04  2:14 UTC
  To: Thomas King; +Cc: linux-btrfs

On Tue, Jun 03, 2008 at 09:37:27AM -0500, Thomas King wrote:
> > All the issues he complains about actually are solved by XFS, and XFS actually
> > does better in exactly these environments than either zfs on Solaris or JFS2
> > on AIX.
> 
> I asked the author that question and he states XFS is actually a pretty good
> answer to most of those issues but believes it still falls short where "the
> metadata areas are not aligned with RAID strips and allocation units are FAR too
> small but better than ext."

I think it would be best to let the XFS developers answer this part.
But, XFS is designed for and used in massive installations, and I think
it represents a scalability goal for Btrfs.

> Another detail he brought out was sending data and
> metadata to different devices in those environments and referenced RT XFS.
> Otherwise having them on the same device increases the possibility of corruption
> and/or a longer filesystem check/repair. Will btrfs offer something like this in
> the future?

Btrfs can duplicate metadata via the internal raid1 and raid10 code.  On
single spindles it will duplicate metadata as well.  This is different
from RT XFS which I do not understand well.

There is no code today in btrfs to force data and metadata to different
devices, but the disk format has the bits it needs to make that happen.
I think it is an oversimplification to say that splitting the two
between devices changes the chances of a corruption, or changes the time
a repair takes.

Btrfs does split data and metadata allocations, grouping metadata
together in large chunks on the drive.  This does make FS check/repair
faster by reducing seeks between metadata blocks.

> 
> Do y'all foresee btrfs being used in exabyte installations?

Yes

> Does/Will btrfs have RAID awareness in that it will align "the
> superblock and metadata to the RAID stripe"?

Today the superblock is not stripe aligned, but it will be in a future
release that supports superblock duplication.  At least, the
blocks that are frequently written will be stripe aligned.
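
To make "stripe aligned" concrete, a small sketch of the arithmetic.
The geometry (64 KiB per disk across 4 data disks) is a made-up example
and the helper names are hypothetical; none of this reflects how the
btrfs allocator is written.

```python
def align_up(offset: int, stripe_size: int) -> int:
    """Round offset up to the next multiple of stripe_size."""
    return (offset + stripe_size - 1) // stripe_size * stripe_size

def stripes_touched(offset: int, length: int, stripe_size: int) -> int:
    """Number of full-stripe regions a write of `length` bytes overlaps."""
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return last - first + 1

# Hypothetical array: 64 KiB per disk, 4 data disks -> 256 KiB full stripe.
STRIPE = 64 * 1024 * 4

unaligned = 4096                      # a block starting 4 KiB into a stripe
aligned = align_up(unaligned, STRIPE)

print(aligned)                                        # start of the next stripe
print(stripes_touched(unaligned, STRIPE, STRIPE))     # spans two stripes
print(stripes_touched(aligned, STRIPE, STRIPE))       # fits exactly in one
```

A frequently rewritten block that straddles a stripe boundary makes the
array update two stripes instead of one, which is the cost alignment is
meant to avoid.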

> What is the largest block allocation available?

2^64 bytes.  But, in COW filesystems massive extents have different
costs than they do in traditional filesystems.  It isn't always a good
idea to make a huge extent.

> Will btrfs be T10 DIF/block protect aware?

I work closely with Martin, and we'll leverage the T10 DIF code as much
as possible.

> I remember reading that CRFS relies on btrfs, but will btrfs support NFS,
> specifically version 4.1?
> 

We'll definitely support NFS.  It doesn't work today, but it will before
1.0.

-chris

* Re: Future Linux filesystems
  2008-06-03 14:37   ` Thomas King
                       ` (2 preceding siblings ...)
  2008-06-04  2:14     ` Chris Mason
@ 2008-06-04  2:34     ` Dongjun Shin
  3 siblings, 0 replies; 12+ messages in thread
From: Dongjun Shin @ 2008-06-04  2:34 UTC
  To: Thomas King; +Cc: linux-btrfs

On Tue, Jun 3, 2008 at 11:37 PM, Thomas King
<kingttx@tomslinux.homelinux.org> wrote:
>> All the issues he complains about actually are solved by XFS, and XFS actually
>> does better in exactly these environments than either zfs on Solaris or JFS2
>> on AIX.
>
> I asked the author that question and he states XFS is actually a pretty good
> answer to most of those issues but believes it still falls short where "the
> metadata areas are not aligned with RAID strips and allocation units are FAR too
> small but better than ext." Another detail he brought out was sending data and
> metadata to different devices in those environments and referenced RT XFS.
> Otherwise having them on the same device increases the possibility of corruption
> and/or a longer filesystem check/repair. Will btrfs offer something like this in
> the future?
>
> Do y'all foresee btrfs being used in exabyte installations?
> Does/Will btrfs have RAID awareness in that it will align "the
> superblock and metadata to the RAID stripe"?
> What is the largest block allocation available?
> Will btrfs be T10 DIF/block protect aware?
> I remember reading that CRFS relies on btrfs, but will btrfs support NFS,
> specifically version 4.1?
>

I also would like to comment that btrfs is ready for the future storage
- the solid state drive. Btrfs performs well on both HDD and SSD.

AFAIK, the ssd option of btrfs only affects the block allocation behavior.
However, under a hybrid combination of HDD and SSD with the multi-device
support of btrfs, there can be more interesting optimizations that utilize
the physical characteristics of each device.

-- 
Dongjun

* Re: Future Linux filesystems
  2008-06-04  2:14     ` Chris Mason
@ 2008-06-04 14:00       ` Thomas King
  0 siblings, 0 replies; 12+ messages in thread
From: Thomas King @ 2008-06-04 14:00 UTC
  To: Chris Mason; +Cc: linux-btrfs

> On Tue, Jun 03, 2008 at 09:37:27AM -0500, Thomas King wrote:
>> > All the issues he complains about actually are solved by XFS, and XFS
>> > actually does better in exactly these environments than either zfs on
>> > Solaris or JFS2 on AIX.
>>
>> I asked the author that question and he states XFS is actually a pretty good
>> answer to most of those issues but believes it still falls short where "the
>> metadata areas are not aligned with RAID strips and allocation units are FAR
>> too small but better than ext."
>
> I think it would be best to let the XFS developers answer this part.
> But, XFS is designed for and used in massive installations, and I think
> it represents a scalability goal for Btrfs.
>
>> Another detail he brought out was sending data and
>> metadata to different devices in those environments and referenced RT XFS.
>> Otherwise having them on the same device increases the possibility of
>> corruption and/or a longer filesystem check/repair. Will btrfs offer
>> something like this in the future?
>
> Btrfs can duplicate metadata via the internal raid1 and raid10 code.  On
> single spindles it will duplicate metadata as well.  This is different
> from RT XFS which I do not understand well.
>
> There is no code today in btrfs to force data and metadata to different
> devices, but the disk format has the bits it needs to make that happen.
> I think it is an oversimplification to say that splitting the two
> between devices changes the chances of a corruption, or changes the time
> a repair takes.
>
> Btrfs does split data and metadata allocations, grouping metadata
> together in large chunks on the drive.  This does make FS check/repair
> faster by reducing seeks between metadata blocks.
>
>>
>> Do y'all foresee btrfs being used in exabyte installations?
>
> Yes
>
>> Does/Will btrfs have RAID awareness in that it will align "the
>> superblock and metadata to the RAID stripe"?
>
> Today the superblock is not stripe aligned, but it will be in a future
> release that supports superblock duplication.  At least, the
> blocks that are frequently written will be stripe aligned.
>
>> What is the largest block allocation available?
>
> 2^64 bytes.  But, in COW filesystems massive extents have different
> costs than they do in traditional filesystems.  It isn't always a good
> idea to make a huge extent.
>
>> Will btrfs be T10 DIF/block protect aware?
>
> I work closely with Martin, and we'll leverage the T10 DIF code as much
> as possible.
>
>> I remember reading that CRFS relies on btrfs, but will btrfs support NFS,
>> specifically version 4.1?
>>
>
> We'll definitely support NFS.  It doesn't work today, but it will before
> 1.0.
>
> -chris
>
>
Chris,

Thanks a ton for answering all these questions. I've asked the XFS developers
what was discussed here and they gave some excellent info as well.

Enjoy your day!
Tom King

* Re: Future Linux filesystems
@ 2008-06-11  9:38 Tomasz Chmielewski
  2008-06-11 16:27 ` Zach Brown
  0 siblings, 1 reply; 12+ messages in thread
From: Tomasz Chmielewski @ 2008-06-11  9:38 UTC
  To: linux-btrfs

> I also would like to comment that btrfs is ready for the future storage
> - the solid state drive. Btrfs performs well on both HDD and SSD.

SSD is still very expensive when compared to traditional hard disks.

*If* btrfs supported compression, I would second your opinion that btrfs 
is (will be, when it's stable) ready for the future storage.


-- 
Tomasz Chmielewski
http://wpkg.org


* Re: Future Linux filesystems
  2008-06-11  9:38 Future Linux filesystems Tomasz Chmielewski
@ 2008-06-11 16:27 ` Zach Brown
  0 siblings, 0 replies; 12+ messages in thread
From: Zach Brown @ 2008-06-11 16:27 UTC
  To: Tomasz Chmielewski; +Cc: linux-btrfs


> SSD is still very expensive when compared to traditional hard disks.

When measured by GB/$, sure.

Many data centers, though, care more about (ops/sec) / ($ * power *
heat).  SSDs look much more compelling by that metric.

- z
