* Re: big flash disks?
2008-06-01 18:42 big flash disks? Jamie Lokier
@ 2008-06-01 20:41 ` Josh Boyer
2008-06-02 5:59 ` Artem Bityutskiy
2008-06-02 7:28 ` Jörn Engel
2 siblings, 0 replies; 17+ messages in thread
From: Josh Boyer @ 2008-06-01 20:41 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-mtd
On Sun, 1 Jun 2008 19:42:39 +0100
Jamie Lokier <jamie@shareable.org> wrote:
> Some people developing newer flash filesystems (UBIFS, Logfs,
> FAT-over-UBI :-) and interested in flash filesystem performance might
> be interested in this slashdot comment:
>
> http://slashdot.org/comments.pl?sid=569439&cid=23618215
Relying on information from slashdot comments is generally considered
dumb. Though this one surprisingly has a grain of truth :).
> They're implying that UBIFS and Logfs aren't suitable for high
> performance writes and/or large flash, and don't work well with up and
> coming flash disks either.
UBIFS, Logfs, JFFS2, and Yaffs1/2 all rely directly on the MTD layer
(ok, Logfs doesn't _require_ it per se). That layer can't handle more
than 4GiB, so some of the newer flash _chips_ are even out of the
question.
As for the SSDs, well those aren't raw flash devices, so with the
exception of perhaps Logfs none of the filesystems are really going to
be comparable. These are really no different from CompactFlash cards
with regard to the Linux flash filesystem options. They simply don't
apply.
As for running them in embedded devices to manage raw flash, they are
likely quite good. JFFS2 has been around forever, and tends to be
fairly stable. UBIFS and Logfs are quite new, but I've heard good
things about both. I have no personal experience with Yaffs.
> Also that patents may get in the way.
They tend to do that. It's really about all they are good for.
> I've never heard of MFT before.
Nor I.
josh
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: big flash disks?
2008-06-01 18:42 big flash disks? Jamie Lokier
2008-06-01 20:41 ` Josh Boyer
@ 2008-06-02 5:59 ` Artem Bityutskiy
2008-06-02 8:23 ` Jörn Engel
2008-06-02 7:28 ` Jörn Engel
2 siblings, 1 reply; 17+ messages in thread
From: Artem Bityutskiy @ 2008-06-02 5:59 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-mtd
On Sun, 2008-06-01 at 19:42 +0100, Jamie Lokier wrote:
> Some people developing newer flash filesystems (UBIFS, Logfs,
> FAT-over-UBI :-) and interested in flash filesystem performance might
> be interested in this slashdot comment:
>
> http://slashdot.org/comments.pl?sid=569439&cid=23618215
>
> They're implying that UBIFS and Logfs aren't suitable for high
> performance writes and/or large flash, and don't work well with up and
> coming flash disks either.
>
> Also that patents may get in the way.
>
> I've never heard of MFT before.
People should understand that UBIFS is designed for embedded systems. It
is good for low-price devices where you have just bare flash which is
cheap. SSD is a completely different area and irrelevant to UBIFS. The
same applies to LogFS and YAFFS, IMO. Talking about using these
filesystems on SSD is just silly (again, IMHO).
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
* Re: big flash disks?
2008-06-02 5:59 ` Artem Bityutskiy
@ 2008-06-02 8:23 ` Jörn Engel
2008-06-02 10:43 ` Jamie Lokier
0 siblings, 1 reply; 17+ messages in thread
From: Jörn Engel @ 2008-06-02 8:23 UTC (permalink / raw)
To: Artem Bityutskiy; +Cc: linux-mtd, Jamie Lokier
On Mon, 2 June 2008 08:59:19 +0300, Artem Bityutskiy wrote:
> cheap. SSD is a completely different area and irrelevant to UBIFS. The
> same applies to LogFS and YAFFS, IMO. Talking about using these
> filesystems on SSD is just silly (again, IMHO).
While I don't want to change your opinion, people may misunderstand it
if I ignore this comment.
From day one, I've wanted to use logfs in my laptop. If it is useful
for embedded devices, that is nice and may have the benefit of paying my
bills. But the real prize I'm after is getting rid of hard disks and
having a decent filesystem on top of whatever flash I can attach to my
notebook.
At the moment, the flash one can sensibly attach to a notebook is an
SSD with SATA interface. It has a bunch of disadvantages and there is
no point in listing them all. But it is the reality and denying it
doesn't change that. And this reality is the reason why logfs needs
another format change and Linus doesn't want to have it merged yet.
Making it perform well on SSDs comes first. :)
Jörn
--
Joern's library part 4:
http://www.paulgraham.com/spam.html
* Re: big flash disks?
2008-06-02 8:23 ` Jörn Engel
@ 2008-06-02 10:43 ` Jamie Lokier
2008-06-02 11:55 ` Jörn Engel
0 siblings, 1 reply; 17+ messages in thread
From: Jamie Lokier @ 2008-06-02 10:43 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-mtd
Jörn Engel wrote:
> At the moment, the flash one can sensibly attach to a notebook is an
> SSD with SATA interface. It has a bunch of disadvantages and there is
> no point in listing them all. But it is the reality and denying it
> doesn't change that. And this reality is the reason why logfs needs
> another format change and Linus doesn't want to have it merged yet.
> Making it perform well on SSDs comes first. :)
I'm surprised.
What sort of format change does SSD require, relative to NOR/NAND flash?
For SSD I would like to see a filesystem that combines the best
characteristics of btrfs and logfs - there are enough similarities
between them.
-- Jamie
* Re: big flash disks?
2008-06-02 10:43 ` Jamie Lokier
@ 2008-06-02 11:55 ` Jörn Engel
2008-06-02 12:32 ` Jamie Lokier
0 siblings, 1 reply; 17+ messages in thread
From: Jörn Engel @ 2008-06-02 11:55 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-mtd
On Mon, 2 June 2008 11:43:30 +0100, Jamie Lokier wrote:
>
> I'm surprised.
> What sort of format change does SSD require, relative to NOR/NAND flash?
Flash allows one to do partial writes to blocks. SSDs generally don't.
Logfs currently does partial writes for atomic transactions, to make
creat(), unlink(), rename() and friends behave well. Depending on your
SSD a simple creat() can blow up to writing several megabytes on the
actual medium.
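The write amplification Jörn describes can be made concrete with a
rough back-of-the-envelope sketch (a toy model; the 1 MiB eraseblock
size is an assumption, not a property of any particular SSD):

```python
# Hypothetical illustration (not logfs code): how much data hits the
# medium when a device only accepts whole-eraseblock writes.
ERASEBLOCK = 1 << 20   # assumed 1 MiB SSD eraseblock
PAGE = 4096            # what the filesystem actually wants to update

def bytes_hit_medium(dirty_bytes, granularity):
    """Round a write up to the device's minimum write unit."""
    units = -(-dirty_bytes // granularity)   # ceiling division
    return units * granularity

# A creat() touching one 4 KiB block on raw flash (partial writes OK):
assert bytes_hit_medium(PAGE, PAGE) == 4096
# The same creat() on an eraseblock-granular SSD: 256x amplification.
assert bytes_hit_medium(PAGE, ERASEBLOCK) == 1 << 20
```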
I never claimed to actually like those suckers. ;)
> For SSD I would like to see a filesystem that combines the best
> characteristics of btrfs and logfs - there are enough similarities
> between them.
True. Right now I think it is a better idea to keep the two separate.
But in the future a combination would be rather useful.
Jörn
--
Fantasy is more important than knowledge. Knowledge is limited,
while fantasy embraces the whole world.
-- Albert Einstein
* Re: big flash disks?
2008-06-02 11:55 ` Jörn Engel
@ 2008-06-02 12:32 ` Jamie Lokier
2008-06-03 18:09 ` Jörn Engel
0 siblings, 1 reply; 17+ messages in thread
From: Jamie Lokier @ 2008-06-02 12:32 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-mtd
Jörn Engel wrote:
> Flash allows one to do partial writes to blocks. SSDs generally don't.
> Logfs currently does partial writes for atomic transactions, to make
> creat(), unlink(), rename() and friends behave well. Depending on your
> SSD a simple creat() can blow up to writing several megabytes on the
> actual medium.
It's a good argument for delaying writes, and committing only the
minimum necessary on fsync/fdatasync/sync_file_range.
According to the slashdot comment which started this thread :-) they
do 4k log writes to large SSDs - and then report a very high write IOP
rate for database applications. It's the high write rate which is
their selling point: price per GB is ridiculous. So I'm inclined to
believe they do actually get the claimed write rate under some
circumstances.
If they can do 4k writes, and you cannot, it sounds like the SSDs you
have used are very different to the SSDs they have used. Is that
right? If so, we need to keep an open mind about the different kinds
of SSD that are becoming available under that name.
-- Jamie
* Re: big flash disks?
2008-06-02 12:32 ` Jamie Lokier
@ 2008-06-03 18:09 ` Jörn Engel
2008-06-03 18:44 ` Jamie Lokier
0 siblings, 1 reply; 17+ messages in thread
From: Jörn Engel @ 2008-06-03 18:09 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-mtd
On Mon, 2 June 2008 13:32:18 +0100, Jamie Lokier wrote:
>
> If they can do 4k writes, and you cannot, it sounds like the SSDs you
> have used are very different to the SSDs they have used. Is that
> right?
It isn't. Their SSDs have shitty performance for 4k random writes.
That's the entire point of their product. They reorder the data,
turning random 4k writes into aligned eraseblock-sized writes. After
that reordering the performance goes way up. Iirc at least one SSD they
used must have 1MB erasesize to explain the performance boost.
Jörn
--
Audacity augments courage; hesitation, fear.
-- Publilius Syrus
* Re: big flash disks?
2008-06-03 18:09 ` Jörn Engel
@ 2008-06-03 18:44 ` Jamie Lokier
2008-06-04 6:25 ` Jörn Engel
0 siblings, 1 reply; 17+ messages in thread
From: Jamie Lokier @ 2008-06-03 18:44 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-mtd
Jörn Engel wrote:
> On Mon, 2 June 2008 13:32:18 +0100, Jamie Lokier wrote:
> >
> > If they can do 4k writes, and you cannot, it sounds like the SSDs you
> > have used are very different to the SSDs they have used. Is that
> > right?
>
> It isn't. Their SSDs have shitty performance for 4k random writes.
> That's the entire point of their product. They reorder the data,
> turning random 4k writes into aligned eraseblock-sized writes. After
> that reordering the performance goes way up. Iirc at least one SSD they
> used must have 1MB erasesize to explain the performance boost.
Yes, they reorder - he says as much, that traditional filesystems
perform very poorly.
But he quotes a high write IOP rate, which is sometimes taken to mean
a high rate of database commits (e.g. fsync). That can't be done with
eraseblock-sized writes.
If it's not high commit rate, then the quoted IOP rate is misleading
because you can do the same reordering thing with hard disks to get a
high write rate. (Albeit hard disks suffer from random reads more if
ordering writes disorders reads).
If you think it's just reordering, not committing each 4k write one
after the other quickly, I'll ask him about it.
-- Jamie
* Re: big flash disks?
2008-06-03 18:44 ` Jamie Lokier
@ 2008-06-04 6:25 ` Jörn Engel
0 siblings, 0 replies; 17+ messages in thread
From: Jörn Engel @ 2008-06-04 6:25 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-mtd
On Tue, 3 June 2008 19:44:29 +0100, Jamie Lokier wrote:
>
> But he quotes a high write IOP rate, which is sometimes taken to mean
> a high rate of database commits (e.g. fsync). That can't be done with
> eraseblock-sized writes.
I don't remember reading the words 'commit' or 'sync' in any of his
posts. ;)
> If it's not high commit rate, then the quoted IOP rate is misleading
> because you can do the same reordering thing with hard disks to get a
> high write rate. (Albeit hard disks suffer from random reads more if
> ordering writes disorders reads).
>
> If you think it's just reordering, not committing each 4k write one
> after the other quickly, I'll ask him about it.
At least on some of the cheaper SSDs I believe the only way to commit a
write is by writing the whole eraseblock. And if you don't explicitly
request that, the SSD will do it for you. Either way you take the
performance hit.
Alternatively he could walk up the tree for commits and store all higher
layers in the same eraseblock. 16k would be enough for a commit on
disks up to 512G in size, 20k up to 256T. On real flash that simply
won't fly, as GC will cause a deadlock sooner or later. With an SSD you
can still get away with it, as 4k writes without erases are slow, but at
least possible.
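Jörn's 16k/20k numbers check out if one assumes 4 KiB blocks holding
512 block pointers each (8 bytes per pointer) - a hypothetical sketch
of the arithmetic, not logfs's actual on-disk layout:

```python
# Assumed geometry: 4 KiB blocks, 512 pointers per indirect block.
BLOCK = 4096
PTRS = 512

def commit_cost(levels):
    """Bytes written per commit: the data block plus one indirect
    block per tree level, all landing in the same eraseblock."""
    return (1 + levels) * BLOCK

def coverage(levels):
    """Maximum device size addressable with this many indirect levels."""
    return BLOCK * PTRS ** levels

# Three indirect levels: 4 KiB * 512^3 = 512 GiB, commit costs 16 KiB.
assert coverage(3) == 512 * 2**30 and commit_cost(3) == 16 * 1024
# Four indirect levels: 256 TiB, commit costs 20 KiB.
assert coverage(4) == 256 * 2**40 and commit_cost(4) == 20 * 1024
```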
Jörn
--
Sometimes, asking the right question is already the answer.
-- Unknown
* Re: big flash disks?
2008-06-01 18:42 big flash disks? Jamie Lokier
2008-06-01 20:41 ` Josh Boyer
2008-06-02 5:59 ` Artem Bityutskiy
@ 2008-06-02 7:28 ` Jörn Engel
2008-06-02 10:41 ` Jamie Lokier
2 siblings, 1 reply; 17+ messages in thread
From: Jörn Engel @ 2008-06-02 7:28 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-mtd
On Sun, 1 June 2008 19:42:39 +0100, Jamie Lokier wrote:
>
> Some people developing newer flash filesystems (UBIFS, Logfs,
> FAT-over-UBI :-) and interested in flash filesystem performance might
> be interested in this slashdot comment:
>
> http://slashdot.org/comments.pl?sid=569439&cid=23618215
>
> They're implying that UBIFS and Logfs aren't suitable for high
> performance writes and/or large flash, and don't work well with up and
> coming flash disks either.
>
> Also that patents may get in the way.
He has some good points, but also happens to argue in favor of his
company and against their competition. To be taken with a grain of
salt.
> I've never heard of MFT before.
Basically they create a log-structured block device. For a while I've
been thinking of doing the same, essentially strip logfs down to a
single file, which gets a block device interface. Removing all the
filesystem complexities (atomic create/unlink/rename, interactions with
vfs and mm, etc.) makes the project a _lot_ simpler. I'm not surprised
they have a usable product already.
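The log-structured block device described here can be sketched in a few
lines (a toy model with invented names, ignoring garbage collection and
wear levelling):

```python
# Hedged sketch: logical writes append sequentially to a log; a mapping
# table remembers where each logical block currently lives.
class LogStructuredDev:
    def __init__(self):
        self.log = []    # append-only list of (lba, data) records
        self.map = {}    # lba -> index of the newest record in self.log

    def write(self, lba, data):
        self.map[lba] = len(self.log)   # newest copy wins
        self.log.append((lba, data))    # always sequential on the medium

    def read(self, lba):
        idx = self.map.get(lba)
        return None if idx is None else self.log[idx][1]

dev = LogStructuredDev()
dev.write(7, b"old")
dev.write(7, b"new")         # an overwrite appends, never seeks
assert dev.read(7) == b"new"
assert len(dev.log) == 2     # both copies sit on the medium until GC
```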
I decided against it, because I don't believe it to be the best approach
long-term. One of the disadvantages is that block devices have
relatively little knowledge about caching constraints. A filesystem can
easily have gigabytes of dirty data around, where a block device is
expected to return success for every single write in a reasonable
timeframe, usually measured in milliseconds.
Jörn
--
Rules of Optimization:
Rule 1: Don't do it.
Rule 2 (for experts only): Don't do it yet.
-- M.A. Jackson
* Re: big flash disks?
2008-06-02 7:28 ` Jörn Engel
@ 2008-06-02 10:41 ` Jamie Lokier
2008-06-02 11:43 ` Jörn Engel
0 siblings, 1 reply; 17+ messages in thread
From: Jamie Lokier @ 2008-06-02 10:41 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-mtd
Jörn Engel wrote:
> Basically they create a log-structured block device. For a while I've
> been thinking of doing the same, essentially strip logfs down to a
> single file, which gets a block device interface. Removing all the
> filesystem complexities (atomic create/unlink/rename, interactions with
> vfs and mm, etc.) makes the project a _lot_ simpler. I'm not surprised
> they have a usable product already.
>
> I decided against it, because I don't believe it to be the best approach
> long-term. One of the disadvantages is that block devices have
> relatively little knowledge about caching constraints. A filesystem can
> easily have gigabytes of dirty data around, where a block device is
> expected to return success for every single write in a reasonable
> timeframe, usually measured in milliseconds.
Won't you get essentially the same by creating a single file on LogFS,
and using it for a loopback mount?
Sure, it's more complicated under the hood than a stripped-down LogFS,
but will it behave and perform similarly?
-- Jamie
* Re: big flash disks?
2008-06-02 10:41 ` Jamie Lokier
@ 2008-06-02 11:43 ` Jörn Engel
2008-06-02 12:48 ` Jamie Lokier
0 siblings, 1 reply; 17+ messages in thread
From: Jörn Engel @ 2008-06-02 11:43 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-mtd
On Mon, 2 June 2008 11:41:06 +0100, Jamie Lokier wrote:
>
> Won't you get essentially the same by creating a single file on LogFS,
> and using it for a loopback mount?
In a broad sense, yes. Drawbacks of this setup are the usual ones of
loop plus a deeper tree for logfs. Instead of having a single 'file'
with indirect blocks, you also have the inode file with indirect blocks.
So for every sync, another couple of writes are necessary that don't
give you any gains in such a setup.
> Sure, it's more complicated under the hood than a stripped-down LogFS,
> but will it behave and perform similarly?
With plenty of memory and sync being a sufficiently rare event, it
might.
Jörn
--
Invincibility is in oneself, vulnerability is in the opponent.
-- Sun Tzu
* Re: big flash disks?
2008-06-02 11:43 ` Jörn Engel
@ 2008-06-02 12:48 ` Jamie Lokier
2008-06-03 18:12 ` Jörn Engel
0 siblings, 1 reply; 17+ messages in thread
From: Jamie Lokier @ 2008-06-02 12:48 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-mtd
Jörn Engel wrote:
> On Mon, 2 June 2008 11:41:06 +0100, Jamie Lokier wrote:
> >
> > Won't you get essentially the same by creating a single file on LogFS,
> > and using it for a loopback mount?
>
> In a broad sense, yes. Drawbacks of this setup are the usual ones of
> loop plus a deeper tree for logfs. Instead of having a single 'file'
> with indirect blocks, you also have the inode file with indirect blocks.
> So for every sync, another couple of writes are necessary that don't
> give you any gains in such a setup.
Oh. I've been thinking a lot about log-structured trees (or
tree-structured logs :-) lately, so I tend to assume the tree depth
isn't important, when nodes close to the root are static.
Did you know you can structure the tree such that additional depth
doesn't add many/any additional writes on sync?
The basic idea is for a pointer in a tree node to point not to one
child, but to a small set of potential children. The child-set is a
journal in the jffs2 sense. When reading, you read each block of the
child-set, and pick the most recent. This slows down reading, but
reduces the amount of writing. You still read in O(log tree_size)
blocks, and since most of the extra reads are hot-cache internal tree
blocks, the amount of extra reading is quite small. Child-sets can
overlap to reduce storage duplication, at cost of more operations -
it's a heuristic balancing act. Child-sets are not used for all tree
nodes, especially data. They can be invoked and destroyed dynamically
using heuristics to detect some parts of the tree undergoing lots of
write+sync sequences and others being coalescable writes or not
written.
Put simply: combine logging with phase trees, and you won't have to
replace the whole leaf-to-root path on every sync. Then extra static
depth at the root is free.
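The child-set read path described above might look roughly like this (a
hypothetical sketch; the block format and all names are invented):

```python
# A parent pointer addresses a small set of candidate block locations.
# Readers scan the set and keep the copy with the highest sequence
# number; a sync only had to rewrite one block of the set.
def read_child(medium, child_set):
    """medium: dict mapping address -> (sequence_number, payload);
    child_set: list of candidate addresses for one logical child."""
    newest = None
    for addr in child_set:
        blk = medium.get(addr)      # unwritten slots are simply absent
        if blk is not None and (newest is None or blk[0] > newest[0]):
            newest = blk
    return None if newest is None else newest[1]

medium = {10: (3, b"stale"), 11: (7, b"current")}
assert read_child(medium, [10, 11, 12]) == b"current"
```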
-- Jamie
* Re: big flash disks?
2008-06-02 12:48 ` Jamie Lokier
@ 2008-06-03 18:12 ` Jörn Engel
2008-06-03 18:56 ` Jamie Lokier
0 siblings, 1 reply; 17+ messages in thread
From: Jörn Engel @ 2008-06-03 18:12 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-mtd
On Mon, 2 June 2008 13:48:22 +0100, Jamie Lokier wrote:
>
> The basic idea is for a pointer in a tree node to point not to one
> child, but to a small set of potential children. The child-set is a
> journal in the jffs2 sense. When reading, you read each block of the
> child-set, and pick the most recent. This slows down reading, but
> reduces the amount of writing. You still read in O(log tree_size)
> blocks, and since most of the extra reads are hot-cache internal tree
> blocks, the amount of extra reading is quite small. Child-sets can
> overlap to reduce storage duplication, at cost of more operations -
> it's a heuristic balancing act. Child-sets are not used for all tree
> nodes, especially data. They can be invoked and destroyed dynamically
> using heuristics to detect some parts of the tree undergoing lots of
> write+sync sequences and others being coalescable writes or not
> written.
This is actually a good explanation of the logfs journal. :)
Jörn
--
All art is but imitation of nature.
-- Lucius Annaeus Seneca
* Re: big flash disks?
2008-06-03 18:12 ` Jörn Engel
@ 2008-06-03 18:56 ` Jamie Lokier
2008-06-04 6:18 ` Jörn Engel
0 siblings, 1 reply; 17+ messages in thread
From: Jamie Lokier @ 2008-06-03 18:56 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-mtd
Jörn Engel wrote:
> On Mon, 2 June 2008 13:48:22 +0100, Jamie Lokier wrote:
> >
> > The basic idea is for a pointer in a tree node to point not to one
> > child, but to a small set of potential children. The child-set is a
> > journal in the jffs2 sense. When reading, you read each block of the
> > child-set, and pick the most recent. This slows down reading, but
> > reduces the amount of writing. You still read in O(log tree_size)
> > blocks, and since most of the extra reads are hot-cache internal tree
> > blocks, the amount of extra reading is quite small. Child-sets can
> > overlap to reduce storage duplication, at cost of more operations -
> > it's a heuristic balancing act. Child-sets are not used for all tree
> > nodes, especially data. They can be invoked and destroyed dynamically
> > using heuristics to detect some parts of the tree undergoing lots of
> > write+sync sequences and others being coalescable writes or not
> > written.
>
> This is actually a good explanation of the logfs journal. :)
Oh. Great, cheers. Great minds think alike :-)
If that's the logfs journal - why would extra static tree depth near
the root add any write-commit overhead as you said in the grandparent
post? :-)
(Btw, I thought a difference between logfs and ubifs is the latter
does async writes? Or do they both?)
-- Jamie
* Re: big flash disks?
2008-06-03 18:56 ` Jamie Lokier
@ 2008-06-04 6:18 ` Jörn Engel
0 siblings, 0 replies; 17+ messages in thread
From: Jörn Engel @ 2008-06-04 6:18 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-mtd
On Tue, 3 June 2008 19:56:59 +0100, Jamie Lokier wrote:
>
> If that's the logfs journal - why would extra static tree depth near
> the root add any write-commit overhead as you said in the grandparent
> post? :-)
Because I only do it for the root of the tree, as of today.
> (Btw, I thought a difference between logfs and ubifs is the latter
> does async writes? Or do they both?)
Logfs does async writes for metadata, not for payload data. That is
enough to get close to jffs2 write performance. For long streaming
writes it should perform identically to ubifs; short bursts look faster
in ubifs, as the data only goes to cache, not to flash. Frequent
rewrites of the same data without sync in between are where ubifs
currently wins.
Jörn
--
It does not require a majority to prevail, but rather an irate,
tireless minority keen to set brush fires in people's minds.
-- Samuel Adams