From: Martin Steigerwald <Martin@lichtvoll.de>
To: linux-btrfs@vger.kernel.org
Subject: Re: btrfs support for efficient SSD operation (data blocks alignment)
Date: Fri, 10 Feb 2012 19:18:57 +0100 [thread overview]
Message-ID: <201202101918.57855.Martin@lichtvoll.de> (raw)
In-Reply-To: <jgui4j$th5$1@dough.gmane.org>
Hi Martin,

On Wednesday, 8 February 2012, Martin wrote:
> My understanding is that for x86 architecture systems, btrfs only
> allows a sector size of 4kB for a HDD/SSD. That is fine for the
> present HDDs assuming the partitions are aligned to a 4kB boundary for
> that device.
>
> However for SSDs...
>
> I'm using for example a 60GByte SSD that has:
>=20
> 8kB page size;
> 16kB logical to physical mapping chunk size;
> 2MB erase block size;
> 64MB cache.
>
> And the sector size reported to Linux 3.0 is the default 512 bytes!
>
>
> My first thought is to try formatting with a sector size of 16kB to
> align with the SSD logical mapping chunk size. This is to avoid SSD
> write amplification. Also, the data transfer performance for that
> device is near maximum for writes with a blocksize of 16kB and above.
> Yet, btrfs supports a 4kByte page/sector size only at present...

The thing is, as far as I know the better SSDs, and even the dumber
ones, have quite some intelligence in the firmware. And at least for me
it's not clear what the firmware of my Intel SSD 320 does all on its
own and whether any of my optimization attempts even matter.

So I am not sure whether thinking about a single write operation of,
say, 4 KB or 2 KB in isolation even makes sense. I bet that often
several processes write data at once, so there is a larger amount of
data to write in one go.

What is not clear to me is whether the SSD will combine several write
requests into a single mapping chunk or erase block, or place them into
the already erased space of an erase block. I would bet at least the
better SSDs do. So even when, from the OS point of view, in a
simplistic example, one write of 1 MB goes to LBA 40000 and one write
of 1 MB goes to LBA 80000, the SSD might still use just a single erase
block and combine the writes next to each other. As far as I
understand, SSDs do copy-on-write to spread writes evenly across erase
blocks. As I further understand, from a seek-time point of view the
exact location of a write request does not matter at all. So to me it
looks perfectly sane for an SSD firmware to combine writes as it sees
fit. And SSDs that carry capacitors, like the above-mentioned Intel
SSD, may even cache writes for a while to wait for further requests.
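
Just to illustrate what I mean by "combine the writes next to each
other", here is a toy Python sketch of such a log-structured mapping.
The 16 kB mapping chunk and 2 MB erase block are the numbers you quoted
for your SSD; everything else is made up for illustration, and real
firmware is of course far more involved:

# Toy model of a log-structured flash translation layer (FTL):
# every logical chunk that gets written is remapped to the next free
# slot in the currently open erase block, so writes to far-apart LBAs
# can end up physically next to each other.  Illustration only.

CHUNK = 16 * 1024                  # logical-to-physical mapping chunk
ERASE_BLOCK = 2 * 1024 * 1024      # erase block size
CHUNKS_PER_BLOCK = ERASE_BLOCK // CHUNK

class ToyFTL:
    def __init__(self):
        self.mapping = {}          # logical chunk index -> (block, slot)
        self.block = 0             # currently open erase block
        self.slot = 0              # next free chunk slot in that block

    def write(self, lba, length):
        # Copy-on-write: redirect each touched chunk to a fresh slot
        # instead of overwriting it in place.
        first = (lba * 512) // CHUNK
        last = (lba * 512 + length - 1) // CHUNK
        for chunk in range(first, last + 1):
            if self.slot == CHUNKS_PER_BLOCK:  # block full, open a new one
                self.block += 1
                self.slot = 0
            self.mapping[chunk] = (self.block, self.slot)
            self.slot += 1

ftl = ToyFTL()
ftl.write(40000, 1024 * 1024)      # 1 MB at LBA 40000
ftl.write(80000, 1024 * 1024)      # 1 MB at LBA 80000
print(sorted({blk for blk, _ in ftl.mapping.values()}))  # -> [0]

Both 1 MB writes end up in erase block 0, even though the LBAs are far
apart.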

The Wikipedia article on write amplification gives me a glimpse of the
complexity involved [1]. Yes, I set stripe-width on my Ext4 filesystem
as well, but frankly I am not even sure whether this has any positive
effect, except maybe sparing the SSD controller firmware some
reshuffling work.
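
To put a number on it: write amplification is simply flash bytes
written divided by host bytes written. A back-of-the-envelope example
in Python, again using the 16 kB mapping chunk you quoted and a plain
4 kB filesystem block (the numbers are only illustrative):

# Worst case for a small unaligned write: the firmware has to
# read-modify-write a whole 16 kB mapping chunk for a 4 kB host write.
host_write = 4 * 1024        # one 4 kB filesystem block from the host
mapping_chunk = 16 * 1024    # smallest unit the firmware remaps

flash_write = mapping_chunk  # the whole chunk gets rewritten
print(flash_write / host_write)   # -> 4.0, a write amplification of 4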

So from my current point of view, most of what you wrote IMHO is more
important for really dumb flash. As I understand it, some kernel
developers would really like to see exactly that, so that most of the
logic could be put into the kernel and be easily modifiable: JBOF -
just a bunch of flash cells with an interface to access them directly.
But for now, AFAIK, most consumer-grade SSDs just provide a SATA
interface and hide the internals. So an optimization for one kind or
one brand of SSD may not be suitable for another.

There are PCI Express models, but these probably aren't dumb either.
And then there is the idea of auto commit memory (ACM) by Fusion-IO,
which just makes a part of the virtual address space persistent.

So it's a question of where to put the intelligence. For current SSDs
it seems the intelligence really sits near the storage medium, and in
that case IMHO it makes sense to even reduce the intelligence on the
Linux side.

[1] http://en.wikipedia.org/wiki/Write_amplification
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7