Re: btrfs support for efficient SSD operation (data blocks alignment)


From: Martin Steigerwald <Martin@lichtvoll.de>
To: linux-btrfs@vger.kernel.org
Subject: Re: btrfs support for efficient SSD operation (data blocks alignment)
Date: Fri, 10 Feb 2012 19:18:57 +0100
Message-ID: <201202101918.57855.Martin@lichtvoll.de>
In-Reply-To: <jgui4j$th5$1@dough.gmane.org>

Hi Martin,

On Wednesday, 8 February 2012, Martin wrote:
> My understanding is that for x86 architecture systems, btrfs only
> allows a sector size of 4kB for a HDD/SSD. That is fine for the
> present HDDs assuming the partitions are aligned to a 4kB boundary for
> that device.
>
> However for SSDs...
>
> I'm using for example a 60GByte SSD that has:
>
>     8kB page size;
>     16kB logical to physical mapping chunk size;
>     2MB erase block size;
>     64MB cache.
>
> And the sector size reported to Linux 3.0 is the default 512 bytes!
>
>
> My first thought is to try formatting with a sector size of 16kB to
> align with the SSD logical mapping chunk size. This is to avoid SSD
> write amplification. Also, the data transfer performance for that
> device is near maximum for writes with a blocksize of 16kB and above.
> Yet, btrfs supports a 4kByte page/sector size only at present...

The thing is, as far as I know, the better SSDs and even the dumber ones
have quite some intelligence in the firmware. And at least for me it's
not clear what all the firmware of my Intel SSD 320 does on its own, and
whether any of my optimization attempts even matter.

So I am not sure whether thinking about one write operation of, say,
4 KB or 2 KB in isolation even makes sense. I bet that often several
processes write data at once, so there is more data to write.

What is not clear to me is whether the SSD will combine several write
requests into a single mapping chunk or erase block, or combine them
into the already erased space of an erase block. I would bet at least
the better SSDs do. So even when, from the OS point of view, in a
simplistic example, one write of 1 MB goes to LBA 40000 and one write of
1 MB to LBA 80000, the SSD might still use just a single erase block and
place the writes next to each other. As far as I understand, SSDs do COW
to spread writes evenly across erase blocks. As far as I further
understand, from a seek-time point of view the exact location where a
write request is put does not matter at all. So to me it looks perfectly
sane for an SSD firmware to combine writes as it sees fit. And SSDs that
carry capacitors, like the above-mentioned Intel SSD, may even cache
writes for a while to wait for further requests.
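
To make the simplistic example above concrete, here is a minimal Python
sketch of a purely hypothetical append-only mapping layer. It assumes
the geometry quoted above (512-byte LBAs, 8 kB pages, 2 MB erase blocks)
and simply places every incoming page at the next free physical page.
The class ToyFTL and its methods are made up for illustration; no real
firmware is known to work exactly like this.

```python
SECTOR = 512                     # bytes per LBA, as reported to Linux
PAGE = 8 * 1024                  # flash page size (from the quoted SSD)
ERASE_BLOCK = 2 * 1024 * 1024    # erase block size (from the quoted SSD)
PAGES_PER_BLOCK = ERASE_BLOCK // PAGE


class ToyFTL:
    """Hypothetical append-only flash translation layer (illustration only)."""

    def __init__(self):
        self.next_page = 0       # next free physical page (the log head)
        self.mapping = {}        # logical page number -> physical page number

    def write(self, lba, length):
        """Append every logical page touched by this write at the log head."""
        first = (lba * SECTOR) // PAGE
        last = (lba * SECTOR + length - 1) // PAGE
        for logical in range(first, last + 1):
            self.mapping[logical] = self.next_page
            self.next_page += 1

    def erase_blocks_used(self):
        """Return the set of erase blocks that hold live data."""
        return {phys // PAGES_PER_BLOCK for phys in self.mapping.values()}


ftl = ToyFTL()
ftl.write(lba=40000, length=1024 * 1024)   # 1 MB at LBA 40000
ftl.write(lba=80000, length=1024 * 1024)   # 1 MB at LBA 80000
print(sorted(ftl.erase_blocks_used()))
```

In this toy model the two writes are about 20 MB apart in logical
address space, yet together they fill exactly one 2 MB erase block,
because placement depends only on arrival order and not on the LBA.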

The article on write amplification on Wikipedia gives me a glimpse of
the complexity involved [1]. Yes, I set stripe-width on my Ext4
filesystem as well, but frankly I am not even sure whether this has any
positive effect, except maybe sparing the SSD controller firmware some
reshuffling work.
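
As a rough back-of-the-envelope illustration of what is at stake, the
following Python sketch computes a naive worst-case write amplification
factor. It assumes the device has to rewrite whole 16 kB mapping chunks
(the figure from the quoted SSD) and that writes start on a chunk
boundary; real firmware may well do better by combining writes. It also
shows how many 4 kB filesystem blocks fit into one 2 MB erase block,
which is the kind of number one might feed into stripe-width. Whether
the erase block size is the right value to align to is itself an
assumption. The function name is made up for this example.

```python
KIB = 1024
MAPPING_CHUNK = 16 * KIB      # logical-to-physical mapping chunk (quoted SSD)
ERASE_BLOCK = 2 * KIB * KIB   # erase block size (quoted SSD)
FS_BLOCK = 4 * KIB            # filesystem block size


def write_amplification(host_bytes, unit_bytes):
    """Naive worst case: flash bytes written per host byte when the device
    must rewrite whole units of unit_bytes (write assumed unit-aligned)."""
    units = -(-host_bytes // unit_bytes)   # ceiling division
    return units * unit_bytes / host_bytes


for size in (2 * KIB, 4 * KIB, 16 * KIB, 64 * KIB):
    factor = write_amplification(size, MAPPING_CHUNK)
    print(f"{size // KIB:3d} KB write -> factor {factor:.1f}")

# Number of filesystem blocks per erase block.
blocks_per_erase_block = ERASE_BLOCK // FS_BLOCK
print("filesystem blocks per erase block:", blocks_per_erase_block)
```

Under these assumptions an isolated 4 KB write costs a factor of 4, while
anything that is a multiple of 16 KB costs a factor of 1. This is only
the naive model; it says nothing about what a given firmware actually
does.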

So from my current point of view, most of what you wrote is IMHO more
important for really dumb flash, the kind that, as I understand it, some
kernel developers would really like to see, so that most of the logic
could be put into the kernel and be easily modifiable: JBOF, just a
bunch of flash cells with an interface to access them directly. But for
now, AFAIK, most consumer-grade SSDs just provide a SATA interface and
hide the internals. So an optimization for one kind or one brand of SSD
may not be suitable for another one.

There are PCI Express models, but these probably aren't dumb either. And
then there is the idea of auto commit memory (ACM) by Fusion-IO, which
just makes a part of the virtual address space persistent.

So it's a question of where to put the intelligence. For current SSDs it
seems the intelligence is really near the storage medium, and then IMHO
it makes sense to even reduce the intelligence on the Linux side.

[1] http://en.wikipedia.org/wiki/Write_amplification

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

