All of lore.kernel.org
 help / color / mirror / Atom feed
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Goffredo Baroncelli <kreijack@inwind.it>
Cc: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
	Chris Murphy <lists@colorremedies.com>,
	Christoph Anton Mitterer <calestyo@scientia.net>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Status of RAID5/6
Date: Wed, 4 Apr 2018 02:01:02 -0400	[thread overview]
Message-ID: <20180404060101.GK2446@hungrycats.org> (raw)
In-Reply-To: <cfae662a-d64d-6fd8-9ca3-9892fc349747@inwind.it>

[-- Attachment #1: Type: text/plain, Size: 4958 bytes --]

On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
> >> I have to point out that in any case the extent is physically
> >> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
> >> you want to write 128KB, the first half is written in the first disk,
> >> the other in the 2nd disk.  If you want to write 96kb, the first 64
> >> are written in the first disk, the last part in the 2nd, only on a
> >> different BG.
> > The "only on a different BG" part implies something expensive, either
> > a seek or a new erase page depending on the hardware.  Without that,
> > nearby logical blocks are nearby physical blocks as well.
> 
> In any case it happens on a different disk

No it doesn't.  The small-BG could be on the same disk(s) as the big-BG.

> >> So yes there is a fragmentation from a logical point of view; from a
> >> physical point of view the data is spread on the disks in any case.
> 
> > What matters is the extent-tree point of view.  There is (currently)
> > no fragmentation there, even for RAID5/6.  The extent tree is unaware
> > of RAID5/6 (to its peril).
> 
> Before you pointed out that the non-contiguous block written has
> an impact on performance. I am replaying  that the switching from a
> different BG happens at the stripe-disk boundary, so in any case the
> block is physically interrupted and switched to another disk

The difference is that the write is switched to a different local address
on the disk.

It's not "another" disk if it's a different BG.  Recall in this plan
there is a full-width BG that is on _every_ disk, which means every
small-width BG shares a disk with the full-width BG.  Every extent tail
write requires a seek on a minimum of two disks in the array for raid5,
three disks for raid6.  A tail that is strip-width minus one will hit
N - 1 disks twice in an N-disk array.

> However yes: from an extent-tree point of view there will be an increase
> of number extents, because the end of the writing is allocated to
> another BG (if the size is not stripe-boundary)
> 
> > If an application does a loop writing 68K then fsync(), the multiple-BG
> > solution adds two seeks to read every 68K.  That's expensive if sequential
> > read bandwidth is more scarce than free space.
> 
> Why you talk about an additional seeks? In any case (even without the
> additional BG) the read happens from another disks

See above:  not another disk, usually a different location on two or
more of the same disks.

> >> * c),d),e) are applied only for the tail of the extent, in case the
> > size is less than the stripe size.
> > 
> > It's only necessary to split an extent if there are no other writes
> > in the same transaction that could be combined with the extent tail
> > into a single RAID stripe.  As long as everything in the RAID stripe
> > belongs to a single transaction, there is no write hole
> 
> May be that a more "simpler" optimization would be close the transaction
> when the data reach the stripe boundary... But I suspect that it is
> not so simple to implement.

Transactions exist in btrfs to batch up writes into big contiguous extents
already.  The trick is to _not_ do that when one transaction ends and
the next begins, i.e. leave a space at the end of the partially-filled
stripe so that the next transaction begins in an empty stripe.

This does mean that there will only be extra seeks during transaction
commit and fsync()--which were already very seeky to begin with.  It's
not necessary to write a partial stripe when there are other extents to
combine.

So there will be double the amount of seeking, but depending on the
workload, it could double a very small percentage of writes.

> > Not for d.  Balance doesn't know how to get rid of unreachable blocks
> > in extents (it just moves the entire extent around) so after a balance
> > the writes would still be rounded up to the stripe size.  Balance would
> > never be able to free the rounded-up space.  That space would just be
> > gone until the file was overwritten, deleted, or defragged.
> 
> If balance is capable to move the extent, why not place one near the
> other during a balance ? The goal is not to limit the the writing of
> the end of a extent, but avoid writing the end of an extent without
> further data (e.g. the gap to the stripe has to be filled in the
> same transaction)

That's plan f (leave gaps in RAID stripes empty).  Balance will repack
short extents into RAID stripes nicely.

Plan d can't do that because plan d overallocates the extent so that
the extent fills the stripe (only some of the extent is used for data).
Small but important difference.

> BR
> G.Baroncelli
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

  reply	other threads:[~2018-04-04  6:02 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-21 16:50 Status of RAID5/6 Menion
2018-03-21 17:24 ` Liu Bo
2018-03-21 20:02   ` Christoph Anton Mitterer
2018-03-22 12:01     ` Austin S. Hemmelgarn
2018-03-29 21:50     ` Zygo Blaxell
2018-03-30  7:21       ` Menion
2018-03-31  4:53         ` Zygo Blaxell
2018-03-30 16:14       ` Goffredo Baroncelli
2018-03-31  5:03         ` Zygo Blaxell
2018-03-31  6:57           ` Goffredo Baroncelli
2018-03-31  7:43             ` Zygo Blaxell
2018-03-31  8:16               ` Goffredo Baroncelli
     [not found]                 ` <28a574db-0f74-b12c-ab5f-400205fd80c8@gmail.com>
2018-03-31 14:40                   ` Zygo Blaxell
2018-03-31 22:34             ` Chris Murphy
2018-04-01  3:45               ` Zygo Blaxell
2018-04-01 20:51                 ` Chris Murphy
2018-04-01 21:11                   ` Chris Murphy
2018-04-02  5:45                     ` Zygo Blaxell
2018-04-02 15:18                       ` Goffredo Baroncelli
2018-04-02 15:49                         ` Austin S. Hemmelgarn
2018-04-02 22:23                           ` Zygo Blaxell
2018-04-03  0:31                             ` Zygo Blaxell
2018-04-03 17:03                               ` Goffredo Baroncelli
2018-04-03 22:57                                 ` Zygo Blaxell
2018-04-04  5:15                                   ` Goffredo Baroncelli
2018-04-04  6:01                                     ` Zygo Blaxell [this message]
2018-04-04 21:31                                       ` Goffredo Baroncelli
2018-04-04 22:38                                         ` Zygo Blaxell
2018-04-04  3:08                                 ` Chris Murphy
2018-04-04  6:20                                   ` Zygo Blaxell
2018-03-21 20:27   ` Menion
2018-03-22 21:13   ` waxhead

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180404060101.GK2446@hungrycats.org \
    --to=ce3g8jdj@umail.furryterror.org \
    --cc=ahferroin7@gmail.com \
    --cc=calestyo@scientia.net \
    --cc=kreijack@inwind.it \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=lists@colorremedies.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.