All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Spelic <spelic@shiftmail.org>
Cc: xfs@oss.sgi.com, linux-ext4@vger.kernel.org,
	device-mapper development <dm-devel@redhat.com>
Subject: Re: Ext4 and xfs problems in dm-thin on allocation and discard
Date: Thu, 21 Jun 2012 08:53:27 +1000	[thread overview]
Message-ID: <20120620225327.GL30705@dastard> (raw)
In-Reply-To: <4FE1BDF3.4080702@shiftmail.org>

On Wed, Jun 20, 2012 at 02:11:31PM +0200, Spelic wrote:
> Ok guys, I think I found the bug. One or more bugs.
> 
> 
> Pool has chunksize 1MB.
> In sysfs the thin volume has: queue/discard_max_bytes and
> queue/discard_granularity are 1048576 .
> And it has discard_alignment = 0, which based on sysfs-block
> documentation is correct (a less misleading name would have been
> discard_offset imho).
> Here is the blktrace from ext4 fstrim:
> ...
> 252,9   17      498     0.030466556   841  Q   D 19898368 + 2048 [fstrim]
> 252,9   17      499     0.030467501   841  Q   D 19900416 + 2048 [fstrim]
> 252,9   17      500     0.030468359   841  Q   D 19902464 + 2048 [fstrim]
> 252,9   17      501     0.030469313   841  Q   D 19904512 + 2048 [fstrim]
> 252,9   17      502     0.030470144   841  Q   D 19906560 + 2048 [fstrim]
> 252,9   17      503     0.030471381   841  Q   D 19908608 + 2048 [fstrim]
> 252,9   17      504     0.030472473   841  Q   D 19910656 + 2048 [fstrim]
> 252,9   17      505     0.030473504   841  Q   D 19912704 + 2048 [fstrim]
> 252,9   17      506     0.030474561   841  Q   D 19914752 + 2048 [fstrim]
> 252,9   17      507     0.030475571   841  Q   D 19916800 + 2048 [fstrim]
> 252,9   17      508     0.030476423   841  Q   D 19918848 + 2048 [fstrim]
> 252,9   17      509     0.030477341   841  Q   D 19920896 + 2048 [fstrim]
> 252,9   17      510     0.034299630   841  Q   D 19922944 + 2048 [fstrim]
> 252,9   17      511     0.034306880   841  Q   D 19924992 + 2048 [fstrim]
> 252,9   17      512     0.034307955   841  Q   D 19927040 + 2048 [fstrim]
> 252,9   17      513     0.034308928   841  Q   D 19929088 + 2048 [fstrim]
> 252,9   17      514     0.034309945   841  Q   D 19931136 + 2048 [fstrim]
> 252,9   17      515     0.034311007   841  Q   D 19933184 + 2048 [fstrim]
> 252,9   17      516     0.034312008   841  Q   D 19935232 + 2048 [fstrim]
> 252,9   17      517     0.034313122   841  Q   D 19937280 + 2048 [fstrim]
> 252,9   17      518     0.034314013   841  Q   D 19939328 + 2048 [fstrim]
> 252,9   17      519     0.034314940   841  Q   D 19941376 + 2048 [fstrim]
> 252,9   17      520     0.034315835   841  Q   D 19943424 + 2048 [fstrim]
> 252,9   17      521     0.034316662   841  Q   D 19945472 + 2048 [fstrim]
> 252,9   17      522     0.034317547   841  Q   D 19947520 + 2048 [fstrim]
> ...
> 
> Here is the blktrace from xfs fstrim:
> 252,12  16        1     0.000000000   554  Q   D 96 + 2048 [fstrim]
> 252,12  16        2     0.000010149   554  Q   D 2144 + 2048 [fstrim]
> 252,12  16        3     0.000011349   554  Q   D 4192 + 2048 [fstrim]
> 252,12  16        4     0.000012584   554  Q   D 6240 + 2048 [fstrim]
> 252,12  16        5     0.000013685   554  Q   D 8288 + 2048 [fstrim]
> 252,12  16        6     0.000014660   554  Q   D 10336 + 2048 [fstrim]
> 252,12  16        7     0.000015707   554  Q   D 12384 + 2048 [fstrim]
> 252,12  16        8     0.000016692   554  Q   D 14432 + 2048 [fstrim]
> 252,12  16        9     0.000017594   554  Q   D 16480 + 2048 [fstrim]
> 252,12  16       10     0.000018539   554  Q   D 18528 + 2048 [fstrim]
> 252,12  16       11     0.000019434   554  Q   D 20576 + 2048 [fstrim]
> 252,12  16       12     0.000020879   554  Q   D 22624 + 2048 [fstrim]
> 252,12  16       13     0.000021856   554  Q   D 24672 + 2048 [fstrim]
> 252,12  16       14     0.000022786   554  Q   D 26720 + 2048 [fstrim]
> 252,12  16       15     0.000023699   554  Q   D 28768 + 2048 [fstrim]
> 252,12  16       16     0.000024672   554  Q   D 30816 + 2048 [fstrim]
> 252,12  16       17     0.000025467   554  Q   D 32864 + 2048 [fstrim]
> 252,12  16       18     0.000026374   554  Q   D 34912 + 2048 [fstrim]
> 252,12  16       19     0.000027194   554  Q   D 36960 + 2048 [fstrim]
> 252,12  16       20     0.000028137   554  Q   D 39008 + 2048 [fstrim]
> 252,12  16       21     0.000029524   554  Q   D 41056 + 2048 [fstrim]
> 252,12  16       22     0.000030479   554  Q   D 43104 + 2048 [fstrim]
> 252,12  16       23     0.000031306   554  Q   D 45152 + 2048 [fstrim]
> 252,12  16       24     0.000032134   554  Q   D 47200 + 2048 [fstrim]
> 252,12  16       25     0.000032964   554  Q   D 49248 + 2048 [fstrim]
> 252,12  16       26     0.000033794   554  Q   D 51296 + 2048 [fstrim]
> 
> 
> As you can see, while ext4 correctly aligns the discards to 1MB, xfs
> does not.

XFs just sends a large extent to blkdev_issue_discard(), and cares
nothing about discard alignment or granularity.

> It looks like an fstrim or xfs bug: they don't look at
> discard_alignment (=0 ... a less misleading name would be
> discard_offset imho) + discard_granularity (=1MB) and they don't
> base alignments on those.

It looks like blkdev_issue_discard() has reduced each discard to
bios of a single "granule" (1MB), and not aligned them, hence they
are ignore by dm-thinp.

what are the discard parameters exposed by dm-thinp in
/sys/block/<thinp-blkdev>/queue/discard*

It looks to me that dmthinp might be setting discard_max_bytes to
1MB rather than discard_granularity. Looking at dm-thin.c:

static void set_discard_limits(struct pool *pool, struct queue_limits *limits)
{
        /*
         * FIXME: these limits may be incompatible with the pool's data device
         */
        limits->max_discard_sectors = pool->sectors_per_block;

        /*
         * This is just a hint, and not enforced.  We have to cope with
         * bios that overlap 2 blocks.
         */
        limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
        limits->discard_zeroes_data = pool->pf.zero_new_blocks;
}


Yes - discard_max_bytes == discard_granularity, and so
blkdev_issue_discard fails to align the request properly. As it is,
setting discard_max_bytes to the thinp block size is silly - it
means you'll never get range requests, and we sent a discard for
every single block in a range rather than having the thinp code
iterate over a range itself.

i.e. this is not a filesystem bug that is causing the problem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

WARNING: multiple messages have this Message-ID (diff)
From: Dave Chinner <david@fromorbit.com>
To: Spelic <spelic@shiftmail.org>
Cc: device-mapper development <dm-devel@redhat.com>,
	linux-ext4@vger.kernel.org, xfs@oss.sgi.com
Subject: Re: Ext4 and xfs problems in dm-thin on allocation and discard
Date: Thu, 21 Jun 2012 08:53:27 +1000	[thread overview]
Message-ID: <20120620225327.GL30705@dastard> (raw)
In-Reply-To: <4FE1BDF3.4080702@shiftmail.org>

On Wed, Jun 20, 2012 at 02:11:31PM +0200, Spelic wrote:
> Ok guys, I think I found the bug. One or more bugs.
> 
> 
> Pool has chunksize 1MB.
> In sysfs the thin volume has: queue/discard_max_bytes and
> queue/discard_granularity are 1048576 .
> And it has discard_alignment = 0, which based on sysfs-block
> documentation is correct (a less misleading name would have been
> discard_offset imho).
> Here is the blktrace from ext4 fstrim:
> ...
> 252,9   17      498     0.030466556   841  Q   D 19898368 + 2048 [fstrim]
> 252,9   17      499     0.030467501   841  Q   D 19900416 + 2048 [fstrim]
> 252,9   17      500     0.030468359   841  Q   D 19902464 + 2048 [fstrim]
> 252,9   17      501     0.030469313   841  Q   D 19904512 + 2048 [fstrim]
> 252,9   17      502     0.030470144   841  Q   D 19906560 + 2048 [fstrim]
> 252,9   17      503     0.030471381   841  Q   D 19908608 + 2048 [fstrim]
> 252,9   17      504     0.030472473   841  Q   D 19910656 + 2048 [fstrim]
> 252,9   17      505     0.030473504   841  Q   D 19912704 + 2048 [fstrim]
> 252,9   17      506     0.030474561   841  Q   D 19914752 + 2048 [fstrim]
> 252,9   17      507     0.030475571   841  Q   D 19916800 + 2048 [fstrim]
> 252,9   17      508     0.030476423   841  Q   D 19918848 + 2048 [fstrim]
> 252,9   17      509     0.030477341   841  Q   D 19920896 + 2048 [fstrim]
> 252,9   17      510     0.034299630   841  Q   D 19922944 + 2048 [fstrim]
> 252,9   17      511     0.034306880   841  Q   D 19924992 + 2048 [fstrim]
> 252,9   17      512     0.034307955   841  Q   D 19927040 + 2048 [fstrim]
> 252,9   17      513     0.034308928   841  Q   D 19929088 + 2048 [fstrim]
> 252,9   17      514     0.034309945   841  Q   D 19931136 + 2048 [fstrim]
> 252,9   17      515     0.034311007   841  Q   D 19933184 + 2048 [fstrim]
> 252,9   17      516     0.034312008   841  Q   D 19935232 + 2048 [fstrim]
> 252,9   17      517     0.034313122   841  Q   D 19937280 + 2048 [fstrim]
> 252,9   17      518     0.034314013   841  Q   D 19939328 + 2048 [fstrim]
> 252,9   17      519     0.034314940   841  Q   D 19941376 + 2048 [fstrim]
> 252,9   17      520     0.034315835   841  Q   D 19943424 + 2048 [fstrim]
> 252,9   17      521     0.034316662   841  Q   D 19945472 + 2048 [fstrim]
> 252,9   17      522     0.034317547   841  Q   D 19947520 + 2048 [fstrim]
> ...
> 
> Here is the blktrace from xfs fstrim:
> 252,12  16        1     0.000000000   554  Q   D 96 + 2048 [fstrim]
> 252,12  16        2     0.000010149   554  Q   D 2144 + 2048 [fstrim]
> 252,12  16        3     0.000011349   554  Q   D 4192 + 2048 [fstrim]
> 252,12  16        4     0.000012584   554  Q   D 6240 + 2048 [fstrim]
> 252,12  16        5     0.000013685   554  Q   D 8288 + 2048 [fstrim]
> 252,12  16        6     0.000014660   554  Q   D 10336 + 2048 [fstrim]
> 252,12  16        7     0.000015707   554  Q   D 12384 + 2048 [fstrim]
> 252,12  16        8     0.000016692   554  Q   D 14432 + 2048 [fstrim]
> 252,12  16        9     0.000017594   554  Q   D 16480 + 2048 [fstrim]
> 252,12  16       10     0.000018539   554  Q   D 18528 + 2048 [fstrim]
> 252,12  16       11     0.000019434   554  Q   D 20576 + 2048 [fstrim]
> 252,12  16       12     0.000020879   554  Q   D 22624 + 2048 [fstrim]
> 252,12  16       13     0.000021856   554  Q   D 24672 + 2048 [fstrim]
> 252,12  16       14     0.000022786   554  Q   D 26720 + 2048 [fstrim]
> 252,12  16       15     0.000023699   554  Q   D 28768 + 2048 [fstrim]
> 252,12  16       16     0.000024672   554  Q   D 30816 + 2048 [fstrim]
> 252,12  16       17     0.000025467   554  Q   D 32864 + 2048 [fstrim]
> 252,12  16       18     0.000026374   554  Q   D 34912 + 2048 [fstrim]
> 252,12  16       19     0.000027194   554  Q   D 36960 + 2048 [fstrim]
> 252,12  16       20     0.000028137   554  Q   D 39008 + 2048 [fstrim]
> 252,12  16       21     0.000029524   554  Q   D 41056 + 2048 [fstrim]
> 252,12  16       22     0.000030479   554  Q   D 43104 + 2048 [fstrim]
> 252,12  16       23     0.000031306   554  Q   D 45152 + 2048 [fstrim]
> 252,12  16       24     0.000032134   554  Q   D 47200 + 2048 [fstrim]
> 252,12  16       25     0.000032964   554  Q   D 49248 + 2048 [fstrim]
> 252,12  16       26     0.000033794   554  Q   D 51296 + 2048 [fstrim]
> 
> 
> As you can see, while ext4 correctly aligns the discards to 1MB, xfs
> does not.

XFs just sends a large extent to blkdev_issue_discard(), and cares
nothing about discard alignment or granularity.

> It looks like an fstrim or xfs bug: they don't look at
> discard_alignment (=0 ... a less misleading name would be
> discard_offset imho) + discard_granularity (=1MB) and they don't
> base alignments on those.

It looks like blkdev_issue_discard() has reduced each discard to
bios of a single "granule" (1MB), and not aligned them, hence they
are ignore by dm-thinp.

what are the discard parameters exposed by dm-thinp in
/sys/block/<thinp-blkdev>/queue/discard*

It looks to me that dmthinp might be setting discard_max_bytes to
1MB rather than discard_granularity. Looking at dm-thin.c:

static void set_discard_limits(struct pool *pool, struct queue_limits *limits)
{
        /*
         * FIXME: these limits may be incompatible with the pool's data device
         */
        limits->max_discard_sectors = pool->sectors_per_block;

        /*
         * This is just a hint, and not enforced.  We have to cope with
         * bios that overlap 2 blocks.
         */
        limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
        limits->discard_zeroes_data = pool->pf.zero_new_blocks;
}


Yes - discard_max_bytes == discard_granularity, and so
blkdev_issue_discard fails to align the request properly. As it is,
setting discard_max_bytes to the thinp block size is silly - it
means you'll never get range requests, and we sent a discard for
every single block in a range rather than having the thinp code
iterate over a range itself.

i.e. this is not a filesystem bug that is causing the problem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2012-06-20 22:53 UTC|newest]

Thread overview: 72+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-06-18 21:33 Ext4 and xfs problems in dm-thin on allocation and discard Spelic
2012-06-18 21:33 ` Spelic
2012-06-19  1:57 ` Dave Chinner
2012-06-19  1:57   ` Dave Chinner
2012-06-19  3:12   ` Mike Snitzer
2012-06-19  3:12     ` Mike Snitzer
2012-06-19  6:32     ` Lukáš Czerner
2012-06-19  6:32       ` Lukáš Czerner
2012-06-19 11:29       ` Spelic
2012-06-19 11:29         ` Spelic
2012-06-19 12:20         ` Lukáš Czerner
2012-06-19 12:20           ` Lukáš Czerner
2012-06-19 13:34         ` Mike Snitzer
2012-06-19 13:34           ` Mike Snitzer
2012-06-19 13:16       ` Mike Snitzer
2012-06-19 13:16         ` Mike Snitzer
2012-06-19 13:25         ` Lukáš Czerner
2012-06-19 13:25           ` Lukáš Czerner
2012-06-19 13:30           ` Mike Snitzer
2012-06-19 13:30             ` Mike Snitzer
2012-06-19 13:52             ` Spelic
2012-06-19 13:52               ` Spelic
2012-06-19 14:05               ` Eric Sandeen
2012-06-19 14:05                 ` Eric Sandeen
2012-06-19 14:44               ` Mike Snitzer
2012-06-19 14:44                 ` Mike Snitzer
2012-06-19 18:48                 ` Mike Snitzer
2012-06-19 18:48                   ` Mike Snitzer
2012-06-19 20:06                   ` Dave Chinner
2012-06-19 20:06                     ` Dave Chinner
2012-06-19 20:21                     ` Ted Ts'o
2012-06-19 20:21                       ` Ted Ts'o
2012-06-19 20:39                       ` Dave Chinner
2012-06-19 20:39                         ` Dave Chinner
2012-06-20  9:01                         ` Christoph Hellwig
2012-06-20  9:01                           ` Christoph Hellwig
2012-06-19 21:37                     ` Spelic
2012-06-19 21:37                       ` Spelic
2012-06-19 23:12                       ` Dave Chinner
2012-06-19 23:12                         ` Dave Chinner
2012-06-20 12:11   ` Spelic
2012-06-20 12:11     ` Spelic
2012-06-20 22:53     ` Dave Chinner [this message]
2012-06-20 22:53       ` Dave Chinner
2012-06-21 17:47       ` Mike Snitzer
2012-06-21 17:47         ` Mike Snitzer
2012-06-21 23:29         ` Dave Chinner
2012-06-21 23:29           ` Dave Chinner
2012-07-01 14:53         ` Paolo Bonzini
2012-07-01 14:53           ` Paolo Bonzini
2012-07-02 13:00           ` Mike Snitzer
2012-07-02 13:00             ` Mike Snitzer
2012-07-02 13:15             ` Paolo Bonzini
2012-07-02 13:15               ` Paolo Bonzini
2012-06-19 14:09 ` Lukáš Czerner
2012-06-19 14:09   ` Lukáš Czerner
2012-06-19 14:19   ` Ted Ts'o
2012-06-19 14:19     ` Ted Ts'o
2012-06-19 14:23     ` Eric Sandeen
2012-06-19 14:23       ` Eric Sandeen
2012-06-19 14:37     ` Lukáš Czerner
2012-06-19 14:37       ` Lukáš Czerner
2012-06-19 14:43     ` [dm-devel] " Alasdair G Kergon
2012-06-19 14:43       ` Alasdair G Kergon
2012-06-19 15:28       ` Mike Snitzer
2012-06-19 15:28         ` Mike Snitzer
2012-06-19 16:03         ` [dm-devel] " Alasdair G Kergon
2012-06-19 16:03           ` Alasdair G Kergon
2012-06-19 19:58         ` Ted Ts'o
2012-06-19 19:58           ` Ted Ts'o
2012-06-19 20:44           ` Mike Snitzer
2012-06-19 20:44             ` Mike Snitzer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120620225327.GL30705@dastard \
    --to=david@fromorbit.com \
    --cc=dm-devel@redhat.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=spelic@shiftmail.org \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.