All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Snitzer <snitzer@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Spelic <spelic@shiftmail.org>,
	device-mapper development <dm-devel@redhat.com>,
	linux-ext4@vger.kernel.org, xfs@oss.sgi.com,
	Paolo Bonzini <pbonzini@redhat.com>,
	axboe@kernel.dk, hch@infradead.org
Subject: Re: Ext4 and xfs problems in dm-thin on allocation and discard
Date: Thu, 21 Jun 2012 13:47:43 -0400	[thread overview]
Message-ID: <20120621174742.GA27837@redhat.com> (raw)
In-Reply-To: <20120620225327.GL30705@dastard>

On Wed, Jun 20 2012 at  6:53pm -0400,
Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Jun 20, 2012 at 02:11:31PM +0200, Spelic wrote:
> > Ok guys, I think I found the bug. One or more bugs.
> > 
> > 
> > Pool has chunksize 1MB.
> > In sysfs the thin volume has: queue/discard_max_bytes and
> > queue/discard_granularity are 1048576 .
> > And it has discard_alignment = 0, which based on sysfs-block
> > documentation is correct (a less misleading name would have been
> > discard_offset imho).
> > Here is the blktrace from ext4 fstrim:
> > ...
> > 252,9   17      498     0.030466556   841  Q   D 19898368 + 2048 [fstrim]
> > 252,9   17      499     0.030467501   841  Q   D 19900416 + 2048 [fstrim]
> > 252,9   17      500     0.030468359   841  Q   D 19902464 + 2048 [fstrim]
> > 252,9   17      501     0.030469313   841  Q   D 19904512 + 2048 [fstrim]
> > 252,9   17      502     0.030470144   841  Q   D 19906560 + 2048 [fstrim]
> > 252,9   17      503     0.030471381   841  Q   D 19908608 + 2048 [fstrim]
> > 252,9   17      504     0.030472473   841  Q   D 19910656 + 2048 [fstrim]
> > 252,9   17      505     0.030473504   841  Q   D 19912704 + 2048 [fstrim]
> > 252,9   17      506     0.030474561   841  Q   D 19914752 + 2048 [fstrim]
> > 252,9   17      507     0.030475571   841  Q   D 19916800 + 2048 [fstrim]
> > 252,9   17      508     0.030476423   841  Q   D 19918848 + 2048 [fstrim]
> > 252,9   17      509     0.030477341   841  Q   D 19920896 + 2048 [fstrim]
> > 252,9   17      510     0.034299630   841  Q   D 19922944 + 2048 [fstrim]
> > 252,9   17      511     0.034306880   841  Q   D 19924992 + 2048 [fstrim]
> > 252,9   17      512     0.034307955   841  Q   D 19927040 + 2048 [fstrim]
> > 252,9   17      513     0.034308928   841  Q   D 19929088 + 2048 [fstrim]
> > 252,9   17      514     0.034309945   841  Q   D 19931136 + 2048 [fstrim]
> > 252,9   17      515     0.034311007   841  Q   D 19933184 + 2048 [fstrim]
> > 252,9   17      516     0.034312008   841  Q   D 19935232 + 2048 [fstrim]
> > 252,9   17      517     0.034313122   841  Q   D 19937280 + 2048 [fstrim]
> > 252,9   17      518     0.034314013   841  Q   D 19939328 + 2048 [fstrim]
> > 252,9   17      519     0.034314940   841  Q   D 19941376 + 2048 [fstrim]
> > 252,9   17      520     0.034315835   841  Q   D 19943424 + 2048 [fstrim]
> > 252,9   17      521     0.034316662   841  Q   D 19945472 + 2048 [fstrim]
> > 252,9   17      522     0.034317547   841  Q   D 19947520 + 2048 [fstrim]
> > ...
> > 
> > Here is the blktrace from xfs fstrim:
> > 252,12  16        1     0.000000000   554  Q   D 96 + 2048 [fstrim]
> > 252,12  16        2     0.000010149   554  Q   D 2144 + 2048 [fstrim]
> > 252,12  16        3     0.000011349   554  Q   D 4192 + 2048 [fstrim]
> > 252,12  16        4     0.000012584   554  Q   D 6240 + 2048 [fstrim]
> > 252,12  16        5     0.000013685   554  Q   D 8288 + 2048 [fstrim]
> > 252,12  16        6     0.000014660   554  Q   D 10336 + 2048 [fstrim]
> > 252,12  16        7     0.000015707   554  Q   D 12384 + 2048 [fstrim]
> > 252,12  16        8     0.000016692   554  Q   D 14432 + 2048 [fstrim]
> > 252,12  16        9     0.000017594   554  Q   D 16480 + 2048 [fstrim]
> > 252,12  16       10     0.000018539   554  Q   D 18528 + 2048 [fstrim]
> > 252,12  16       11     0.000019434   554  Q   D 20576 + 2048 [fstrim]
> > 252,12  16       12     0.000020879   554  Q   D 22624 + 2048 [fstrim]
> > 252,12  16       13     0.000021856   554  Q   D 24672 + 2048 [fstrim]
> > 252,12  16       14     0.000022786   554  Q   D 26720 + 2048 [fstrim]
> > 252,12  16       15     0.000023699   554  Q   D 28768 + 2048 [fstrim]
> > 252,12  16       16     0.000024672   554  Q   D 30816 + 2048 [fstrim]
> > 252,12  16       17     0.000025467   554  Q   D 32864 + 2048 [fstrim]
> > 252,12  16       18     0.000026374   554  Q   D 34912 + 2048 [fstrim]
> > 252,12  16       19     0.000027194   554  Q   D 36960 + 2048 [fstrim]
> > 252,12  16       20     0.000028137   554  Q   D 39008 + 2048 [fstrim]
> > 252,12  16       21     0.000029524   554  Q   D 41056 + 2048 [fstrim]
> > 252,12  16       22     0.000030479   554  Q   D 43104 + 2048 [fstrim]
> > 252,12  16       23     0.000031306   554  Q   D 45152 + 2048 [fstrim]
> > 252,12  16       24     0.000032134   554  Q   D 47200 + 2048 [fstrim]
> > 252,12  16       25     0.000032964   554  Q   D 49248 + 2048 [fstrim]
> > 252,12  16       26     0.000033794   554  Q   D 51296 + 2048 [fstrim]
> > 
> > 
> > As you can see, while ext4 correctly aligns the discards to 1MB, xfs
> > does not.
> 
> XFs just sends a large extent to blkdev_issue_discard(), and cares
> nothing about discard alignment or granularity.
> 
> > It looks like an fstrim or xfs bug: they don't look at
> > discard_alignment (=0 ... a less misleading name would be
> > discard_offset imho) + discard_granularity (=1MB) and they don't
> > base alignments on those.
> 
> It looks like blkdev_issue_discard() has reduced each discard to
> bios of a single "granule" (1MB), and not aligned them, hence they
> are ignore by dm-thinp.
> 
> what are the discard parameters exposed by dm-thinp in
> /sys/block/<thinp-blkdev>/queue/discard*
> 
> It looks to me that dmthinp might be setting discard_max_bytes to
> 1MB rather than discard_granularity. Looking at dm-thin.c:
> 
> static void set_discard_limits(struct pool *pool, struct queue_limits *limits)
> {
>         /*
>          * FIXME: these limits may be incompatible with the pool's data device
>          */
>         limits->max_discard_sectors = pool->sectors_per_block;
> 
>         /*
>          * This is just a hint, and not enforced.  We have to cope with
>          * bios that overlap 2 blocks.
>          */
>         limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
>         limits->discard_zeroes_data = pool->pf.zero_new_blocks;
> }
> 
> 
> Yes - discard_max_bytes == discard_granularity, and so
> blkdev_issue_discard fails to align the request properly. As it is,
> setting discard_max_bytes to the thinp block size is silly - it
> means you'll never get range requests, and we sent a discard for
> every single block in a range rather than having the thinp code
> iterate over a range itself.

So 2 different issues:
1) blkdev_issue_discard isn't properly aligning
2) thinp should accept larger discards (up to the stacked
   discard_max_bytes rather than setting an override)

> i.e. this is not a filesystem bug that is causing the problem....

Paolo Bonzini fixed blkdev_issue_discard to properly align some time
ago; unfortunately the patches slipped through the cracks (cc'ing Paolo,
Jens, and Christoph).

Here are references to Paolo's patches:
0/2 https://lkml.org/lkml/2012/3/14/323
1/2 https://lkml.org/lkml/2012/3/14/324
2/2 https://lkml.org/lkml/2012/3/14/325

Patch 2/2 specifically addresses the case where:
 discard_max_bytes == discard_granularity 

Paolo, any chance you could resend to Jens (maybe with hch's comments on
patch#2 accounted for)?  Also, please add hch's Reviewed-by when
reposting.

(would love to see this fixed for 3.5-rcX but if not 3.6 it is?)

WARNING: multiple messages have this Message-ID (diff)
From: Mike Snitzer <snitzer@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: axboe@kernel.dk, xfs@oss.sgi.com, hch@infradead.org,
	device-mapper development <dm-devel@redhat.com>,
	Spelic <spelic@shiftmail.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	linux-ext4@vger.kernel.org
Subject: Re: Ext4 and xfs problems in dm-thin on allocation and discard
Date: Thu, 21 Jun 2012 13:47:43 -0400	[thread overview]
Message-ID: <20120621174742.GA27837@redhat.com> (raw)
In-Reply-To: <20120620225327.GL30705@dastard>

On Wed, Jun 20 2012 at  6:53pm -0400,
Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Jun 20, 2012 at 02:11:31PM +0200, Spelic wrote:
> > Ok guys, I think I found the bug. One or more bugs.
> > 
> > 
> > Pool has chunksize 1MB.
> > In sysfs the thin volume has: queue/discard_max_bytes and
> > queue/discard_granularity are 1048576 .
> > And it has discard_alignment = 0, which based on sysfs-block
> > documentation is correct (a less misleading name would have been
> > discard_offset imho).
> > Here is the blktrace from ext4 fstrim:
> > ...
> > 252,9   17      498     0.030466556   841  Q   D 19898368 + 2048 [fstrim]
> > 252,9   17      499     0.030467501   841  Q   D 19900416 + 2048 [fstrim]
> > 252,9   17      500     0.030468359   841  Q   D 19902464 + 2048 [fstrim]
> > 252,9   17      501     0.030469313   841  Q   D 19904512 + 2048 [fstrim]
> > 252,9   17      502     0.030470144   841  Q   D 19906560 + 2048 [fstrim]
> > 252,9   17      503     0.030471381   841  Q   D 19908608 + 2048 [fstrim]
> > 252,9   17      504     0.030472473   841  Q   D 19910656 + 2048 [fstrim]
> > 252,9   17      505     0.030473504   841  Q   D 19912704 + 2048 [fstrim]
> > 252,9   17      506     0.030474561   841  Q   D 19914752 + 2048 [fstrim]
> > 252,9   17      507     0.030475571   841  Q   D 19916800 + 2048 [fstrim]
> > 252,9   17      508     0.030476423   841  Q   D 19918848 + 2048 [fstrim]
> > 252,9   17      509     0.030477341   841  Q   D 19920896 + 2048 [fstrim]
> > 252,9   17      510     0.034299630   841  Q   D 19922944 + 2048 [fstrim]
> > 252,9   17      511     0.034306880   841  Q   D 19924992 + 2048 [fstrim]
> > 252,9   17      512     0.034307955   841  Q   D 19927040 + 2048 [fstrim]
> > 252,9   17      513     0.034308928   841  Q   D 19929088 + 2048 [fstrim]
> > 252,9   17      514     0.034309945   841  Q   D 19931136 + 2048 [fstrim]
> > 252,9   17      515     0.034311007   841  Q   D 19933184 + 2048 [fstrim]
> > 252,9   17      516     0.034312008   841  Q   D 19935232 + 2048 [fstrim]
> > 252,9   17      517     0.034313122   841  Q   D 19937280 + 2048 [fstrim]
> > 252,9   17      518     0.034314013   841  Q   D 19939328 + 2048 [fstrim]
> > 252,9   17      519     0.034314940   841  Q   D 19941376 + 2048 [fstrim]
> > 252,9   17      520     0.034315835   841  Q   D 19943424 + 2048 [fstrim]
> > 252,9   17      521     0.034316662   841  Q   D 19945472 + 2048 [fstrim]
> > 252,9   17      522     0.034317547   841  Q   D 19947520 + 2048 [fstrim]
> > ...
> > 
> > Here is the blktrace from xfs fstrim:
> > 252,12  16        1     0.000000000   554  Q   D 96 + 2048 [fstrim]
> > 252,12  16        2     0.000010149   554  Q   D 2144 + 2048 [fstrim]
> > 252,12  16        3     0.000011349   554  Q   D 4192 + 2048 [fstrim]
> > 252,12  16        4     0.000012584   554  Q   D 6240 + 2048 [fstrim]
> > 252,12  16        5     0.000013685   554  Q   D 8288 + 2048 [fstrim]
> > 252,12  16        6     0.000014660   554  Q   D 10336 + 2048 [fstrim]
> > 252,12  16        7     0.000015707   554  Q   D 12384 + 2048 [fstrim]
> > 252,12  16        8     0.000016692   554  Q   D 14432 + 2048 [fstrim]
> > 252,12  16        9     0.000017594   554  Q   D 16480 + 2048 [fstrim]
> > 252,12  16       10     0.000018539   554  Q   D 18528 + 2048 [fstrim]
> > 252,12  16       11     0.000019434   554  Q   D 20576 + 2048 [fstrim]
> > 252,12  16       12     0.000020879   554  Q   D 22624 + 2048 [fstrim]
> > 252,12  16       13     0.000021856   554  Q   D 24672 + 2048 [fstrim]
> > 252,12  16       14     0.000022786   554  Q   D 26720 + 2048 [fstrim]
> > 252,12  16       15     0.000023699   554  Q   D 28768 + 2048 [fstrim]
> > 252,12  16       16     0.000024672   554  Q   D 30816 + 2048 [fstrim]
> > 252,12  16       17     0.000025467   554  Q   D 32864 + 2048 [fstrim]
> > 252,12  16       18     0.000026374   554  Q   D 34912 + 2048 [fstrim]
> > 252,12  16       19     0.000027194   554  Q   D 36960 + 2048 [fstrim]
> > 252,12  16       20     0.000028137   554  Q   D 39008 + 2048 [fstrim]
> > 252,12  16       21     0.000029524   554  Q   D 41056 + 2048 [fstrim]
> > 252,12  16       22     0.000030479   554  Q   D 43104 + 2048 [fstrim]
> > 252,12  16       23     0.000031306   554  Q   D 45152 + 2048 [fstrim]
> > 252,12  16       24     0.000032134   554  Q   D 47200 + 2048 [fstrim]
> > 252,12  16       25     0.000032964   554  Q   D 49248 + 2048 [fstrim]
> > 252,12  16       26     0.000033794   554  Q   D 51296 + 2048 [fstrim]
> > 
> > 
> > As you can see, while ext4 correctly aligns the discards to 1MB, xfs
> > does not.
> 
> XFs just sends a large extent to blkdev_issue_discard(), and cares
> nothing about discard alignment or granularity.
> 
> > It looks like an fstrim or xfs bug: they don't look at
> > discard_alignment (=0 ... a less misleading name would be
> > discard_offset imho) + discard_granularity (=1MB) and they don't
> > base alignments on those.
> 
> It looks like blkdev_issue_discard() has reduced each discard to
> bios of a single "granule" (1MB), and not aligned them, hence they
> are ignore by dm-thinp.
> 
> what are the discard parameters exposed by dm-thinp in
> /sys/block/<thinp-blkdev>/queue/discard*
> 
> It looks to me that dmthinp might be setting discard_max_bytes to
> 1MB rather than discard_granularity. Looking at dm-thin.c:
> 
> static void set_discard_limits(struct pool *pool, struct queue_limits *limits)
> {
>         /*
>          * FIXME: these limits may be incompatible with the pool's data device
>          */
>         limits->max_discard_sectors = pool->sectors_per_block;
> 
>         /*
>          * This is just a hint, and not enforced.  We have to cope with
>          * bios that overlap 2 blocks.
>          */
>         limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
>         limits->discard_zeroes_data = pool->pf.zero_new_blocks;
> }
> 
> 
> Yes - discard_max_bytes == discard_granularity, and so
> blkdev_issue_discard fails to align the request properly. As it is,
> setting discard_max_bytes to the thinp block size is silly - it
> means you'll never get range requests, and we sent a discard for
> every single block in a range rather than having the thinp code
> iterate over a range itself.

So 2 different issues:
1) blkdev_issue_discard isn't properly aligning
2) thinp should accept larger discards (up to the stacked
   discard_max_bytes rather than setting an override)

> i.e. this is not a filesystem bug that is causing the problem....

Paolo Bonzini fixed blkdev_issue_discard to properly align some time
ago; unfortunately the patches slipped through the cracks (cc'ing Paolo,
Jens, and Christoph).

Here are references to Paolo's patches:
0/2 https://lkml.org/lkml/2012/3/14/323
1/2 https://lkml.org/lkml/2012/3/14/324
2/2 https://lkml.org/lkml/2012/3/14/325

Patch 2/2 specifically addresses the case where:
 discard_max_bytes == discard_granularity 

Paolo, any chance you could resend to Jens (maybe with hch's comments on
patch#2 accounted for)?  Also, please add hch's Reviewed-by when
reposting.

(would love to see this fixed for 3.5-rcX but if not 3.6 it is?)

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2012-06-21 17:47 UTC|newest]

Thread overview: 72+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-06-18 21:33 Ext4 and xfs problems in dm-thin on allocation and discard Spelic
2012-06-18 21:33 ` Spelic
2012-06-19  1:57 ` Dave Chinner
2012-06-19  1:57   ` Dave Chinner
2012-06-19  3:12   ` Mike Snitzer
2012-06-19  3:12     ` Mike Snitzer
2012-06-19  6:32     ` Lukáš Czerner
2012-06-19  6:32       ` Lukáš Czerner
2012-06-19 11:29       ` Spelic
2012-06-19 11:29         ` Spelic
2012-06-19 12:20         ` Lukáš Czerner
2012-06-19 12:20           ` Lukáš Czerner
2012-06-19 13:34         ` Mike Snitzer
2012-06-19 13:34           ` Mike Snitzer
2012-06-19 13:16       ` Mike Snitzer
2012-06-19 13:16         ` Mike Snitzer
2012-06-19 13:25         ` Lukáš Czerner
2012-06-19 13:25           ` Lukáš Czerner
2012-06-19 13:30           ` Mike Snitzer
2012-06-19 13:30             ` Mike Snitzer
2012-06-19 13:52             ` Spelic
2012-06-19 13:52               ` Spelic
2012-06-19 14:05               ` Eric Sandeen
2012-06-19 14:05                 ` Eric Sandeen
2012-06-19 14:44               ` Mike Snitzer
2012-06-19 14:44                 ` Mike Snitzer
2012-06-19 18:48                 ` Mike Snitzer
2012-06-19 18:48                   ` Mike Snitzer
2012-06-19 20:06                   ` Dave Chinner
2012-06-19 20:06                     ` Dave Chinner
2012-06-19 20:21                     ` Ted Ts'o
2012-06-19 20:21                       ` Ted Ts'o
2012-06-19 20:39                       ` Dave Chinner
2012-06-19 20:39                         ` Dave Chinner
2012-06-20  9:01                         ` Christoph Hellwig
2012-06-20  9:01                           ` Christoph Hellwig
2012-06-19 21:37                     ` Spelic
2012-06-19 21:37                       ` Spelic
2012-06-19 23:12                       ` Dave Chinner
2012-06-19 23:12                         ` Dave Chinner
2012-06-20 12:11   ` Spelic
2012-06-20 12:11     ` Spelic
2012-06-20 22:53     ` Dave Chinner
2012-06-20 22:53       ` Dave Chinner
2012-06-21 17:47       ` Mike Snitzer [this message]
2012-06-21 17:47         ` Mike Snitzer
2012-06-21 23:29         ` Dave Chinner
2012-06-21 23:29           ` Dave Chinner
2012-07-01 14:53         ` Paolo Bonzini
2012-07-01 14:53           ` Paolo Bonzini
2012-07-02 13:00           ` Mike Snitzer
2012-07-02 13:00             ` Mike Snitzer
2012-07-02 13:15             ` Paolo Bonzini
2012-07-02 13:15               ` Paolo Bonzini
2012-06-19 14:09 ` Lukáš Czerner
2012-06-19 14:09   ` Lukáš Czerner
2012-06-19 14:19   ` Ted Ts'o
2012-06-19 14:19     ` Ted Ts'o
2012-06-19 14:23     ` Eric Sandeen
2012-06-19 14:23       ` Eric Sandeen
2012-06-19 14:37     ` Lukáš Czerner
2012-06-19 14:37       ` Lukáš Czerner
2012-06-19 14:43     ` [dm-devel] " Alasdair G Kergon
2012-06-19 14:43       ` Alasdair G Kergon
2012-06-19 15:28       ` Mike Snitzer
2012-06-19 15:28         ` Mike Snitzer
2012-06-19 16:03         ` [dm-devel] " Alasdair G Kergon
2012-06-19 16:03           ` Alasdair G Kergon
2012-06-19 19:58         ` Ted Ts'o
2012-06-19 19:58           ` Ted Ts'o
2012-06-19 20:44           ` Mike Snitzer
2012-06-19 20:44             ` Mike Snitzer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120621174742.GA27837@redhat.com \
    --to=snitzer@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=david@fromorbit.com \
    --cc=dm-devel@redhat.com \
    --cc=hch@infradead.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=spelic@shiftmail.org \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.