From: Mike Snitzer <snitzer@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Spelic <spelic@shiftmail.org>,
device-mapper development <dm-devel@redhat.com>,
linux-ext4@vger.kernel.org, xfs@oss.sgi.com,
Paolo Bonzini <pbonzini@redhat.com>,
axboe@kernel.dk, hch@infradead.org
Subject: Re: Ext4 and xfs problems in dm-thin on allocation and discard
Date: Thu, 21 Jun 2012 13:47:43 -0400 [thread overview]
Message-ID: <20120621174742.GA27837@redhat.com> (raw)
In-Reply-To: <20120620225327.GL30705@dastard>
On Wed, Jun 20 2012 at 6:53pm -0400,
Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Jun 20, 2012 at 02:11:31PM +0200, Spelic wrote:
> > Ok guys, I think I found the bug. One or more bugs.
> >
> >
> > Pool has chunksize 1MB.
> > In sysfs the thin volume has: queue/discard_max_bytes and
> > queue/discard_granularity are 1048576 .
> > And it has discard_alignment = 0, which based on sysfs-block
> > documentation is correct (a less misleading name would have been
> > discard_offset imho).
> > Here is the blktrace from ext4 fstrim:
> > ...
> > 252,9 17 498 0.030466556 841 Q D 19898368 + 2048 [fstrim]
> > 252,9 17 499 0.030467501 841 Q D 19900416 + 2048 [fstrim]
> > 252,9 17 500 0.030468359 841 Q D 19902464 + 2048 [fstrim]
> > 252,9 17 501 0.030469313 841 Q D 19904512 + 2048 [fstrim]
> > 252,9 17 502 0.030470144 841 Q D 19906560 + 2048 [fstrim]
> > 252,9 17 503 0.030471381 841 Q D 19908608 + 2048 [fstrim]
> > 252,9 17 504 0.030472473 841 Q D 19910656 + 2048 [fstrim]
> > 252,9 17 505 0.030473504 841 Q D 19912704 + 2048 [fstrim]
> > 252,9 17 506 0.030474561 841 Q D 19914752 + 2048 [fstrim]
> > 252,9 17 507 0.030475571 841 Q D 19916800 + 2048 [fstrim]
> > 252,9 17 508 0.030476423 841 Q D 19918848 + 2048 [fstrim]
> > 252,9 17 509 0.030477341 841 Q D 19920896 + 2048 [fstrim]
> > 252,9 17 510 0.034299630 841 Q D 19922944 + 2048 [fstrim]
> > 252,9 17 511 0.034306880 841 Q D 19924992 + 2048 [fstrim]
> > 252,9 17 512 0.034307955 841 Q D 19927040 + 2048 [fstrim]
> > 252,9 17 513 0.034308928 841 Q D 19929088 + 2048 [fstrim]
> > 252,9 17 514 0.034309945 841 Q D 19931136 + 2048 [fstrim]
> > 252,9 17 515 0.034311007 841 Q D 19933184 + 2048 [fstrim]
> > 252,9 17 516 0.034312008 841 Q D 19935232 + 2048 [fstrim]
> > 252,9 17 517 0.034313122 841 Q D 19937280 + 2048 [fstrim]
> > 252,9 17 518 0.034314013 841 Q D 19939328 + 2048 [fstrim]
> > 252,9 17 519 0.034314940 841 Q D 19941376 + 2048 [fstrim]
> > 252,9 17 520 0.034315835 841 Q D 19943424 + 2048 [fstrim]
> > 252,9 17 521 0.034316662 841 Q D 19945472 + 2048 [fstrim]
> > 252,9 17 522 0.034317547 841 Q D 19947520 + 2048 [fstrim]
> > ...
> >
> > Here is the blktrace from xfs fstrim:
> > 252,12 16 1 0.000000000 554 Q D 96 + 2048 [fstrim]
> > 252,12 16 2 0.000010149 554 Q D 2144 + 2048 [fstrim]
> > 252,12 16 3 0.000011349 554 Q D 4192 + 2048 [fstrim]
> > 252,12 16 4 0.000012584 554 Q D 6240 + 2048 [fstrim]
> > 252,12 16 5 0.000013685 554 Q D 8288 + 2048 [fstrim]
> > 252,12 16 6 0.000014660 554 Q D 10336 + 2048 [fstrim]
> > 252,12 16 7 0.000015707 554 Q D 12384 + 2048 [fstrim]
> > 252,12 16 8 0.000016692 554 Q D 14432 + 2048 [fstrim]
> > 252,12 16 9 0.000017594 554 Q D 16480 + 2048 [fstrim]
> > 252,12 16 10 0.000018539 554 Q D 18528 + 2048 [fstrim]
> > 252,12 16 11 0.000019434 554 Q D 20576 + 2048 [fstrim]
> > 252,12 16 12 0.000020879 554 Q D 22624 + 2048 [fstrim]
> > 252,12 16 13 0.000021856 554 Q D 24672 + 2048 [fstrim]
> > 252,12 16 14 0.000022786 554 Q D 26720 + 2048 [fstrim]
> > 252,12 16 15 0.000023699 554 Q D 28768 + 2048 [fstrim]
> > 252,12 16 16 0.000024672 554 Q D 30816 + 2048 [fstrim]
> > 252,12 16 17 0.000025467 554 Q D 32864 + 2048 [fstrim]
> > 252,12 16 18 0.000026374 554 Q D 34912 + 2048 [fstrim]
> > 252,12 16 19 0.000027194 554 Q D 36960 + 2048 [fstrim]
> > 252,12 16 20 0.000028137 554 Q D 39008 + 2048 [fstrim]
> > 252,12 16 21 0.000029524 554 Q D 41056 + 2048 [fstrim]
> > 252,12 16 22 0.000030479 554 Q D 43104 + 2048 [fstrim]
> > 252,12 16 23 0.000031306 554 Q D 45152 + 2048 [fstrim]
> > 252,12 16 24 0.000032134 554 Q D 47200 + 2048 [fstrim]
> > 252,12 16 25 0.000032964 554 Q D 49248 + 2048 [fstrim]
> > 252,12 16 26 0.000033794 554 Q D 51296 + 2048 [fstrim]
> >
> >
> > As you can see, while ext4 correctly aligns the discards to 1MB, xfs
> > does not.
>
> XFs just sends a large extent to blkdev_issue_discard(), and cares
> nothing about discard alignment or granularity.
>
> > It looks like an fstrim or xfs bug: they don't look at
> > discard_alignment (=0 ... a less misleading name would be
> > discard_offset imho) + discard_granularity (=1MB) and they don't
> > base alignments on those.
>
> It looks like blkdev_issue_discard() has reduced each discard to
> bios of a single "granule" (1MB), and not aligned them, hence they
> are ignore by dm-thinp.
>
> what are the discard parameters exposed by dm-thinp in
> /sys/block/<thinp-blkdev>/queue/discard*
>
> It looks to me that dmthinp might be setting discard_max_bytes to
> 1MB rather than discard_granularity. Looking at dm-thin.c:
>
> static void set_discard_limits(struct pool *pool, struct queue_limits *limits)
> {
> /*
> * FIXME: these limits may be incompatible with the pool's data device
> */
> limits->max_discard_sectors = pool->sectors_per_block;
>
> /*
> * This is just a hint, and not enforced. We have to cope with
> * bios that overlap 2 blocks.
> */
> limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
> limits->discard_zeroes_data = pool->pf.zero_new_blocks;
> }
>
>
> Yes - discard_max_bytes == discard_granularity, and so
> blkdev_issue_discard fails to align the request properly. As it is,
> setting discard_max_bytes to the thinp block size is silly - it
> means you'll never get range requests, and we sent a discard for
> every single block in a range rather than having the thinp code
> iterate over a range itself.
So 2 different issues:
1) blkdev_issue_discard isn't properly aligning
2) thinp should accept larger discards (up to the stacked
discard_max_bytes rather than setting an override)
> i.e. this is not a filesystem bug that is causing the problem....
Paolo Bonzini fixed blkdev_issue_discard to properly align some time
ago; unfortunately the patches slipped through the cracks (cc'ing Paolo,
Jens, and Christoph).
Here are references to Paolo's patches:
0/2 https://lkml.org/lkml/2012/3/14/323
1/2 https://lkml.org/lkml/2012/3/14/324
2/2 https://lkml.org/lkml/2012/3/14/325
Patch 2/2 specifically addresses the case where:
discard_max_bytes == discard_granularity
Paolo, any chance you could resend to Jens (maybe with hch's comments on
patch#2 accounted for)? Also, please add hch's Reviewed-by when
reposting.
(would love to see this fixed for 3.5-rcX but if not 3.6 it is?)
WARNING: multiple messages have this Message-ID (diff)
From: Mike Snitzer <snitzer@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: axboe@kernel.dk, xfs@oss.sgi.com, hch@infradead.org,
device-mapper development <dm-devel@redhat.com>,
Spelic <spelic@shiftmail.org>,
Paolo Bonzini <pbonzini@redhat.com>,
linux-ext4@vger.kernel.org
Subject: Re: Ext4 and xfs problems in dm-thin on allocation and discard
Date: Thu, 21 Jun 2012 13:47:43 -0400 [thread overview]
Message-ID: <20120621174742.GA27837@redhat.com> (raw)
In-Reply-To: <20120620225327.GL30705@dastard>
On Wed, Jun 20 2012 at 6:53pm -0400,
Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Jun 20, 2012 at 02:11:31PM +0200, Spelic wrote:
> > Ok guys, I think I found the bug. One or more bugs.
> >
> >
> > Pool has chunksize 1MB.
> > In sysfs the thin volume has: queue/discard_max_bytes and
> > queue/discard_granularity are 1048576 .
> > And it has discard_alignment = 0, which based on sysfs-block
> > documentation is correct (a less misleading name would have been
> > discard_offset imho).
> > Here is the blktrace from ext4 fstrim:
> > ...
> > 252,9 17 498 0.030466556 841 Q D 19898368 + 2048 [fstrim]
> > 252,9 17 499 0.030467501 841 Q D 19900416 + 2048 [fstrim]
> > 252,9 17 500 0.030468359 841 Q D 19902464 + 2048 [fstrim]
> > 252,9 17 501 0.030469313 841 Q D 19904512 + 2048 [fstrim]
> > 252,9 17 502 0.030470144 841 Q D 19906560 + 2048 [fstrim]
> > 252,9 17 503 0.030471381 841 Q D 19908608 + 2048 [fstrim]
> > 252,9 17 504 0.030472473 841 Q D 19910656 + 2048 [fstrim]
> > 252,9 17 505 0.030473504 841 Q D 19912704 + 2048 [fstrim]
> > 252,9 17 506 0.030474561 841 Q D 19914752 + 2048 [fstrim]
> > 252,9 17 507 0.030475571 841 Q D 19916800 + 2048 [fstrim]
> > 252,9 17 508 0.030476423 841 Q D 19918848 + 2048 [fstrim]
> > 252,9 17 509 0.030477341 841 Q D 19920896 + 2048 [fstrim]
> > 252,9 17 510 0.034299630 841 Q D 19922944 + 2048 [fstrim]
> > 252,9 17 511 0.034306880 841 Q D 19924992 + 2048 [fstrim]
> > 252,9 17 512 0.034307955 841 Q D 19927040 + 2048 [fstrim]
> > 252,9 17 513 0.034308928 841 Q D 19929088 + 2048 [fstrim]
> > 252,9 17 514 0.034309945 841 Q D 19931136 + 2048 [fstrim]
> > 252,9 17 515 0.034311007 841 Q D 19933184 + 2048 [fstrim]
> > 252,9 17 516 0.034312008 841 Q D 19935232 + 2048 [fstrim]
> > 252,9 17 517 0.034313122 841 Q D 19937280 + 2048 [fstrim]
> > 252,9 17 518 0.034314013 841 Q D 19939328 + 2048 [fstrim]
> > 252,9 17 519 0.034314940 841 Q D 19941376 + 2048 [fstrim]
> > 252,9 17 520 0.034315835 841 Q D 19943424 + 2048 [fstrim]
> > 252,9 17 521 0.034316662 841 Q D 19945472 + 2048 [fstrim]
> > 252,9 17 522 0.034317547 841 Q D 19947520 + 2048 [fstrim]
> > ...
> >
> > Here is the blktrace from xfs fstrim:
> > 252,12 16 1 0.000000000 554 Q D 96 + 2048 [fstrim]
> > 252,12 16 2 0.000010149 554 Q D 2144 + 2048 [fstrim]
> > 252,12 16 3 0.000011349 554 Q D 4192 + 2048 [fstrim]
> > 252,12 16 4 0.000012584 554 Q D 6240 + 2048 [fstrim]
> > 252,12 16 5 0.000013685 554 Q D 8288 + 2048 [fstrim]
> > 252,12 16 6 0.000014660 554 Q D 10336 + 2048 [fstrim]
> > 252,12 16 7 0.000015707 554 Q D 12384 + 2048 [fstrim]
> > 252,12 16 8 0.000016692 554 Q D 14432 + 2048 [fstrim]
> > 252,12 16 9 0.000017594 554 Q D 16480 + 2048 [fstrim]
> > 252,12 16 10 0.000018539 554 Q D 18528 + 2048 [fstrim]
> > 252,12 16 11 0.000019434 554 Q D 20576 + 2048 [fstrim]
> > 252,12 16 12 0.000020879 554 Q D 22624 + 2048 [fstrim]
> > 252,12 16 13 0.000021856 554 Q D 24672 + 2048 [fstrim]
> > 252,12 16 14 0.000022786 554 Q D 26720 + 2048 [fstrim]
> > 252,12 16 15 0.000023699 554 Q D 28768 + 2048 [fstrim]
> > 252,12 16 16 0.000024672 554 Q D 30816 + 2048 [fstrim]
> > 252,12 16 17 0.000025467 554 Q D 32864 + 2048 [fstrim]
> > 252,12 16 18 0.000026374 554 Q D 34912 + 2048 [fstrim]
> > 252,12 16 19 0.000027194 554 Q D 36960 + 2048 [fstrim]
> > 252,12 16 20 0.000028137 554 Q D 39008 + 2048 [fstrim]
> > 252,12 16 21 0.000029524 554 Q D 41056 + 2048 [fstrim]
> > 252,12 16 22 0.000030479 554 Q D 43104 + 2048 [fstrim]
> > 252,12 16 23 0.000031306 554 Q D 45152 + 2048 [fstrim]
> > 252,12 16 24 0.000032134 554 Q D 47200 + 2048 [fstrim]
> > 252,12 16 25 0.000032964 554 Q D 49248 + 2048 [fstrim]
> > 252,12 16 26 0.000033794 554 Q D 51296 + 2048 [fstrim]
> >
> >
> > As you can see, while ext4 correctly aligns the discards to 1MB, xfs
> > does not.
>
> XFs just sends a large extent to blkdev_issue_discard(), and cares
> nothing about discard alignment or granularity.
>
> > It looks like an fstrim or xfs bug: they don't look at
> > discard_alignment (=0 ... a less misleading name would be
> > discard_offset imho) + discard_granularity (=1MB) and they don't
> > base alignments on those.
>
> It looks like blkdev_issue_discard() has reduced each discard to
> bios of a single "granule" (1MB), and not aligned them, hence they
> are ignore by dm-thinp.
>
> what are the discard parameters exposed by dm-thinp in
> /sys/block/<thinp-blkdev>/queue/discard*
>
> It looks to me that dmthinp might be setting discard_max_bytes to
> 1MB rather than discard_granularity. Looking at dm-thin.c:
>
> static void set_discard_limits(struct pool *pool, struct queue_limits *limits)
> {
> /*
> * FIXME: these limits may be incompatible with the pool's data device
> */
> limits->max_discard_sectors = pool->sectors_per_block;
>
> /*
> * This is just a hint, and not enforced. We have to cope with
> * bios that overlap 2 blocks.
> */
> limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
> limits->discard_zeroes_data = pool->pf.zero_new_blocks;
> }
>
>
> Yes - discard_max_bytes == discard_granularity, and so
> blkdev_issue_discard fails to align the request properly. As it is,
> setting discard_max_bytes to the thinp block size is silly - it
> means you'll never get range requests, and we sent a discard for
> every single block in a range rather than having the thinp code
> iterate over a range itself.
So 2 different issues:
1) blkdev_issue_discard isn't properly aligning
2) thinp should accept larger discards (up to the stacked
discard_max_bytes rather than setting an override)
> i.e. this is not a filesystem bug that is causing the problem....
Paolo Bonzini fixed blkdev_issue_discard to properly align some time
ago; unfortunately the patches slipped through the cracks (cc'ing Paolo,
Jens, and Christoph).
Here are references to Paolo's patches:
0/2 https://lkml.org/lkml/2012/3/14/323
1/2 https://lkml.org/lkml/2012/3/14/324
2/2 https://lkml.org/lkml/2012/3/14/325
Patch 2/2 specifically addresses the case where:
discard_max_bytes == discard_granularity
Paolo, any chance you could resend to Jens (maybe with hch's comments on
patch#2 accounted for)? Also, please add hch's Reviewed-by when
reposting.
(would love to see this fixed for 3.5-rcX but if not 3.6 it is?)
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2012-06-21 17:47 UTC|newest]
Thread overview: 72+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-06-18 21:33 Ext4 and xfs problems in dm-thin on allocation and discard Spelic
2012-06-18 21:33 ` Spelic
2012-06-19 1:57 ` Dave Chinner
2012-06-19 1:57 ` Dave Chinner
2012-06-19 3:12 ` Mike Snitzer
2012-06-19 3:12 ` Mike Snitzer
2012-06-19 6:32 ` Lukáš Czerner
2012-06-19 6:32 ` Lukáš Czerner
2012-06-19 11:29 ` Spelic
2012-06-19 11:29 ` Spelic
2012-06-19 12:20 ` Lukáš Czerner
2012-06-19 12:20 ` Lukáš Czerner
2012-06-19 13:34 ` Mike Snitzer
2012-06-19 13:34 ` Mike Snitzer
2012-06-19 13:16 ` Mike Snitzer
2012-06-19 13:16 ` Mike Snitzer
2012-06-19 13:25 ` Lukáš Czerner
2012-06-19 13:25 ` Lukáš Czerner
2012-06-19 13:30 ` Mike Snitzer
2012-06-19 13:30 ` Mike Snitzer
2012-06-19 13:52 ` Spelic
2012-06-19 13:52 ` Spelic
2012-06-19 14:05 ` Eric Sandeen
2012-06-19 14:05 ` Eric Sandeen
2012-06-19 14:44 ` Mike Snitzer
2012-06-19 14:44 ` Mike Snitzer
2012-06-19 18:48 ` Mike Snitzer
2012-06-19 18:48 ` Mike Snitzer
2012-06-19 20:06 ` Dave Chinner
2012-06-19 20:06 ` Dave Chinner
2012-06-19 20:21 ` Ted Ts'o
2012-06-19 20:21 ` Ted Ts'o
2012-06-19 20:39 ` Dave Chinner
2012-06-19 20:39 ` Dave Chinner
2012-06-20 9:01 ` Christoph Hellwig
2012-06-20 9:01 ` Christoph Hellwig
2012-06-19 21:37 ` Spelic
2012-06-19 21:37 ` Spelic
2012-06-19 23:12 ` Dave Chinner
2012-06-19 23:12 ` Dave Chinner
2012-06-20 12:11 ` Spelic
2012-06-20 12:11 ` Spelic
2012-06-20 22:53 ` Dave Chinner
2012-06-20 22:53 ` Dave Chinner
2012-06-21 17:47 ` Mike Snitzer [this message]
2012-06-21 17:47 ` Mike Snitzer
2012-06-21 23:29 ` Dave Chinner
2012-06-21 23:29 ` Dave Chinner
2012-07-01 14:53 ` Paolo Bonzini
2012-07-01 14:53 ` Paolo Bonzini
2012-07-02 13:00 ` Mike Snitzer
2012-07-02 13:00 ` Mike Snitzer
2012-07-02 13:15 ` Paolo Bonzini
2012-07-02 13:15 ` Paolo Bonzini
2012-06-19 14:09 ` Lukáš Czerner
2012-06-19 14:09 ` Lukáš Czerner
2012-06-19 14:19 ` Ted Ts'o
2012-06-19 14:19 ` Ted Ts'o
2012-06-19 14:23 ` Eric Sandeen
2012-06-19 14:23 ` Eric Sandeen
2012-06-19 14:37 ` Lukáš Czerner
2012-06-19 14:37 ` Lukáš Czerner
2012-06-19 14:43 ` [dm-devel] " Alasdair G Kergon
2012-06-19 14:43 ` Alasdair G Kergon
2012-06-19 15:28 ` Mike Snitzer
2012-06-19 15:28 ` Mike Snitzer
2012-06-19 16:03 ` [dm-devel] " Alasdair G Kergon
2012-06-19 16:03 ` Alasdair G Kergon
2012-06-19 19:58 ` Ted Ts'o
2012-06-19 19:58 ` Ted Ts'o
2012-06-19 20:44 ` Mike Snitzer
2012-06-19 20:44 ` Mike Snitzer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120621174742.GA27837@redhat.com \
--to=snitzer@redhat.com \
--cc=axboe@kernel.dk \
--cc=david@fromorbit.com \
--cc=dm-devel@redhat.com \
--cc=hch@infradead.org \
--cc=linux-ext4@vger.kernel.org \
--cc=pbonzini@redhat.com \
--cc=spelic@shiftmail.org \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.