* dm-cache questions
@ 2013-12-10 1:56 Paul B. Henson
2013-12-10 9:50 ` Joe Thornber
0 siblings, 1 reply; 11+ messages in thread
From: Paul B. Henson @ 2013-12-10 1:56 UTC (permalink / raw)
To: dm-devel
I'm building a small virtualization server for which I'd like to avail of
ssd caching to increase performance. While there seems to be an increasing
plethora of options for ssd caching under linux, I'd like to stick with
something that's part of the mainline kernel, which I think restricts the
playing field down to bcache or dm-cache.
After reviewing the dm-cache documentation and mailing list archives, I had
a few questions I hope somebody might be able to answer; I apologize in
advance if any of them are silly or something I should've already found on
my own.
I've got four WD RE4 2TB drives that I plan to configure as RAID10 for the
data device, and two Samsung 840 Pro 256GB SSD's that I plan to configure as
RAID1 for the cache device. I'd like to set up write back caching to improve
both read and write performance. I was going to set up lvm on top of the
cached device and then use lv's as the backing store for kvm virtual
machines.
Is dm-cache considered ready for production deployment? From what I
understand, there are plans to add support for managing dm-cache to lvm2,
and without that it's a bit cryptic to use/set up. I see that Fedora has
deferred including support for dm-cache into their distribution pending that
lvm2 support, but other than easing configuration/management, are there any
reasons not to go ahead and deploy dm-cache in production now working with
it directly rather than through lvm2?
What is the recommended kernel version for using dm-cache? Would 3.10LTS be
suitable, or would it be better at this point to be running the latest
stable, eg 3.12.x now, and then 3.13.x once 3.12 goes EOL, to be sure to
have the latest bug fixes and performance enhancements?
From reviewing the documentation, in addition to the origin/backing device
and the cache device, a third device is necessary for metadata. Per the
documentation the rationale for having a separate device for metadata rather
than simply using the cache device is so that the metadevice can be
configured with different redundancy; the example given is that perhaps it
could be mirrored. I'm confused though as to what utility there is in having
a metadata device with a different level of redundancy than the cache
device. If the metadata device is mirrored, and the cache device is not, you
will still be able to access the metadata should the cache device fail, but
given the cache device has failed, what are you going to do with it?
Conversely, if the cache device is mirrored, and the metadata device is not,
should the metadata device fail, how are you going to use your cache? I can
see potentially having the origin device redundant, and the cache device
not, assuming you are not using write back caching, but I don't initially
see a scenario where you would configure a cache device and a metadevice
with different availability characteristics.
What are the performance requirements of the metadevice? For my system, I
can either put it on the cache device, on the origin device, or I have
another mirror of two USB sticks used for /boot that it could go on.
Intuitively it seems the metadata device should be fast/low latency, so my
first guess would be the best location would be on the SSD mirror I'm using
for cache. Based on the examples I've seen, you can either partition the
device into two pieces to separate metadata from cache, or use dm-linear.
I'm thinking I'll go with partitioning as that seems simpler and I'm more
familiar with it, although I suppose that will result in a little bit of
waste for the partition table and alignment.
With bcache, they recommend selecting the bucket size and block size based
on the specifications of your SSD, is there any similar recommended
alignment with the underlying SSD for selecting dm-cache block size? The SSD
I am using has a 1024k erase block size and an 8k page size. Or should the
block size be tuned based more on the size of the origin device relative to
the cache device and your expected I/O sizes, with no particular regard for
the physical characteristics of your SSD?
From what I've read, the rule of thumb algorithm for sizing your metadata
device is 4 MB + ( 16 bytes * nr_blocks ). Is that still accurate? So, if I
hypothetically selected a 256k block size, I would calculate it as:
# blockdev --getsize64 /dev/md2 (ssd mirror)
255926140928
4194304 + (16 * 255926140928 / 262144) = 19814796
So I would need to make a partition of size approximately 19MB for the
metadata? Then, assuming I partitioned md2 into md2p1 (metadata) and md2p2
(cache), and my origin device was md3, I could create the cache device via:
# blockdev --getsz /dev/md3
7813531648
# dmsetup create md3-cached --table '0 7813531648 cache /dev/md2p1
/dev/md2p2 /dev/md3 512 1 writeback default 0'
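The arithmetic and table layout above can be sketched as a small shell script. This is only an illustration under the assumptions stated in the message (the 4 MB + 16 bytes per block rule of thumb, a 256k block size expressed as 512 sectors, and the md2p1/md2p2/md3 device names); the function names are hypothetical, and the script only prints values, it does not call dmsetup or touch any device.

```shell
#!/bin/sh
# Sketch only: reproduces the numbers from the message, creates nothing.

# Rule of thumb quoted above: 4 MB + 16 bytes per cache block.
# $1 = cache device size in bytes, $2 = cache block size in bytes.
metadata_bytes() {
    echo $(( 4194304 + 16 * $1 / $2 ))
}

# Assemble a dm-cache table line in the same field order as the
# dmsetup example above.
# $1 = origin size in 512-byte sectors, $2 = metadata device,
# $3 = cache device, $4 = origin device,
# $5 = block size in 512-byte sectors, $6 = writeback or writethrough.
cache_table() {
    echo "0 $1 cache $2 $3 $4 $5 1 $6 default 0"
}

# Values taken from the blockdev output in the message.
metadata_bytes 255926140928 262144
cache_table 7813531648 /dev/md2p1 /dev/md2p2 /dev/md3 512 writeback
```

The printed table string is what would be passed to 'dmsetup create md3-cached --table'; keeping it on one line avoids the wrapping seen in the command as mailed.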
For shutdown, you should then arrange to run 'dmsetup suspend md3-cached'
at reboot/halt so it goes down cleanly? From what I read, dm-cache should be
reasonably robust in the face of a crash/panic, so this is really more of an
optimization as opposed to a hard requirement?
Just a couple more miscellaneous questions :), is there any way to switch
between modes/policies without downtime on the cache device? For example, if
one of the SSD's failed and you wanted to switch to write through mode
rather than write back until you replaced it and the mirror was healthy
again?
Is there any support or integration with SSD TRIM for the cache device? Not
necessarily in real-time, as that can degrade performance, but occasionally
in batch ala fstrim for filesystems, to get dm-cache to TRIM all of the not
in use blocks at that time in order to optimize the SSD garbage collector?
If you have read this far, thank you very much :), I'm sorry for such a long
message, but I'm trying to wrap my head around this and be sure I have a
good understanding before using it.
Thanks.
* Re: dm-cache questions
From: Joe Thornber @ 2013-12-10 9:50 UTC (permalink / raw)
To: device-mapper development
On Mon, Dec 09, 2013 at 05:56:03PM -0800, Paul B. Henson wrote:
> I'm building a small virtualization server for which I'd like to avail of
> ssd caching to increase performance. While there seems to be an increasing
> plethora of options for ssd caching under linux, I'd like to stick with
> something that's part of the mainline kernel, which I think restricts the
> playing field down to bcache or dm-cache.
>
> After reviewing the dm-cache documentation and mailing list archives, I had
> a few questions I hope somebody might be able to answer; I apologize in
> advance if any of them are silly or something I should've already found on
> my own.
>
> I've got four WD RE4 2TB drives that I plan to configure as RAID10 for the
> data device, and two Samsung 840 Pro 256GB SSD's that I plan to configure as
> RAID1 for the cache device. I'd like to set up write back caching to improve
> both read and write performance. I was going to set up lvm on top of the
> cached device and then use lv's as the backing store for kvm virtual
> machines.
>
> Is dm-cache considered ready for production deployment? From what I
> understand, there are plans to add support for managing dm-cache to lvm2,
> and without that it's a bit cryptic to use/set up. I see that Fedora has
> deferred including support for dm-cache into their distribution pending that
> lvm2 support, but other than easing configuration/management, are there any
> reasons not to go ahead and deploy dm-cache in production now working with
> it directly rather than through lvm2?
I've just found a serious bug that causes metadata space to be used up
too quickly. So hold off until I get a patch together later this week.
> What is the recommended kernel version for using dm-cache? Would 3.10LTS be
> suitable, or would it be better at this point to be running the latest
> stable, eg 3.12.x now, and then 3.13.x once 3.12 goes EOL, to be sure to
> have the latest bug fixes and performance enhancements?
>
> From reviewing the documentation, in addition to the origin/backing device
> and the cache device, a third device is necessary for metadata. Per the
> documentation the rationale for having a separate device for metadata rather
> than simply using the cache device is so that the metadevice can be
> configured with different redundancy; the example given is that perhaps it
> could be mirrored. I'm confused though as to what utility there is in having
> a metadata device with a different level of redundancy than the cache
> device. If the metadata device is mirrored, and the cache device is not, you
> will still be able to access the metadata should the cache device fail, but
> given the cache device has failed, what are you going to do with it?
The metadata is stored in btrees, damaging a high level node in this
btree can lose an awful lot of mappings. So I recommend mirroring it.
dm-cache uses the same metadata library as dm-thin, where commonly
people would want to put the metadata on SSD and the data on a
spindle. Your volume manager (eg, LVM2) should be doing this
transparently for you.
> What are the performance requirements of the metadevice? For my system, I
> can either put it on the cache device, on the origin device, or I have
> another mirror of two USB sticks used for /boot that it could go on.
> Intuitively it seems the metadata device should be fast/low latency, so my
> first guess would be the best location would be on the SSD mirror I'm using
> for cache. Based on the examples I've seen, you can either partition the
> device into two pieces to separate metadata from cache, or use dm-linear.
> I'm thinking I'll go with partitioning as that seems simpler and I'm more
> familiar with it, although I suppose that will result in a little bit of
> waste for the partition table and alignment.
Personally I'd use linear, since it allows you to resize easily.
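As a rough sketch of the dm-linear approach, the script below prints two linear tables that carve one SSD device into a metadata region at the start and a cache region covering the rest. The function name, the /dev/md2 device and the 128 MiB metadata size are illustrative assumptions, and nothing here runs dmsetup; each printed line is in the usual "start length linear device offset" form, all in 512-byte sectors.

```shell
#!/bin/sh
# Sketch only: prints dm-linear tables, does not create any devices.

# $1 = underlying device, $2 = its size in 512-byte sectors,
# $3 = sectors to reserve for metadata at the start of the device.
linear_tables() {
    # Metadata region: the first $3 sectors.
    echo "0 $3 linear $1 0"
    # Cache region: everything after the metadata region.
    echo "0 $(( $2 - $3 )) linear $1 $3"
}

# 255926140928 bytes / 512 = 499855744 sectors for the SSD mirror;
# 128 MiB of metadata = 262144 sectors (an assumed, generous figure).
linear_tables /dev/md2 499855744 262144
```

Each line could then be given to 'dmsetup create <name> --table' to produce the metadata and cache devices. Growing one region later means reloading its table with a new length, which is the ease of resizing referred to above.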
> With bcache, they recommend selecting the bucket size and block size based
> on the specifications of your SSD, is there any similar recommended
> alignment with the underlying SSD for selecting dm-cache block size? The SSD
> I am using has a 1024k erase block size and an 8k page size. Or should the
> block size be tuned based more on the size of the origin device relative to
> the cache device and your expected I/O sizes, with no particular regard for
> the physical characteristics of your SSD?
bcache is log based, and so uses the ssds more efficiently. For
dm-cache I'd say your IO patterns and the size of the hotspots are the
dominant factors.
> From what I've read, the rule of thumb algorithm for sizing your metadata
> device is 4 MB + ( 16 bytes * nr_blocks ). Is that still accurate? So, if I
> hypothetically selected a 256k block size, I would calculate it as:
>
> # blockdev --getsize64 /dev/md2 (ssd mirror)
> 255926140928
>
> 4194304 + (16 * 255926140928 / 262144) = 19814796
>
> So I would need to make a partition of size approximately 19MB for the
> metadata? Then, assuming I partitioned md2 into md2p1 (metadata) and md2p2
> (cache), and my origin device was md3, I could create the cache device via:
Yes, but go crazy and round up to 128m since it's so small.
>
> # blockdev --getsz /dev/md3
> 7813531648
>
> # dmsetup create md3-cached --table '0 7813531648 cache /dev/md2p1
> /dev/md2p2 /dev/md3 512 1 writeback default 0'
>
> For shutdown, you should then arrange to run 'dmsetup suspend md3-cached'
> at reboot/halt so it goes down cleanly? From what I read, dm-cache should be
> reasonably robust in the face of a crash/panic, so this is really more of an
> optimization as opposed to a hard requirement?
Definitely shut down cleanly. This allows the policy plugin to write
its 'hint' array that will improve performance on reload. For
example the default 'mq' policy stores the hit counts for the cached
blocks.
> Just a couple more miscellaneous questions :), is there any way to switch
> between modes/policies without downtime on the cache device? For example, if
> one of the SSD's failed and you wanted to switch to write through mode
> rather than write back until you replaced it and the mirror was healthy
> again?
Use the normal suspend, reload, resume cycle.
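For illustration, the suspend, reload, resume cycle might look like the sketch below when switching the example device from writeback to writethrough. The helper name is hypothetical and the device name and table are the ones from the original question; the script only prints the dmsetup commands rather than running them. Whether dirty blocks need any special handling before changing modes is not covered in this thread, so treat this as the bare command sequence only.

```shell
#!/bin/sh
# Sketch only: prints the command sequence instead of executing it.

# $1 = device-mapper device name, $2 = new table line to load.
switch_table_cmds() {
    echo "dmsetup suspend $1"
    echo "dmsetup reload $1 --table '$2'"
    echo "dmsetup resume $1"
}

# Same table as in the original question, with the feature argument
# changed from writeback to writethrough.
switch_table_cmds md3-cached \
    "0 7813531648 cache /dev/md2p1 /dev/md2p2 /dev/md3 512 1 writethrough default 0"
```

The reload step places the new table in the device's inactive slot, and the resume step makes it live, so I/O is held briefly rather than the device being removed.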
> Is there any support or integration with SSD TRIM for the cache device? Not
> necessarily in real-time, as that can degrade performance, but occasionally
> in batch ala fstrim for filesystems, to get dm-cache to TRIM all of the not
> in use blocks at that time in order to optimize the SSD garbage collector?
dm-cache both passes down trim messages, and keeps track of discarded
origin blocks in its metadata to avoid redundant io on promotion/demotion.
> If you have read this far, thank you very much :), I'm sorry for such a long
> message, but I'm trying to wrap my head around this and be sure I have a
> good understanding before using it.
No problem, yell if you need more help.
- Joe
* Re: dm-cache questions
From: Paul B. Henson @ 2013-12-10 21:04 UTC (permalink / raw)
To: 'device-mapper development'
> From: Joe Thornber
> Sent: Tuesday, December 10, 2013 1:50 AM
>
> I've just found a serious bug that causes metadata space to be used up
> too quickly. So hold off until I get a patch together later this week.
Ouch. Life on the bleeding edge ;). Do fixes like this get backported to
3.10LTS, or just to the latest mainline/stable kernel? One nice thing about
dm-cache is that it layers transparently on top of your origin device, so it
can be dropped in at any time. I've been kind of stalled moving forward
waiting to sort out caching, but I think I can go ahead and get everything
set up and then just drop dm-cache in on top of it later. I guess the other
side of that coin is you have to be careful to make sure nothing accesses
the origin device directly rather than through the cache once it's set up,
as that would be a sure recipe for corruption...
> The metadata is stored in btrees, damaging a high level node in this
> btree can lose an awful lot of mappings. So I recommend mirroring it.
If the metadata device is an SSD, does there end up being a lot of write
amplification due to the small updates being made to the metadata?
> spindle. Your volume manager (eg, LVM2) should be doing this
> transparently for you.
Any idea when dm-cache support is going to show up in lvm2? Reviewing the
last few months of mailing list traffic on that list, I don't see any
mention of it. I found a message dating back to March 2013 that said "within
a few months", but perhaps that was optimistic :). Do you think lvm2 would
recognize and be able to manage a "handmade" dm-cache once it is released,
or would you need to clean out the cache, remove it, and then re-create it
with lvm2?
> Personally I'd use linear, since it allows you to resize easily.
After looking at it again, it is simpler than it appeared on first
impression.
> Use the normal suspend, reload, resume cycle.
Ah, ok, I thought suspend removed the device, but it just freezes I/O and
pauses applications until you bring it back.
> dm-cache both passes down trim messages, and keeps track of discarded
> origin blocks in its metadata to avoid redundant io on
> promotion/demotion.
So if your origin device supports trim, it will see them, but what about the
cache device, which is only being used by dm-cache? Does dm-cache
occasionally trim the cache device to free unused blocks?
> No problem, yell if you need more help.
Cool, thanks much.
* Re: dm-cache questions
From: Alasdair G Kergon @ 2013-12-11 1:03 UTC (permalink / raw)
To: Paul B. Henson; +Cc: 'device-mapper development'
On Tue, Dec 10, 2013 at 01:04:58PM -0800, Paul B. Henson wrote:
> Ouch. Life on the bleeding edge ;). Do fixes like this get backported to
> 3.10LTS, or just to the latest mainline/stable kernel?
We send them to mainline and stable then it is up to the distributions
concerned to choose to apply them. Sometimes it helps to give the
distribution(s) concerned a gentle prod and ask if they've noticed the
patch(es) and if/when they'll be taking them.
> Any idea when dm-cache support is going to show up in lvm2?
Various aspects of the design are being debated. Expect to see progress
to start to appear on lvm-devel over the next few weeks. We're
hopeful we'll have something that's beginning to be useful in January.
Alasdair
* Re: dm-cache questions
From: Paul B. Henson @ 2013-12-11 2:08 UTC (permalink / raw)
To: 'Alasdair G Kergon'; +Cc: 'device-mapper development'
> From: Alasdair G Kergon [mailto:agk@redhat.com]
> Sent: Tuesday, December 10, 2013 5:04 PM
>
> We send them to mainline and stable then it is up to the distributions
> concerned to choose to apply them.
I meant more whether they'd get applied to the LTS kernel upstream; eg, show
up in a minor patch to 3.10 from kernel.org.
> Various aspects of the design are being debated. Expect to see progress
> to start to appear on lvm-devel over the next few weeks. We're
> hopeful we'll have something that's beginning to be useful in January.
Cool, thanks for the info. I'll start lurking on that list and see what
happens :).
* Re: dm-cache questions
From: Mike Snitzer @ 2013-12-11 15:06 UTC (permalink / raw)
To: Paul B. Henson
Cc: 'device-mapper development', 'Alasdair G Kergon'
On Tue, Dec 10 2013 at 9:08pm -0500,
Paul B. Henson <henson@acm.org> wrote:
> > From: Alasdair G Kergon [mailto:agk@redhat.com]
> > Sent: Tuesday, December 10, 2013 5:04 PM
> >
> > We send them to mainline and stable then it is up to the distributions
> > concerned to choose to apply them.
>
> I meant more whether they'd get applied to the LTS kernel upstream; eg, show
> up in a minor patch to 3.10 from kernel.org.
3.10 is actively maintained by gregkh as a "longterm" stable kernel so
all relevant upstream commits should make their way into that tree:
http://www.linuxfoundation.org/news-media/blogs/browse/2013/08/longterm-kernel-310
* Re: dm-cache questions
From: Paul B. Henson @ 2013-12-12 2:14 UTC (permalink / raw)
To: 'Mike Snitzer'; +Cc: 'device-mapper development'
> From: Mike Snitzer [mailto:snitzer@redhat.com]
> Sent: Wednesday, December 11, 2013 7:07 AM
>
> 3.10 is actively maintained by gregkh as a "longterm" stable kernel so
> all relevant upstream commits should make their way into that tree:
Right, but as dm-cache changes over time, with new features or other major
changes being made relative to the version shipped in 3.10, presumably some
bug fixes that might get committed to mainline would not apply cleanly to
3.10 without some potentially non-negligible backporting effort? I don't
think gregkh does that himself? So if there was a major bug fix that ideally
would go back to LTS but couldn't be simply cherry picked, one of the
dm-cache devs would need to submit a separate commit for 3.10? Or perhaps I
misunderstand the process.
* Re: dm-cache questions
From: Mike Snitzer @ 2013-12-13 14:20 UTC (permalink / raw)
To: Paul B. Henson; +Cc: 'device-mapper development'
On Wed, Dec 11 2013 at 9:14pm -0500,
Paul B. Henson <henson@acm.org> wrote:
> > From: Mike Snitzer [mailto:snitzer@redhat.com]
> > Sent: Wednesday, December 11, 2013 7:07 AM
> >
> > 3.10 is actively maintained by gregkh as a "longterm" stable kernel so
> > all relevant upstream commits should make their way into that tree:
>
> Right, but as dm-cache changes over time, with new features or other major
> changes being made relative to the version shipped in 3.10, presumably some
> bug fixes that might get committed to mainline would not apply cleanly to
> 3.10 without some potentially non-negligible backporting effort? I don't
> think gregkh does that himself? So if there was a major bug fix that ideally
> would go back to LTS but couldn't be simply cherry picked, one of the
> dm-cache devs would need to submit a separate commit for 3.10? Or perhaps I
> misunderstand the process.
We sometimes cater to the quirks of a specific stable release's codebase:
http://www.redhat.com/archives/dm-devel/2013-December/msg00018.html
Or we add notes to stable for guidance on how to resolve small
differences that'd prevent a clean backport (see bottom of patch
header):
http://www.redhat.com/archives/dm-devel/2013-November/msg00200.html
But in general it is the task of other distro vendors to backport stable
fixes to their products.
* Re: dm-cache questions
From: Paul B. Henson @ 2013-12-13 19:44 UTC (permalink / raw)
To: 'device-mapper development'
> From: Mike Snitzer
> Sent: Friday, December 13, 2013 6:20 AM
>
> But in general it is the task of other distro vendors to backport stable
> fixes to their products.
Got it, thanks. I see the recent beta announcement officially confirmed that
RHEL 7 will be based on 3.10, so I'm guessing somebody in your organization
is going to be back porting to that stable release at least :). Although I
suppose that could be in your internal wad-o-patches rather than official
upstream commits.
* Re: dm-cache questions
From: Mike Snitzer @ 2013-12-13 21:21 UTC (permalink / raw)
To: Paul B. Henson; +Cc: 'device-mapper development'
On Fri, Dec 13 2013 at 2:44pm -0500,
Paul B. Henson <henson@acm.org> wrote:
> > From: Mike Snitzer
> > Sent: Friday, December 13, 2013 6:20 AM
> >
> > But in general it is the task of other distro vendors to backport stable
> > fixes to their products.
>
> Got it, thanks. I see the recent beta announcement officially confirmed that
> RHEL 7 will be based on 3.10, so I'm guessing somebody in your organization
> is going to be back porting to that stable release at least :). Although I
> suppose that could be in your internal wad-o-patches rather than official
> upstream commits.
All the latest dm-cache changes will be included in RHEL7. I'm the
lucky guy who brings these changes back.
FYI, the most recent dm-cache fixes should hopefully be available in
linus's tree shortly. Joe fixed the dm cache metadata leak that he
advised you to wait for before using dm-cache in a more serious way.
See:
http://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-linus
* Re: dm-cache questions
From: Paul B. Henson @ 2013-12-14 0:59 UTC (permalink / raw)
To: Mike Snitzer; +Cc: 'device-mapper development'
On Fri, Dec 13, 2013 at 04:21:56PM -0500, Mike Snitzer wrote:
> FYI, the most recent dm-cache fixes should hopefully be available in
> linus's tree shortly. Joe fixed the dm cache metadata leak that he
> advised you to wait for before using dm-cache in a more serious way.
Cool, I'm guessing that's the "dm array: fix a reference counting bug in
shadow_ablock" commit? I'll keep an eye out for that in the 3.12.x
changelogs, thanks...