* I/O block when removing thin device on the same pool
From: Dennis Yang @ 2016-01-20 10:05 UTC
To: device-mapper development

Hi,

I have noticed that I/O requests to one thin device are blocked while
another thin device in the same pool is being deleted. The root cause is
that deleting a thin device eventually calls dm_btree_del(), which is a
slow function and can block. The deleting process therefore has to hold
the pool lock for a very long time while this function deletes the whole
data mapping subtree. Since I/O to the other devices in the same pool
needs to hold the same pool lock to look up, insert, or delete data
mappings, all I/O is blocked until the delete finishes.

For now, I have to discard all the mappings of a thin device before
deleting it to prevent I/O from being blocked. Since these discard
requests not only take a long time to finish but also hurt the pool's
I/O throughput, I am still looking for a better solution.

I think the main problem is still the big pool lock in dm-thin, which
hurts both the scalability and the performance of the pool. Is there any
plan to improve this, or any better fix for the I/O blocking problem?

Any help would be appreciated.

Dennis
* Re: I/O block when removing thin device on the same pool
From: Zdenek Kabelac @ 2016-01-20 11:27 UTC
To: device-mapper development

On 20.1.2016 at 11:05, Dennis Yang wrote:
> [...]
> I think the main problem is still the big pool lock in dm-thin which
> hurts both the scalability and performance of. I am wondering if there
> is any plan on improving this or any better fix for the I/O block
> problem.

Hi

What is your use case?

Could you split the load between several thin-pools?

The current design is not targeted at simultaneously maintaining a very
large number of active thin-volumes within a single thin-pool.

Zdenek
* Re: I/O block when removing thin device on the same pool
From: Dennis Yang @ 2016-01-20 16:17 UTC
To: device-mapper development

Hi,

Thanks for replying. In my use case, I will have several 50TB thin
devices (fewer than 20) with different services running on them. I will
also take hourly read-only snapshots of some of these thin devices, and
keep any single thin device from having more than 1024 snapshots by
deleting the oldest snapshot when necessary. During the deletion of a
snapshot or a thin device, I/O gets blocked, and some of the
latency-sensitive services stop and return an error code.

I am aware that the current design is not suited to putting all the thin
devices on the same pool. However, it seems that this I/O blocking
problem will still exist even when I have only one thin device and a
couple of read-only snapshots of it on the same pool.

Dennis

2016-01-20 19:27 GMT+08:00 Zdenek Kabelac <zkabelac@redhat.com>:
> [...]
> What is your use case.
>
> You may possibly split the load between several thin-pools ?
>
> Current design is not targeted to simultaneously maintain very large number
> of active thin-volumes within a single thin-pool.
>
> Zdenek
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
* Re: I/O block when removing thin device on the same pool
From: Nikolay Borisov @ 2016-01-21 17:33 UTC
To: device-mapper development

On Wed, Jan 20, 2016 at 1:27 PM, Zdenek Kabelac <zkabelac@redhat.com> wrote:
> [...]
> What is your use case.
>
> You may possibly split the load between several thin-pools ?
>
> Current design is not targeted to simultaneously maintain very large number
> of active thin-volumes within a single thin-pool.

Sorry for the off-topic question, but what would constitute a "very
large number" - 100, 1000s?
* Re: I/O block when removing thin device on the same pool
From: Mike Snitzer @ 2016-01-21 19:44 UTC
To: Nikolay Borisov, Dennis Yang; +Cc: device-mapper development

> > On 20.1.2016 at 11:05, Dennis Yang wrote:
> >> [...]
> >> I think the main problem is still the big pool lock in dm-thin which
> >> hurts both the scalability and performance of. I am wondering if there
> >> is any plan on improving this or any better fix for the I/O block
> >> problem.

Just so I'm aware: which kernel are you using?

dm_pool_delete_thin_device() takes pmd->root_lock so yes it is very
coarse-grained; especially when you consider concurrent IO to another
thin device from the same pool will call interfaces, like
dm_thin_find_block(), which also take the same pmd->root_lock.

It should be noted that the discard performance has improved
considerably with the range discard support (which really didn't
stabilize until Linux 4.4) and then even more improvement in Linux 4.5
with commits like:
3d5f6733 ("dm thin metadata: speed up discard of partially mapped volumes")

> On Wed, Jan 20, 2016 at 1:27 PM, Zdenek Kabelac <zkabelac@redhat.com> wrote:
> > Hi
> >
> > What is your use case.
> >
> > You may possibly split the load between several thin-pools ?
> >
> > Current design is not targeted to simultaneously maintain very large number
> > of active thin-volumes within a single thin-pool.

We have some systemic coarse-grained locking for sure (making things
like device deletion vs normal IO problematic) but if we're talking pure
concurrent IO to many thin devices backed by the same thin-pool we
really should perform reasonably well.

On Thu, Jan 21 2016 at 12:33pm -0500,
Nikolay Borisov <n.borisov@siteground.com> wrote:

> Sorry of the offtopic, but what would constitute a "Very large number"
> - 100, 1000s?

TBD really. Like I said above concurrent IO shouldn't hit locks like
the pool metadata lock (pmd->root_lock) _that_ hard. But if that IO is
competing with device discard or delete operations then it'll be a
different story.

As it happens I just attended a meeting that emphasized the requirement
to scale to 100s or even 1000 thin devices within a single pool. So
while there certainly could be painfully pathological locking
bottlenecks that have yet to be exposed I'm fairly confident we'll be
identifying them soon enough.

Any perf-report tool traces that illustrate realized thin-pool
performance bottlenecks are always appreciated.

Mike
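The contention described in the message above can be modelled in user space. What follows is a minimal sketch, not dm-thin code: `PoolMetadata`, `delete_device`, `find_block`, and the `probe` hook are hypothetical stand-ins for the pool metadata, dm_pool_delete_thin_device(), and dm_thin_find_block(), and a plain mutex stands in for pmd->root_lock (described later in the thread as a reader/writer semaphore). The lookup uses a non-blocking acquire only so that the demo can report contention instead of hanging.

```python
import threading


class PoolMetadata:
    """Toy model of a pool whose devices all share one metadata lock."""

    def __init__(self):
        self.root_lock = threading.Lock()  # stand-in for pmd->root_lock
        self.mappings = {}  # device id -> {virtual block: data block}

    def find_block(self, dev, block):
        """Model of a lookup; returns None if the pool lock is busy."""
        if not self.root_lock.acquire(blocking=False):
            return None  # real I/O would sleep here until the lock is free
        try:
            return self.mappings[dev].get(block)
        finally:
            self.root_lock.release()

    def delete_device(self, dev, probe=None):
        """Model of a delete: the whole subtree goes under one lock hold."""
        with self.root_lock:
            for block in list(self.mappings[dev]):
                del self.mappings[dev][block]
                if probe is not None:
                    probe()  # lets the demo attempt I/O in mid-delete
            del self.mappings[dev]


pool = PoolMetadata()
pool.mappings[1] = {b: b for b in range(1000)}  # fully mapped device
pool.mappings[2] = {0: 5000}                    # unrelated device

# Try a lookup on device 2 after every mapping removed from device 1.
blocked = []
pool.delete_device(1, probe=lambda: blocked.append(pool.find_block(2, 0) is None))

print(all(blocked), len(blocked))  # were all mid-delete lookups refused?
print(pool.find_block(2, 0))       # lookup once the delete has finished
```

Every lookup attempted during the delete is refused, even though it targets a different device; the only thing the two devices share is the lock.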
* Re: I/O block when removing thin device on the same pool
From: Lars Ellenberg @ 2016-01-22 13:38 UTC
To: dm-devel

On Thu, Jan 21, 2016 at 02:44:06PM -0500, Mike Snitzer wrote:
> [...]
> Just so I'm aware: which kernel are you using?
>
> dm_pool_delete_thin_device() takes pmd->root_lock so yes it is very
> coarse-grained; especially when you consider concurrent IO to another
> thin device from the same pool will call interfaces, like
> dm_thin_find_block(), which also take the same pmd->root_lock.

We have seen lvremove of thin snapshots sometimes take minutes,
even ~20 minutes before.
So that means blocking IO to other devices in that pool
(e.g. the typically currently in-use "origin") for minutes.

That was, iirc, with ~10 TB origin, mostly allocated,
tens of "rotating" snapshots, 64k chunk size,
and considerable random write change rate on the origin.

I'd like to propose a different approach for lvremove of thin devices
(using "made up terms" instead of the correct device mapper vocabulary,
because I'm lazy):
on lvremove of a thin device, take all the locks you need,
even if that implies blocking IO to other devices,
BUT
then don't do all the "delete" right there while holding those
locks, but convert the device into a "i-am-currently-removing-myself"
target, and release all the locks. That should be fast (enough).

Then this "i-am-currently-removing-myself" target would have its .open()
return some error, so it cannot even be opened anymore (or something
with similar effect), start some kernel thread that does the actual
"wipe" and "unref/unmap" from the tree and all that stuff "in the
background", using much finer granular temporary locking for each
processed region.

If that then takes 20 minutes, someone may still care, but at least it
does not block IO to the other active devices in the pool.

Or is something like this already going on?

Lars Ellenberg
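The proposal above can be sketched in user space as follows. This is a minimal model under stated assumptions, not dm-thin code: the `Pool` class, the `REMOVING` state, the `batch` parameter, and every method name are hypothetical, and a Python thread stands in for the kernel thread. It shows the two halves of the idea: a fast state flip under the lock that makes the device unopenable, and a background worker that drops mappings a batch at a time, releasing the lock between batches.

```python
import itertools
import threading

ACTIVE, REMOVING = "active", "removing"


class Pool:
    """Toy pool: one lock guards the device table and all mappings."""

    def __init__(self):
        self.lock = threading.Lock()
        self.state = {}     # device id -> ACTIVE or REMOVING
        self.mappings = {}  # device id -> {virtual block: data block}

    def create(self, dev, nr_blocks):
        with self.lock:
            self.state[dev] = ACTIVE
            self.mappings[dev] = {b: b for b in range(nr_blocks)}

    def open(self, dev):
        """Refuse to open a device that is missing or being removed."""
        with self.lock:
            if self.state.get(dev) != ACTIVE:
                raise OSError(f"device {dev} is not available")
        return dev

    def remove(self, dev, batch=64):
        """Fast path: flip the state under the lock, then return at once."""
        with self.lock:
            self.state[dev] = REMOVING
        worker = threading.Thread(target=self._reap, args=(dev, batch))
        worker.start()
        return worker

    def _reap(self, dev, batch):
        """Background path: drop at most `batch` mappings per lock hold."""
        while True:
            with self.lock:
                tree = self.mappings[dev]
                for block in list(itertools.islice(tree, batch)):
                    del tree[block]
                if not tree:
                    del self.mappings[dev]
                    del self.state[dev]
                    return
            # The lock is free here, so I/O to other devices can proceed.


pool = Pool()
pool.create(1, 10_000)
pool.create(2, 8)

worker = pool.remove(1)
try:
    pool.open(1)
    opened = True
except OSError:
    opened = False      # the device is unusable as soon as remove() returns

still_usable = pool.open(2)  # other devices wait for one batch at most
worker.join()
```

The sketch leaves out everything to do with crash protection and transactionality, which a later reply in this thread names as the hard part of doing this in the kernel.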
* Re: I/O block when removing thin device on the same pool
From: Zdenek Kabelac @ 2016-01-22 13:58 UTC
To: device-mapper development

On 22.1.2016 at 14:38, Lars Ellenberg wrote:
> [...]
> We have seen lvremove of thin snapshots sometimes minutes,
> even ~20 minutes before.
> So that means blocking IO to other devices in that pool
> (e.g. the typically currently in-use "origin") for minutes.
> [...]
> Or is something like this already going on?

Hi

Please always specify the kernel in use, and if possible retry with the
latest officially released one (e.g. 4.4). There have been a number of
improvements in the speed of discard.

Also, you may try using the thin-pool with '--discards nopassdown'
(or even 'ignore') in case TRIM is a major limiting factor
(note that 'ignore' affects free space in the thin-pool).

Zdenek
* Re: I/O block when removing thin device on the same pool
From: Mike Snitzer @ 2016-01-22 16:07 UTC
To: dm-devel

On Fri, Jan 22 2016 at 8:58am -0500,
Zdenek Kabelac <zkabelac@redhat.com> wrote:

> On 22.1.2016 at 14:38, Lars Ellenberg wrote:
> > [...]
> > If that then takes 20 minutes, someone may still care, but at least it
> > does not block IO to the other active devices in the pool.
> >
> > Or is something like this already going on?

Nothing is going on yet but I'll work with Joe on how to skin this cat.

> Hi
>
> Please always specify kernel in-use.
> Eventually retry with last officially released one (e.g. 4.4)
> There were number of improvements in speed of discard.
>
> Also - you may try to use thin-pool with '--discards nopassdown'
> (or even ignore) in case TRIM is very limiting factor
> (with impacting free space in thin-pool for 'ignore' one)

discard isn't the same as device delete. I guess you're proposing
discarding the thin device with something like blkdiscard before
deleting the device (like someone else in this thread has already tried,
though not clear they were using the latest thinp discard advances).

In any case, these hacks are unfortunate and I'm going to make fixing
this coarse-grained locking a priority.
* Re: I/O block when removing thin device on the same pool
From: Joe Thornber @ 2016-01-22 16:43 UTC
To: device-mapper development

On Fri, Jan 22, 2016 at 02:38:28PM +0100, Lars Ellenberg wrote:
> We have seen lvremove of thin snapshots sometimes minutes,
> even ~20 minutes before.

I did some work on speeding up thin removal in autumn '14; in
particular, aggressively prefetching metadata pages sped up the tree
traversal hugely. Could you confirm you're seeing pauses of this
duration with current kernels please?

Obviously any pause, even a few seconds, is unacceptable. Having a
background kernel worker thread doing the delete, as you describe, is
the way to go. But there are complications to do with
transactionality and crash protection that have prevented me
implementing it. I'll think on it some more now I know it's such a
problem for you.

- Joe
* Re: I/O block when removing thin device on the same pool
From: Dennis Yang @ 2016-01-25 9:13 UTC
To: device-mapper development

Hi,

I have done some experiments with kernel 4.2.8, which I am using in
production right now, and with kernel 4.4 plus commit 3d5f6733 ("dm thin
metadata: speed up discard of partially mapped volumes") for comparison.

All the experiments below were performed with a dm-thin pool (512KB
block size) built with a RAID 0 of two Intel 480GB SSDs as the metadata
device and a zero-target DM device as the data device. The machine is
equipped with an Intel E3-1246 v3 CPU and 16GB of RAM.

To discard all the mappings of a fully-mapped 10TB thin device with
512KB block size,
kernel 4.4 takes 6m57s
kernel 4.2.8 takes 6m49s

To delete a fully-mapped 10TB thin device,
kernel 4.4 takes 48s
kernel 4.2.8 takes 47s

In another experiment, I created an empty thin device and a fully-mapped
10TB thin device. I then started writing to the empty thin device
sequentially with fio before deleting the fully-mapped thin device. The
write requests were blocked for 47~48 seconds, until the deletion
finished, on both kernel 4.2.8 and kernel 4.4.

If we discard all the mappings in parallel with fio instead of deleting
the fully-mapped thin device, write requests are still blocked until all
the discard requests have finished. I think this is because the pool's
deferred list is full of those discard requests, leaving no spare
computation resources for new write requests to the other thin device.
The thinp kworker thread causes 100% CPU utilisation while processing
the discard requests.

Hope this information helps.

Thanks,
Dennis

2016-01-23 0:43 GMT+08:00 Joe Thornber <thornber@redhat.com>:
> [...]
> Could you confirm you're seeing pauses of this
> duration with currently kernels please?
> [...]

--
Dennis Yang
QNAP Systems, Inc.
Skype: qnap.dennis.yang
Email: dennisyang@qnap.com
Tel: (+886)-2-2393-5152 ext. 15018
Address: 13F., No.56, Sec. 1, Xinsheng S. Rd., Zhongzheng Dist., Taipei City, Taiwan
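For scale, the timings in the message above can be turned into per-mapping rates. This is a rough calculation under two assumptions that the message does not state: that "10TB" and "512KB" are binary units (10 TiB and 512 KiB), and that a fully-mapped device has exactly one mapping per pool block.

```python
TIB = 2**40
KIB = 2**10

device_bytes = 10 * TIB   # "10TB", assumed to mean 10 TiB
block_bytes = 512 * KIB   # "512KB", assumed to mean 512 KiB
mappings = device_bytes // block_bytes  # assumed: one mapping per block

# Durations reported in the message above, in seconds.
durations = {
    "discard, kernel 4.4": 6 * 60 + 57,
    "discard, kernel 4.2.8": 6 * 60 + 49,
    "delete, kernel 4.4": 48,
    "delete, kernel 4.2.8": 47,
}

rates = {name: mappings / secs for name, secs in durations.items()}

print(f"{mappings:,} mappings")
for name, rate in rates.items():
    print(f"{name}: about {rate:,.0f} mappings/s")
```

Under these assumptions the device holds about 21 million mappings, so the delete works through roughly 440,000 mappings per second and the full discard roughly 50,000 per second. The delete is the faster of the two by a wide margin, but it is the one that holds the pool lock for its whole duration.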
* Re: I/O block when removing thin device on the same pool
From: Joe Thornber @ 2016-01-26 16:19 UTC
To: device-mapper development

Hi Dennis,

This is indeed useful. Is there any chance you could re-run your test
with this patch applied please?

https://github.com/jthornber/linux-2.6/commit/64197a3802320c7a7359ff4a3e592e2bc5bb73dc

- Joe
* Re: I/O block when removing thin device on the same pool
From: Dennis Yang @ 2016-01-27 4:51 UTC
To: device-mapper development

Hi Joe,

I have applied this patch to kernel 4.4 and got the following results.

To delete a fully-mapped 10TB thin device,
with this patch: takes 48 sec.
without this patch: takes 48 sec.

To read an empty thin device while deleting a fully-mapped 10TB thin
device,
with this patch: I/O throughput drops from 4.6GB/s to 4.3GB/s
without this patch: I/O blocks.

To write to an empty thin device while deleting a fully-mapped 10TB thin
device,
with this patch: I/O throughput drops from 3.2GB/s to below 4MB/s
without this patch: I/O blocks.

Since it looks like write performance still suffers from lock
contention, I made it sleep 100 msec between lock release and reacquire
in commit_decs().

To write to an empty thin device while deleting a fully-mapped 10TB thin
device,
with the sleep in commit_decs(): I/O throughput drops from 3.2GB/s to
2.2GB/s, but the deletion time grows from 48sec to 2m54sec.

The one thing I am curious about is which data structures dm-thin is
trying to protect by holding the pool lock during all those btree
operations. At first, I thought the lock was held to protect the btree
itself. But based on the comments in the source code, I believe the
btree is already protected by the read/write locks in the transaction
manager (dm_tm_read/write_lock). Does this mean that the pool lock is
held only to protect the reference count bitmap/btree?

Thanks,
Dennis

2016-01-27 0:19 GMT+08:00 Joe Thornber <thornber@redhat.com>:
> Hi Dennis,
>
> This is indeed useful. Is there any chance you could re-run your test
> with this patch applied please?
>
> https://github.com/jthornber/linux-2.6/commit/64197a3802320c7a7359ff4a3e592e2bc5bb73dc
>
> - Joe
* Re: I/O block when removing thin device on the same pool
  2016-01-27  4:51 ` Dennis Yang
@ 2016-01-28 10:44   ` Joe Thornber
  2016-01-29 11:01     ` Dennis Yang
  0 siblings, 1 reply; 19+ messages in thread
From: Joe Thornber @ 2016-01-28 10:44 UTC
To: device-mapper development

On Wed, Jan 27, 2016 at 12:51:09PM +0800, Dennis Yang wrote:
> Hi Joe,
>
> I have applied this patch to kernel 4.4 and got the following results.

Thanks for taking the time to do this.

> To delete a fully-mapped 10TB thin device:
>   with this patch, it takes 48 sec.
>   without this patch, it takes 48 sec.
>
> To read an empty thin device while deleting a fully-mapped 10TB thin
> device:
>   with this patch, I/O throughput drops from 4.6GB/s to 4.3GB/s.
>   without this patch, I/O blocks.
>
> To write to an empty thin device while deleting a fully-mapped 10TB thin
> device:
>   with this patch, I/O throughput drops from 3.2GB/s to below 4MB/s.
>   without this patch, I/O blocks.
>
> Since it looks like write performance still suffers from lock
> contention, I made it sleep 100 msec between lock release and reacquire
> in commit_decs().

Well, it's really provisioning or breaking of sharing that's slow, not
writes in general. Rather than adding a sleep it would be worth
playing with the MAX_DECS #define, e.g. try 16 and 8192 and see how
that affects the throughput.

> The one thing I am curious about is which data structures dm-thin tries
> to protect by holding the pool lock during all those btree operations. At
> first, I thought the lock was held to protect the btree itself. But based on
> the comments in the source code, I believe that it is already
> protected by the read/write lock in the transaction manager
> (dm_tm_read/write_lock). Does this mean that the pool lock is held only to
> protect the reference count bitmap/btree?

You're correct. I wrote all of persistent data using a rolling lock
scheme that allows updates to the btrees to occur in parallel. But I
didn't extend this to cover the superblock or the in-core version of the
superblock. So the top-level rw semaphore is protecting this in-core
sb. I've spent a couple of days this week experimenting with
switching over to using the rolling lock scheme properly. The changes
are extensive:

- Strip out the root_lock rw_sem.

- Introduce a transaction_lock.

  Every metadata op needs to hold a rw sem in 'read' mode (even if
  it's doing an update), except commit, which would hold it in write
  mode; this forces all the threads to synchronise whenever we
  commit.

- Change the block manager to allow us to 'lock' abstract things,
  like the in-core sb. This is really just introducing another
  namespace to the bm. Pretty easy.

- Change the interfaces to persistent data structures like btree,
  array etc. to take one of these locks representing the
  superblock. Audit to make sure this lock is released early to
  support the rolling lock scheme.

- Add locking/protection to the space maps (which have some in-core
  data structs).

Given this long list of changes I don't think it's worth the risk. So
I'd rather use the patch I posted earlier.

- Joe
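The proposed transaction_lock can be pictured with a small userspace toy model. This is only a sketch under stated assumptions, not dm-thin code: the class `TransactionLock` and the helpers `metadata_op` and `commit` are hypothetical names invented for illustration, and the lock is a plain readers-writer lock built on Python's `threading.Condition` rather than the kernel's rw semaphore. It shows the one property described above: any number of metadata operations may hold the lock in 'read' mode at once (even if they update metadata), while commit needs 'write' mode and so must wait until all of them have finished.

```python
import threading


class TransactionLock:
    """Toy readers-writer lock (hypothetical; not the kernel rw_semaphore)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # metadata ops currently in flight
        self._writer = False   # True while a commit holds the lock

    def acquire_read(self, blocking=True):
        with self._cond:
            while self._writer:
                if not blocking:
                    return False
                self._cond.wait()
            self._readers += 1
            return True

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self, blocking=True):
        with self._cond:
            while self._writer or self._readers:
                if not blocking:
                    return False
                self._cond.wait()
            self._writer = True
            return True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


def metadata_op(lock, log, name):
    """Any metadata op, including an update, takes the lock in 'read' mode."""
    lock.acquire_read()
    try:
        log.append(name)
    finally:
        lock.release_read()


def commit(lock, log):
    """Commit takes 'write' mode, so it waits for all in-flight ops."""
    lock.acquire_write()
    try:
        log.append("commit")
    finally:
        lock.release_write()
```

This toy lock makes no attempt at fairness, so a steady stream of readers could starve a commit; a real implementation would have to deal with that.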
* Re: I/O block when removing thin device on the same pool
  2016-01-28 10:44 ` Joe Thornber
@ 2016-01-29 11:01   ` Dennis Yang
  2016-01-29 16:05     ` Joe Thornber
  0 siblings, 1 reply; 19+ messages in thread
From: Dennis Yang @ 2016-01-29 11:01 UTC
To: device-mapper development

Hi,

2016-01-28 18:44 GMT+08:00 Joe Thornber <thornber@redhat.com>:
> On Wed, Jan 27, 2016 at 12:51:09PM +0800, Dennis Yang wrote:
> > To write to an empty thin device while deleting a fully-mapped 10TB thin
> > device:
> >   with this patch, I/O throughput drops from 3.2GB/s to below 4MB/s.
> >   without this patch, I/O blocks.
> >
> > Since it looks like write performance still suffers from lock
> > contention, I made it sleep 100 msec between lock release and
> > reacquire in commit_decs().
>
> Well, it's really provisioning or breaking of sharing that's slow, not
> writes in general. Rather than adding a sleep it would be worth
> playing with the MAX_DECS #define, e.g. try 16 and 8192 and see how
> that affects the throughput.

I tried defining MAX_DECS as 1, 16, and 8192, and here is the
throughput I got:

  #define MAX_DECS 1:    throughput drops from 3.2GB/s to around 800 ~ 950 MB/s.
  #define MAX_DECS 16:   throughput drops from 3.2GB/s to around 150 ~ 400 MB/s.
  #define MAX_DECS 8192: I/O blocks until deletion is done.

These throughput figures were gathered by writing to a newly created thin
device, which means lots of provisioning takes place. So it seems that the
finer-grained the locking here, the higher the throughput. Is there any
concern if I set MAX_DECS to 1 for production?

Thanks again for your help.

Dennis
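The effect of MAX_DECS reported above can be illustrated with a toy model. This is a sketch under loud assumptions, not the code from the patch: it assumes that the delete path applies reference-count decrements in batches of at most MAX_DECS and releases the pool lock between batches, as the mention of lock release and reacquire in commit_decs() suggests. The function `delete_in_batches` and its parameters are hypothetical names. It counts how often the lock is taken and how much work is done during the longest single hold, which is the window in which other I/O has to wait.

```python
import threading


def delete_in_batches(pool_lock, total_decs, max_decs):
    """Model a delete that drops the pool lock every max_decs decrements.

    Returns (acquisitions, longest_hold), where longest_hold is the
    largest number of decrements done under one lock hold.
    """
    acquisitions = 0
    longest_hold = 0
    remaining = total_decs
    while remaining > 0:
        batch = min(max_decs, remaining)
        with pool_lock:                  # other I/O waits during this hold
            acquisitions += 1
            longest_hold = max(longest_hold, batch)
            remaining -= batch           # stand-in for the real decrements
        # Lock released here: pending I/O gets a chance to run.
    return acquisitions, longest_hold
```

In this model a smaller batch size gives more, shorter holds, which is consistent with the higher write throughput measured for MAX_DECS 1. The model says nothing about the cost of the extra lock traffic, so it does not answer whether a value of 1 is safe for production.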
* Re: I/O block when removing thin device on the same pool
  2016-01-29 11:01 ` Dennis Yang
@ 2016-01-29 16:05   ` Joe Thornber
  2016-02-01  3:52     ` Dennis Yang
  0 siblings, 1 reply; 19+ messages in thread
From: Joe Thornber @ 2016-01-29 16:05 UTC
To: Dennis Yang; +Cc: device-mapper development

On Fri, Jan 29, 2016 at 07:01:44PM +0800, Dennis Yang wrote:
> I tried defining MAX_DECS as 1, 16, and 8192, and here is the
> throughput I got:
>
>   #define MAX_DECS 1:    throughput drops from 3.2GB/s to around 800 ~ 950 MB/s.
>   #define MAX_DECS 16:   throughput drops from 3.2GB/s to around 150 ~ 400 MB/s.
>   #define MAX_DECS 8192: I/O blocks until deletion is done.
>
> These throughput figures were gathered by writing to a newly created thin
> device, which means lots of provisioning takes place. So it seems that the
> finer-grained the locking here, the higher the throughput. Is there any
> concern if I set MAX_DECS to 1 for production?

Does the time taken to remove the thin device change as you drop it to one?

- Joe
* Re: I/O block when removing thin device on the same pool
  2016-01-29 16:05 ` Joe Thornber
@ 2016-02-01  3:52   ` Dennis Yang
  0 siblings, 0 replies; 19+ messages in thread
From: Dennis Yang @ 2016-02-01 3:52 UTC
To: device-mapper development

Hi,

2016-01-30 0:05 GMT+08:00 Joe Thornber <thornber@redhat.com>:
> On Fri, Jan 29, 2016 at 07:01:44PM +0800, Dennis Yang wrote:
> > Is there any concern if I set MAX_DECS to 1 for production?
>
> Does the time taken to remove the thin device change as you drop it to one?
>
> - Joe

Not that I am aware of, but I redid the experiment and the results are
listed below.

#define MAX_DECS 1
  Deleting a fully-mapped 10TB device without concurrent I/O takes 49 secs.
  Deleting a fully-mapped 10TB device with concurrent I/O to the pool takes 44 secs.

#define MAX_DECS 16
  Deleting a fully-mapped 10TB device without concurrent I/O takes 47 secs.
  Deleting a fully-mapped 10TB device with concurrent I/O to the pool takes 46 secs.

#define MAX_DECS 8192
  Deleting a fully-mapped 10TB device without concurrent I/O takes 47 secs.
  Deleting a fully-mapped 10TB device with concurrent I/O to the pool takes 50 secs.

Thanks,
Dennis
* Re: I/O block when removing thin device on the same pool
  2016-01-22 16:43 ` Joe Thornber
  2016-01-25  9:13   ` Dennis Yang
@ 2016-01-29 14:50   ` Lars Ellenberg
  2016-01-29 16:04     ` Joe Thornber
  1 sibling, 1 reply; 19+ messages in thread
From: Lars Ellenberg @ 2016-01-29 14:50 UTC
To: device-mapper development

On Fri, Jan 22, 2016 at 04:43:46PM +0000, Joe Thornber wrote:
> On Fri, Jan 22, 2016 at 02:38:28PM +0100, Lars Ellenberg wrote:
> > We have seen lvremove of thin snapshots sometimes take minutes,
> > even ~20 minutes before.
>
> I did some work on speeding up thin removal in autumn '14; in
> particular, aggressively prefetching metadata pages sped up the tree
> traversal hugely. Could you confirm you're seeing pauses of this
> duration with current kernels please?

There is
https://bugzilla.redhat.com/show_bug.cgi?id=990583
Bug 990583 - lvremove of thin snapshots takes 5 to 20 minutes (single
core cpu bound?)

From August 2013, closed by you in October 2015
as "not a bug", also pointing to meta data prefetch.

Now, you tell me how prefetching meta data (doing disk IO
more efficiently) helps with something that is clearly CPU bound
(eating 100% single core CPU traversing whatever)...

The reason I mention this bug again here is:
there should be an lvm thin meta data dump in there,
which you could use for benchmarking improvements yourself.

> Obviously any pause, even a few seconds, is unacceptable. Having a
> background kernel worker thread doing the delete, as you describe, is
> the way to go. But there are complications to do with
> transactionality and crash protection that have prevented me
> implementing it. I'll think on it some more now I know it's such a
> problem for you.
>
> - Joe

Thanks,
Lars
* Re: I/O block when removing thin device on the same pool
  2016-01-29 14:50 ` Lars Ellenberg
@ 2016-01-29 16:04   ` Joe Thornber
  2016-02-01 17:40     ` Lars Ellenberg
  0 siblings, 1 reply; 19+ messages in thread
From: Joe Thornber @ 2016-01-29 16:04 UTC
To: Lars Ellenberg; +Cc: device-mapper development

On Fri, Jan 29, 2016 at 03:50:31PM +0100, Lars Ellenberg wrote:
> There is
> https://bugzilla.redhat.com/show_bug.cgi?id=990583
> Bug 990583 - lvremove of thin snapshots takes 5 to 20 minutes (single
> core cpu bound?)
>
> From August 2013, closed by you in October 2015
> as "not a bug", also pointing to meta data prefetch.
>
> Now, you tell me how prefetching meta data (doing disk IO
> more efficiently) helps with something that is clearly CPU bound
> (eating 100% single core CPU traversing whatever)...
>
> The reason I mention this bug again here is:
> there should be an lvm thin meta data dump in there,
> which you could use for benchmarking improvements yourself.

There is no metadata dump attached to that bug. I do benchmark stuff
myself, and I found prefetching to make a big difference (obviously
I'm not cpu bound like you). We all have different hardware, which is
why I ask people with more real world scenarios to test stuff separately.
* Re: I/O block when removing thin device on the same pool
  2016-01-29 16:04 ` Joe Thornber
@ 2016-02-01 17:40   ` Lars Ellenberg
  0 siblings, 0 replies; 19+ messages in thread
From: Lars Ellenberg @ 2016-02-01 17:40 UTC
To: device-mapper development

On Fri, Jan 29, 2016 at 04:04:21PM +0000, Joe Thornber wrote:
> On Fri, Jan 29, 2016 at 03:50:31PM +0100, Lars Ellenberg wrote:
> > The reason I mention this bug again here is:
> > there should be an lvm thin meta data dump in there,
> > which you could use for benchmarking improvements yourself.
>
> There is no metadata dump attached to that bug.

Hm. Then we must have communicated via side channels (irc, uploads)
back then; I'm pretty sure we uploaded it somewhere.

> I do benchmark stuff
> myself, and I found prefetching to make a big difference (obviously
> I'm not cpu bound like you). We all have different hardware, which is
> why I ask people with more real world scenarios to test stuff separately.

Thank you for the suggestions (and for re-opening that bug).
I'll have someone follow up in the bugzilla
as soon as we have something to report.

I just checked: we still have some similar setups running regularly,
and according to the log files of the one system I looked at,
snapshot removals now typically take between 2.5 and 4 minutes,
when they used to take 15 to 20 minutes most of the time.
But at first glance I can see no correlation between kernel upgrades
or other changes and the reduced removal times,
so this may simply be a change of access pattern on the origin :-/

I can still get you meta dumps, if you are interested,
or maybe have the guys try some things.
I would be communicating via a number of hops,
so no "direct access lab setup" right now.

Let me know what data you would be most interested in,
and I'll try to "make it happen" and relay it to you.

Cheers,
Lars
end of thread, other threads:[~2016-02-01 17:40 UTC | newest]

Thread overview: 19+ messages
2016-01-20 10:05 I/O block when removing thin device on the same pool Dennis Yang
2016-01-20 11:27 ` Zdenek Kabelac
2016-01-20 16:17 ` Dennis Yang
2016-01-21 17:33 ` Nikolay Borisov
2016-01-21 19:44 ` Mike Snitzer
2016-01-22 13:38 ` Lars Ellenberg
2016-01-22 13:58 ` Zdenek Kabelac
2016-01-22 16:07 ` Mike Snitzer
2016-01-22 16:43 ` Joe Thornber
2016-01-25  9:13 ` Dennis Yang
2016-01-26 16:19 ` Joe Thornber
2016-01-27  4:51 ` Dennis Yang
2016-01-28 10:44 ` Joe Thornber
2016-01-29 11:01 ` Dennis Yang
2016-01-29 16:05 ` Joe Thornber
2016-02-01  3:52 ` Dennis Yang
2016-01-29 14:50 ` Lars Ellenberg
2016-01-29 16:04 ` Joe Thornber
2016-02-01 17:40 ` Lars Ellenberg