* safe to defrag XFS on live system?
From: Travis Rhoden @ 2012-09-14 15:51 UTC
To: ceph-devel

Hello folks,

On a running Ceph cluster using XFS for the OSD's, is it safe to defrag the OSD devices while the system is live?

I did a quick check of one device:

xfs_db -c frag -r /dev/sdd
actual 637596, ideal 144935, fragmentation factor 77.27%

I've only been running on these particular machines for a couple of weeks. I am thinking of putting in a cron task that defrags disks every week or as needed (on a rolling schedule).

While I'm talking about XFS... I know that RBD's use a default object size of 4MB. I've stuck with that so far.. Would it be beneficial to mount XFS with -o allocsize=4M ? What is the object size that gets used for non-RBD cases -- i.e. just dumping objects into data pool?

Thanks,

 - Travis
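The 77.27% in the xfs_db output above is consistent with computing (actual - ideal) / actual from the two extent counts. The following is a minimal Python sketch of that arithmetic, assuming that formula; the function name `fragmentation_factor` is made up for illustration and is not part of xfs_db.

```python
def fragmentation_factor(actual_extents: int, ideal_extents: int) -> float:
    """Return fragmentation as a percentage.

    Assumption: the factor is (actual - ideal) / actual, which reproduces
    the figure quoted in the message above.
    """
    if actual_extents <= 0:
        # Nothing allocated, so nothing can be fragmented.
        return 0.0
    return 100.0 * (actual_extents - ideal_extents) / actual_extents


# Extent counts taken from the xfs_db output in the message above.
factor = fragmentation_factor(637596, 144935)
print(f"fragmentation factor {factor:.2f}%")
```

Read this way, the number says that roughly three quarters of the extents are in excess of the ideal layout, or about 4.4 times as many extents as ideal. It does not by itself say how much performance is being lost.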
* Re: safe to defrag XFS on live system?
From: Tommi Virtanen @ 2012-09-14 17:06 UTC
To: Travis Rhoden; +Cc: ceph-devel

On Fri, Sep 14, 2012 at 8:51 AM, Travis Rhoden <trhoden@gmail.com> wrote:
> On a running Ceph cluster using XFS for the OSD's, is it safe to
> defrag the OSD devices while the system is live?
>
> I did a quick check of one device:
>
> xfs_db -c frag -r /dev/sdd
> actual 637596, ideal 144935, fragmentation factor 77.27%

If it's safe to defrag xfs while it's mounted in general, it's safe to do it when an OSD is running. Xfs either keeps its promises as a filesystem, or doesn't.

How that affects performance is another question..

> While I'm talking about XFS... I know that RBD's use a default object
> size of 4MB. I've stuck with that so far.. Would it be beneficial to
> mount XFS with -o allocsize=4M ? What is the object size that gets
> used for non-RBD cases -- i.e. just dumping objects into data pool?

Don't know about -o allocsize -- benchmark it!

Objects are the size they are; Ceph does not dictate any size. RBD and CephFS both stripe a thing (image/file) over multiple objects, at a constant size; you already know that, RBD defaults to 4MB. Other users of RADOS create objects of any size they please, and an OSD stores those as files in the underlying filesystem.
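To make the striping remark above concrete, here is a minimal Python sketch of how a byte offset in an image could map onto fixed-size objects. It assumes the simplest possible layout, where each 4MB chunk of the image is exactly one object; the names `locate` and `objects_touched` are hypothetical and are not Ceph APIs, and real striping layouts can be more involved.

```python
from typing import Tuple

OBJECT_SIZE = 4 * 1024 * 1024  # 4MB, the RBD default mentioned above


def locate(offset: int, object_size: int = OBJECT_SIZE) -> Tuple[int, int]:
    """Map a byte offset to (object index, offset inside that object).

    Assumption: one object per consecutive object_size chunk of the image.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def objects_touched(offset: int, length: int,
                    object_size: int = OBJECT_SIZE) -> range:
    """Return the indices of the objects a write of `length` bytes touches."""
    if length <= 0:
        return range(0)
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return range(first, last + 1)


# A 2MB write starting 3MB into the image straddles the first two objects.
print(list(objects_touched(3 * 1024 * 1024, 2 * 1024 * 1024)))
```

Under this assumption a large sequential write turns into a series of writes to separate files on the OSDs, each at most 4MB, which is why the thread asks whether XFS allocation behaviour for files of that size matters.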
* Re: safe to defrag XFS on live system?
From: Josh Durgin @ 2012-09-14 17:15 UTC
To: Tommi Virtanen; +Cc: Travis Rhoden, ceph-devel

On 09/14/2012 10:06 AM, Tommi Virtanen wrote:
> Objects are the size they are; Ceph does not dictate any size. RBD and
> CephFS both stripe a thing (image/file) over multiple objects, at a
> constant size; you already know that, RBD defaults to 4MB. Other users
> of RADOS create objects of any size they please, and an OSD stores
> those as files in the underlying filesystem.

Also keep in mind that objects can be sparse - the 4MB stripe size doesn't mean the full 4MB are used by an object in CephFS or RBD.
* Re: safe to defrag XFS on live system?
From: Nick Couchman @ 2012-09-14 17:15 UTC
To: Travis Rhoden, Tommi Virtanen; +Cc: ceph-devel

>> While I'm talking about XFS... I know that RBD's use a default object
>> size of 4MB. I've stuck with that so far.. Would it be beneficial to
>> mount XFS with -o allocsize=4M ? What is the object size that gets
>> used for non-RBD cases -- i.e. just dumping objects into data pool?
>
> Don't know about -o allocsize -- benchmark it!

...and let us know what you come up with! I'm also using XFS for the underlying filesystem on which CEPH runs (and using RBD), and would be really interested to know if changing the alloc size improves performance!

-Nick

--------

This e-mail may contain confidential and privileged material for the sole use of the intended recipient. If this email is not intended for you, or you are not responsible for the delivery of this message to the intended recipient, please note that this message may contain SEAKR Engineering (SEAKR) Privileged/Proprietary Information. In such a case, you are strictly prohibited from downloading, photocopying, distributing or otherwise using this message, its contents or attachments in any way. If you have received this message in error, please notify us immediately by replying to this e-mail and delete the message from your mailbox. Information contained in this message that does not relate to the business of SEAKR is neither endorsed by nor attributable to SEAKR.
* Re: safe to defrag XFS on live system?
From: Travis Rhoden @ 2012-09-14 17:22 UTC
To: Nick Couchman; +Cc: Tommi Virtanen, ceph-devel

> If it's safe to defrag xfs while it's mounted in general, it's safe to
> do it when an OSD is running. Xfs either keeps its promises as a
> filesystem, or doesn't.

That was my expectation. Thanks for the feedback. Just wanted to confirm.

Also, I will report back on -o allocsize. Probably some time next week. Got a few other things to take care of first.

Thanks again!

 - Travis
* Re: safe to defrag XFS on live system?
From: Mark Nelson @ 2012-09-14 18:49 UTC
To: Nick Couchman; +Cc: Travis Rhoden, Tommi Virtanen, ceph-devel

On 09/14/2012 12:15 PM, Nick Couchman wrote:
>>> While I'm talking about XFS... I know that RBD's use a default object
>>> size of 4MB. I've stuck with that so far.. Would it be beneficial to
>>> mount XFS with -o allocsize=4M ? What is the object size that gets
>>> used for non-RBD cases -- i.e. just dumping objects into data pool?
>>
>> Don't know about -o allocsize -- benchmark it!
>
> ...and let us know what you come up with! I'm also using XFS for the underlying filesystem on which CEPH runs (and using RBD), and would be really interested to know if changing the alloc size improves performance!
>
> -Nick

Hi Guys,

There was a change in 2.6.38 to the way that speculative preallocation works that basically lets small writes behave like allocsize is not set, and large writes behave like a large one is set:

http://permalink.gmane.org/gmane.comp.file-systems.xfs.general/38403

Having said that, I had my test gear all ready to go so I decided to give it a try:

Setup:

- 1 node
- 6 OSDs with 7200rpm data disks.
- Journals on 2 Intel 520 SSDs (3 per SSD)
- LSI SAS2008 Controller (9211-8i)
- Network: Localhost
- Ceph 0.50
- Ubuntu 12.04
- Kernel 3.4
- XFS mkfs options: -f -i size=2048
- Common XFS mount options: -o noatime
- No replication
- 8 concurrent rados bench instances.
- 32 concurrent 4MB ops per instance (256 concurrent ops total)

Without allocsize=4M:

781.454MB/s

With allocsize=4M:

453.335MB/s

I'm guessing that it's perhaps slower as we've told XFS to optimize for large files, but the metadata in /meta is very small, and we were already getting benefits from the new speculative preallocation patches that were introduced in 2.6.38 to combat fragmentation of the 4MB objects.

Mark
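For a sense of scale, the two throughput figures above can be compared directly. This is a small Python sketch of that arithmetic only; the helper name `percent_change` is made up for illustration, and the numbers come from the single run reported in the message, so the result should be read as indicative rather than as a general property of allocsize.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Return the relative change from baseline to candidate, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (candidate - baseline) / baseline


# Aggregate rados bench throughput reported above, in MB/s.
without_allocsize = 781.454
with_allocsize = 453.335

change = percent_change(without_allocsize, with_allocsize)
print(f"allocsize=4M changed throughput by {change:.1f}%")

# Concurrency used in the test: 8 instances with 32 ops each.
print(f"concurrent ops: {8 * 32}")
```

In this run, mounting with allocsize=4M cost roughly two fifths of the aggregate write throughput compared with the default mount options.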
* Re: safe to defrag XFS on live system?
From: Nick Couchman @ 2012-09-14 18:56 UTC
To: Mark Nelson; +Cc: Travis Rhoden, Tommi Virtanen, ceph-devel

Interesting, thanks for the results, Mark. So, I guess don't tune unless you have a very good reason to do so? Or, if you're really going to try to squeeze all the performance possible, put your metadata on a separate FS with a different alloc size (or no alloc size specified) so that metadata access isn't adversely impacted by trying to tune data access?

-Nick
* Re: safe to defrag XFS on live system?
From: Mark Nelson @ 2012-09-14 19:50 UTC
To: Nick Couchman; +Cc: Travis Rhoden, Tommi Virtanen, ceph-devel

On 09/14/2012 01:56 PM, Nick Couchman wrote:
> Interesting, thanks for the results, Mark. So, I guess don't tune unless you have a very good reason to do so? Or, if you're really going to try to squeeze all the performance possible, put your metadata on a separate FS with a different alloc size (or no alloc size specified) so that metadata access isn't adversely impacted by trying to tune data access?
>
> -Nick

Well, the XFS guys certainly suggest default tuning in most cases... :)

http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E

I think there is value in investigating things when you suspect a problem though!

We've tried putting the meta directory on alternate partitions (note: this isn't a good idea with btrfs). It hasn't really done much in some of the tests we've done, but we weren't looking at testing this specific scenario.

I think the bigger question is, what problem are you trying to solve? Are you noticing lots of fragmentation? Slow performance with 4MB writes? slow performance with small IO?

Mark
* Re: safe to defrag XFS on live system?
From: Nick Couchman @ 2012-09-14 20:01 UTC
To: Mark Nelson; +Cc: Travis Rhoden, Tommi Virtanen, ceph-devel

> Well, the XFS guys certainly suggest default tuning in most cases... :)
>
> http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E
>
> I think there is value in investigating things when you suspect a
> problem though!
>
> We've tried putting the meta directory on alternate partitions (note:
> this isn't a good idea with btrfs). It hasn't really done much in some
> of the tests we've done, but we weren't looking at testing this specific
> scenario.
>
> I think the bigger question is, what problem are you trying to solve?
> Are you noticing lots of fragmentation? Slow performance with 4MB
> writes? slow performance with small IO?

Agreed...and, since there's not really a problem I'm addressing at this point, sticking with the defaults is my best bet. I was merely curious as to how to get the maximum performance. If I see problems, maybe I'll dig into it a bit, but, at this point, there's no reason to mess with the defaults, at least in my scenarios.

-Nick
* Re: safe to defrag XFS on live system?
From: Travis Rhoden @ 2012-09-14 18:56 UTC
To: Mark Nelson; +Cc: Nick Couchman, Tommi Virtanen, ceph-devel

Mark,

That's pretty definitive! Thanks for doing that test so quickly, and the link to the mailing list discussions. Sounds like -o allocsize isn't really useful these days. I guess those are the consequences of looking at slightly older blog posts about performance tuning (on my part)...

 - Travis