From: "Jim Schutt"
Subject: Re: CephFS Space Accounting and Quotas
Date: Wed, 6 Mar 2013 16:14:42 -0700
Message-ID: <5137CDE2.70105@sandia.gov>
References: <51363490.4070408@42on.com>
 <1F15E079964848B9BE079E974A1946B4@inktank.com>
 <51363B30.7080006@42on.com>
 <513793FD.7010001@sandia.gov>
 <340852C7DC4E472A9D6EA3E0AEDE6EB0@inktank.com>
 <51379FCD.9000502@sandia.gov>
 <5137B4E9.1030505@sandia.gov>
 <1856965A675D4D5D971D5DC9AB01C657@inktank.com>
In-Reply-To: <1856965A675D4D5D971D5DC9AB01C657@inktank.com>
Sender: ceph-devel-owner@vger.kernel.org
To: Greg Farnum
Cc: ceph-devel@vger.kernel.org, Sage Weil, Wido den Hollander

On 03/06/2013 02:39 PM, Greg Farnum wrote:
> On Wednesday, March 6, 2013 at 1:28 PM, Jim Schutt wrote:
>> On 03/06/2013 01:21 PM, Greg Farnum wrote:
>>>>> Also, this issue of stat on files created on other clients seems
>>>>> like it's going to be problematic for many interactions our users
>>>>> will have with the files created by their parallel compute jobs -
>>>>> any suggestion on how to avoid or fix it?
>>>>
>>>
>>> Brief background: stat is required to provide file size information,
>>> and so when you do a stat Ceph needs to find out the actual file
>>> size. If the file is currently in use by somebody, that requires
>>> gathering up the latest metadata from them. Separately, while Ceph
>>> allows a client and the MDS to proceed with a bunch of operations
>>> (ie, mknod) without having it go to disk first, it requires anything
>>> which is visible to a third party (another client) be durable on disk
>>> for consistency reasons.
>>>
>>> These combine to mean that if you do a stat on a file which a client
>>> currently has buffered writes for, that buffer must be flushed out to
>>> disk before the stat can return. This is the usual cause of the slow
>>> stats you're seeing. You should be able to adjust dirty data
>>> thresholds to encourage faster writeouts, do fsyncs once a client is
>>> done with a file, etc in order to minimize the likelihood of running
>>> into this. Also, I'd have to check but I believe opening a file with
>>> LAZY_IO or whatever will weaken those requirements - it's probably
>>> not the solution you'd like here but it's an option, and if this
>>> turns out to be a serious issue then config options to reduce
>>> consistency on certain operations are likely to make their way into
>>> the roadmap. :)
>>
>> That all makes sense.
>>
>> But, it turns out the files in question were written yesterday,
>> and I did the stat operations today.
>>
>> So, shouldn't the dirty buffer issue not be in play here?
> Probably not. :/
>
>> Is there anything else that might be going on?
> In that case it sounds like either there's a slowdown on disk access
> that is propagating up the chain very bizarrely, there's a serious
> performance issue on the MDS (ie, swapping for everything), or the
> clients are still holding onto capabilities for the files in question
> and you're running into some issues with the capability revocation
> mechanisms.
> Can you describe your setup a bit more? What versions are you
> running, kernel or userspace clients, etc. What config options are
> you setting on the MDS? Assuming you're on something semi-recent,
> getting a perfcounter dump from the MDS might be illuminating as
> well.

When I'm doing these stat operations the file system is otherwise
idle. What is happening is that once one of these slow stat
operations on a file completes, it never happens again for that
file, from any client.
At least, that's the case if I'm not writing to the file any more.
I haven't checked if appending to the files restarts the behavior.

On the client side I'm running with 3.8.2 + the ceph patch queue
that was merged into 3.9-rc1.

On the server side I'm running a recent next branch (commit
0f42eddef5), with the TCP receive socket buffer option patches
cherry-picked. I've also got a patch that allows mkcephfs to use
osd_pool_default_pg_num rather than pg_bits to set the initial
number of PGs (same for pgp_num), and a patch that lets me run with
just one pool that contains both data and metadata. I'm testing
data distribution uniformity with 512K PGs.

My MDS tunables are all at default settings.

>
> We'll probably want to get a high-debug log of the MDS during these
> slow stats as well.

OK.

Do you want me to try to reproduce with a more standard setup?

Also, I see Sage just pushed a patch to pgid decoding - I expect
I need that as well, if I'm running the latest client code.

Do you want the MDS log at 10 or 20?

-- Jim

> -Greg