[Ocfs2-devel] Re: [PATCH] Dynamic lockres hash table


* [Ocfs2-devel] Re: [PATCH] Dynamic lockres hash table
       [not found] <20080304102448.GB24335@duck.suse.cz>
@ 2008-03-04 18:33 ` Sunil Mushran
  2008-03-05 10:40   ` Mark Fasheh
  2008-03-05 18:26   ` [Ocfs2-devel] " Jan Kara
  0 siblings, 2 replies; 9+ messages in thread
From: Sunil Mushran @ 2008-03-04 18:33 UTC (permalink / raw)
  To: ocfs2-devel

My main problem with a mount option is that it is not dynamic.

I was thinking along the lines of having a sysfs param that will
allow users to dynamically resize the number of pages allotted
to the hash. This will definitely require us running tests to see
how long it takes to rehash with 500K lockres under the
dlm_spinlock.
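
For illustration, the rehash whose cost is in question can be sketched in userspace C. This is a minimal sketch assuming a plain chained table; `struct res`, `struct table`, and the `table_*` helpers are hypothetical stand-ins, not o2dlm code. The relinking loop in `table_resize()` is the part that, in a kernel version, would presumably have to run under the spinlock, which is why its duration with 500K entries matters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a lock resource; not the real o2dlm structure. */
struct res {
	unsigned int hash;	/* precomputed hash of the lock name */
	struct res *next;	/* next entry in the same bucket chain */
};

struct table {
	struct res **buckets;
	size_t nbuckets;
};

/* Allocate nbuckets empty chains; returns 0 on success, -1 on failure. */
int table_init(struct table *t, size_t nbuckets)
{
	assert(nbuckets > 0);
	t->buckets = calloc(nbuckets, sizeof(*t->buckets));
	if (!t->buckets)
		return -1;
	t->nbuckets = nbuckets;
	return 0;
}

/* Link r at the head of the chain selected by its hash. */
void table_insert(struct table *t, struct res *r)
{
	struct res **head = &t->buckets[r->hash % t->nbuckets];

	r->next = *head;
	*head = r;
}

/* Number of entries chained in one bucket. */
size_t table_chain_len(const struct table *t, size_t bucket)
{
	size_t n = 0;
	const struct res *r;

	for (r = t->buckets[bucket]; r; r = r->next)
		n++;
	return n;
}

/*
 * Move every entry into a new array of new_nbuckets chains.  The new array
 * is allocated before anything is moved, so a failed allocation leaves the
 * old table untouched.  The loop visits each entry exactly once.
 */
int table_resize(struct table *t, size_t new_nbuckets)
{
	struct res **nb;
	size_t i;

	assert(new_nbuckets > 0);
	nb = calloc(new_nbuckets, sizeof(*nb));
	if (!nb)
		return -1;
	for (i = 0; i < t->nbuckets; i++) {
		struct res *r = t->buckets[i];

		while (r) {
			struct res *next = r->next;
			struct res **head = &nb[r->hash % new_nbuckets];

			r->next = *head;
			*head = r;
			r = next;
		}
	}
	free(t->buckets);
	t->buckets = nb;
	t->nbuckets = new_nbuckets;
	return 0;
}
```

Because the stored hash is reused, no lock name is rehashed during the move; the work is one pointer relink per entry.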

I guess as a first step, we should add an average lookup time stat.

But all this will take time.

How about we increase the defaults in 1.4 from 4 pages to 16 or
even 32 pages. This will be for Enterprise Kernels only and we
should be able to assume that they will have 128K per mount to
spare.
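
The page counts above translate into memory and bucket counts as follows. This is a small sketch; the helper names are hypothetical, and the 4 KB page size and 8-byte bucket head (one pointer on a 64-bit architecture) are assumptions consistent with the "2048 entries on 64-bit archs" figure in the patch description.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes used by a hash table made of `pages` pages of `page_size` bytes. */
size_t hash_table_bytes(size_t pages, size_t page_size)
{
	return pages * page_size;
}

/* Buckets in that table when each bucket head occupies `head_size` bytes. */
size_t hash_table_buckets(size_t pages, size_t page_size, size_t head_size)
{
	assert(head_size > 0);
	return pages * (page_size / head_size);
}
```

Under those assumptions, 4 pages give 16 KB and 2048 buckets, while 32 pages give 128 KB and 16384 buckets.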

Comments?

Sunil

Jan Kara wrote:
>   Hello,
>
>   because SLES10 SP2 is closer than I thought, I've written the patch to
> dynamically size the hash table with locks in DLM. First, there's a new mount
> option hash_buckets which allows you to set the number of hash buckets
> explicitly. Then there is also code which tries to estimate a reasonable
> hash size when mounting the filesystem - what I put there is:
>  1) we estimate the number of possible files as device_size / max(64KB,
> 4*cluster_size) - this is used as the number of buckets (number of locks
> we need to store in memory is roughly twice the number of cached files in
> memory).
>  2) we never take more than 1/2048 of total ram
>  2) we never take more than 1/2048 of total ram
>
>   If you think the estimates should be different, please speak up.
>
> 									Honza
>
>   
> ------------------------------------------------------------------------
>
> From: Jan Kara <jack@suse.cz>
> Subject: Allow setting of size of lockres hash
>
> Hash table with cluster locks had a fixed size of 2048 entries on 64-bit archs.
> This is too few when used for a larger filesystem. Add the possibility to set
> the size of the hash table as a mount option and also introduce some better
> estimation on the needed table size.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
>
> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmapi.h
> ===================================================================
> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/dlm/dlmapi.h
> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmapi.h
> @@ -193,7 +193,8 @@ enum dlm_status dlmunlock(struct dlm_ctx
>  			  dlm_astunlockfunc_t *unlockast,
>  			  void *data);
>  
> -struct dlm_ctxt * dlm_register_domain(const char *domain, u32 key);
> +struct dlm_ctxt * dlm_register_domain(const char *domain, u32 key,
> +	unsigned int buckets);
>  
>  void dlm_unregister_domain(struct dlm_ctxt *dlm);
>  
> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmcommon.h
> ===================================================================
> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/dlm/dlmcommon.h
> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmcommon.h
> @@ -37,14 +37,8 @@
>  #define DLM_THREAD_SHUFFLE_INTERVAL    5     // flush everything every 5 passes
>  #define DLM_THREAD_MS                  200   // flush at least every 200 ms
>  
> -#define DLM_HASH_SIZE_DEFAULT	(1 << 14)
> -#if DLM_HASH_SIZE_DEFAULT < PAGE_SIZE
> -# define DLM_HASH_PAGES		1
> -#else
> -# define DLM_HASH_PAGES		(DLM_HASH_SIZE_DEFAULT / PAGE_SIZE)
> -#endif
> +#define DLM_DEFAULT_HASH_BUCKETS (1 << 14)
>  #define DLM_BUCKETS_PER_PAGE	(PAGE_SIZE / sizeof(struct hlist_head))
> -#define DLM_HASH_BUCKETS	(DLM_HASH_PAGES * DLM_BUCKETS_PER_PAGE)
>  
>  /* Intended to make it easier for us to switch out hash functions */
>  #define dlm_lockid_hash(_n, _l) full_name_hash(_n, _l)
> @@ -96,6 +90,7 @@ enum dlm_ctxt_state {
>  struct dlm_ctxt
>  {
>  	struct list_head list;
> +	unsigned int lockres_hash_buckets;
>  	struct hlist_head **lockres_hash;
>  	struct list_head dirty_list;
>  	struct list_head purge_list;
> @@ -148,7 +143,7 @@ struct dlm_ctxt
>  
>  static inline struct hlist_head *dlm_lockres_hash(struct dlm_ctxt *dlm, unsigned i)
>  {
> -	return dlm->lockres_hash[(i / DLM_BUCKETS_PER_PAGE) % DLM_HASH_PAGES] + (i % DLM_BUCKETS_PER_PAGE);
> +	return dlm->lockres_hash[(i % dlm->lockres_hash_buckets) / DLM_BUCKETS_PER_PAGE] + (i % DLM_BUCKETS_PER_PAGE);
>  }
>  
>  /* these keventd work queue items are for less-frequently
> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmdebug.c
> ===================================================================
> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/dlm/dlmdebug.c
> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmdebug.c
> @@ -381,7 +381,7 @@ void dlm_dump_lock_resources(struct dlm_
>  	}
>  
>  	spin_lock(&dlm->spinlock);
> -	for (i=0; i<DLM_HASH_BUCKETS; i++) {
> +	for (i=0; i<dlm->lockres_hash_buckets; i++) {
>  		bucket = dlm_lockres_hash(dlm, i);
>  		hlist_for_each_entry(res, iter, bucket, hash_node)
>  			dlm_print_one_lock_resource(res);
> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmdomain.c
> ===================================================================
> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/dlm/dlmdomain.c
> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmdomain.c
> @@ -98,9 +98,8 @@ static void **dlm_alloc_pagevec(int page
>  		if (!(vec[i] = (void *)__get_free_page(GFP_KERNEL)))
>  			goto out_free;
>  
> -	mlog(0, "Allocated DLM hash pagevec; %d pages (%lu expected), %lu buckets per page\n",
> -	     pages, (unsigned long)DLM_HASH_PAGES,
> -	     (unsigned long)DLM_BUCKETS_PER_PAGE);
> +	mlog(0, "Allocated DLM hash pagevec; %d pages, %lu buckets per page\n",
> +	     pages, (unsigned long)DLM_BUCKETS_PER_PAGE);
>  	return vec;
>  out_free:
>  	dlm_free_pagevec(vec, i);
> @@ -289,7 +288,8 @@ static void dlm_free_ctxt_mem(struct dlm
>  	dlm_proc_del_domain(dlm);
>  
>  	if (dlm->lockres_hash)
> -		dlm_free_pagevec((void **)dlm->lockres_hash, DLM_HASH_PAGES);
> +		dlm_free_pagevec((void **)dlm->lockres_hash,
> +			dlm->lockres_hash_buckets / DLM_BUCKETS_PER_PAGE);
>  
>  	if (dlm->name)
>  		kfree(dlm->name);
> @@ -412,7 +412,7 @@ static int dlm_migrate_all_locks(struct 
>  
>  	num = 0;
>  	spin_lock(&dlm->spinlock);
> -	for (i = 0; i < DLM_HASH_BUCKETS; i++) {
> +	for (i = 0; i < dlm->lockres_hash_buckets; i++) {
>  redo_bucket:
>  		n = 0;
>  		bucket = dlm_lockres_hash(dlm, i);
> @@ -1360,8 +1360,8 @@ bail:
>  	return status;
>  }
>  
> -static struct dlm_ctxt *dlm_alloc_ctxt(const char *domain,
> -				u32 key)
> +static struct dlm_ctxt *dlm_alloc_ctxt(const char *domain, u32 key,
> +				unsigned int buckets)
>  {
>  	int i;
>  	struct dlm_ctxt *dlm = NULL;
> @@ -1380,7 +1380,14 @@ static struct dlm_ctxt *dlm_alloc_ctxt(c
>  		goto leave;
>  	}
>  
> -	dlm->lockres_hash = (struct hlist_head **)dlm_alloc_pagevec(DLM_HASH_PAGES);
> +	if (!buckets)
> +		buckets = DLM_DEFAULT_HASH_BUCKETS;
> +	buckets = (buckets + DLM_BUCKETS_PER_PAGE - 1) / DLM_BUCKETS_PER_PAGE
> +		  * DLM_BUCKETS_PER_PAGE;
> +	dlm->lockres_hash_buckets = buckets;
> +
> +	dlm->lockres_hash = (struct hlist_head **)dlm_alloc_pagevec(buckets
> +				/ DLM_BUCKETS_PER_PAGE);
>  	if (!dlm->lockres_hash) {
>  		mlog_errno(-ENOMEM);
>  		kfree(dlm->name);
> @@ -1389,7 +1396,7 @@ static struct dlm_ctxt *dlm_alloc_ctxt(c
>  		goto leave;
>  	}
>  
> -	for (i = 0; i < DLM_HASH_BUCKETS; i++)
> +	for (i = 0; i < dlm->lockres_hash_buckets; i++)
>  		INIT_HLIST_HEAD(dlm_lockres_hash(dlm, i));
>  
>  	strcpy(dlm->name, domain);
> @@ -1458,8 +1465,8 @@ leave:
>  /*
>   * dlm_register_domain: one-time setup per "domain"
>   */
> -struct dlm_ctxt * dlm_register_domain(const char *domain,
> -			       u32 key)
> +struct dlm_ctxt * dlm_register_domain(const char *domain, u32 key,
> +			unsigned int buckets)
>  {
>  	int ret;
>  	struct dlm_ctxt *dlm = NULL;
> @@ -1515,7 +1522,7 @@ retry:
>  	if (!new_ctxt) {
>  		spin_unlock(&dlm_domain_lock);
>  
> -		new_ctxt = dlm_alloc_ctxt(domain, key);
> +		new_ctxt = dlm_alloc_ctxt(domain, key, buckets);
>  		if (new_ctxt)
>  			goto retry;
>  
> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmrecovery.c
> ===================================================================
> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/dlm/dlmrecovery.c
> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmrecovery.c
> @@ -2020,7 +2020,7 @@ static void dlm_finish_local_lockres_rec
>  	 * for now we need to run the whole hash, clear
>  	 * the RECOVERING state and set the owner
>  	 * if necessary */
> -	for (i = 0; i < DLM_HASH_BUCKETS; i++) {
> +	for (i = 0; i < dlm->lockres_hash_buckets; i++) {
>  		bucket = dlm_lockres_hash(dlm, i);
>  		hlist_for_each_entry(res, hash_iter, bucket, hash_node) {
>  			if (res->state & DLM_LOCK_RES_RECOVERING) {
> @@ -2201,7 +2201,7 @@ static void dlm_do_local_recovery_cleanu
>  	 *    can be kicked again to see if any ASTs or BASTs
>  	 *    need to be fired as a result.
>  	 */
> -	for (i = 0; i < DLM_HASH_BUCKETS; i++) {
> +	for (i = 0; i < dlm->lockres_hash_buckets; i++) {
>  		bucket = dlm_lockres_hash(dlm, i);
>  		hlist_for_each_entry(res, iter, bucket, hash_node) {
>   			/* always prune any $RECOVERY entries for dead nodes,
> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/userdlm.c
> ===================================================================
> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/dlm/userdlm.c
> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/userdlm.c
> @@ -661,7 +661,7 @@ struct dlm_ctxt *user_dlm_register_conte
>  
>  	snprintf(domain, name->len + 1, "%.*s", name->len, name->name);
>  
> -	dlm = dlm_register_domain(domain, dlm_key);
> +	dlm = dlm_register_domain(domain, dlm_key, 0);
>  	if (IS_ERR(dlm))
>  		mlog_errno(PTR_ERR(dlm));
>  
> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlmglue.c
> ===================================================================
> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/dlmglue.c
> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlmglue.c
> @@ -2514,7 +2514,8 @@ int ocfs2_dlm_init(struct ocfs2_super *o
>  	dlm_key = crc32_le(0, osb->uuid_str, strlen(osb->uuid_str));
>  
>  	/* for now, uuid == domain */
> -	dlm = dlm_register_domain(osb->uuid_str, dlm_key);
> +	dlm = dlm_register_domain(osb->uuid_str, dlm_key,
> +			osb->dlm_hash_buckets);
>  	if (IS_ERR(dlm)) {
>  		status = PTR_ERR(dlm);
>  		mlog_errno(status);
> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/ocfs2.h
> ===================================================================
> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/ocfs2.h
> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/ocfs2.h
> @@ -218,6 +218,7 @@ struct ocfs2_super
>  
>  	unsigned long s_mount_opt;
>  	unsigned int s_atime_quantum;
> +	unsigned int dlm_hash_buckets;
>  
>  	u16 max_slots;
>  	s16 node_num;
> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/super.c
> ===================================================================
> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/super.c
> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/super.c
> @@ -40,6 +40,7 @@
>  #include <linux/crc32.h>
>  #include <linux/debugfs.h>
>  #include <linux/mount.h>
> +#include <linux/mm.h>
>  
>  #include <cluster/nodemanager.h>
>  
> @@ -88,6 +89,7 @@ struct mount_options
>  	unsigned int	atime_quantum;
>  	signed short	slot;
>  	unsigned int	localalloc_opt;
> +	unsigned int	dlm_hash_buckets;
>  };
>  
>  static int ocfs2_parse_options(struct super_block *sb, char *options,
> @@ -169,6 +171,7 @@ enum {
>  	Opt_commit,
>  	Opt_localalloc,
>  	Opt_localflocks,
> +	Opt_dlm_hash_buckets,
>  #ifdef OCFS2_ORACORE_WORKAROUNDS
>  	Opt_datavolume,
>  #endif
> @@ -190,6 +193,7 @@ static match_table_t tokens = {
>  	{Opt_commit, "commit=%u"},
>  	{Opt_localalloc, "localalloc=%d"},
>  	{Opt_localflocks, "localflocks"},
> +	{Opt_dlm_hash_buckets, "hash_buckets=%u"},
>  #ifdef OCFS2_ORACORE_WORKAROUNDS
>  	{Opt_datavolume, "datavolume"},
>  #endif
> @@ -633,6 +637,22 @@ static int ocfs2_fill_super(struct super
>  	osb->preferred_slot = parsed_options.slot;
>  	osb->osb_commit_interval = parsed_options.commit_interval;
>  	osb->local_alloc_size = parsed_options.localalloc_opt;
> +	if (parsed_options.dlm_hash_buckets)
> +		osb->dlm_hash_buckets = parsed_options.dlm_hash_buckets;
> +	else {
> +		/* Let's count 4 clusters per file, 64 KB at least */
> +		unsigned int exp_file_size_shift =
> +				max(16, osb->s_clustersize_bits + 2);
> +		struct sysinfo i;
> +
> +		si_meminfo(&i);
> +		/* Estimate number of files on FS and limit space used by
> +		 * hash table by 1/2048 of kernel memory */
> +		osb->dlm_hash_buckets = min_t(unsigned long long,
> +			sb->s_bdev->bd_inode->i_size >> exp_file_size_shift,
> +			(i.totalram >> 11) * (PAGE_SIZE /
> +					sizeof(struct hlist_head)));
> +	}
>  
>  #ifdef OCFS2_ORACORE_WORKAROUNDS
>  	if (osb->s_mount_opt & OCFS2_MOUNT_COMPAT_OCFS)
> @@ -807,6 +827,7 @@ static int ocfs2_parse_options(struct su
>  	mopt->atime_quantum = OCFS2_DEFAULT_ATIME_QUANTUM;
>  	mopt->slot = OCFS2_INVALID_SLOT;
>  	mopt->localalloc_opt = OCFS2_DEFAULT_LOCAL_ALLOC_SIZE;
> +	mopt->dlm_hash_buckets = 0;
>  
>  	if (!options) {
>  		status = 1;
> @@ -919,6 +940,19 @@ static int ocfs2_parse_options(struct su
>  			if (!is_remount)
>  				mopt->mount_opt |= OCFS2_MOUNT_LOCALFLOCKS;
>  			break;
> +		case Opt_dlm_hash_buckets:
> +			if (is_remount) {
> +				mlog(ML_ERROR, "Changing number of hash buckets"
> +					" during remount is not supported.\n");
> +				status = 0;
> +				goto bail;
> +			}
> +			if (match_int(&args[0], &option) || option <= 0) {
> +				status = 0;
> +				goto bail;
> +			}
> +			mopt->dlm_hash_buckets = option;
> +			break;
>  		default:
>  			mlog(ML_ERROR,
>  			     "Unrecognized mount option \"%s\" "
>   

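The sizing heuristic in the quoted patch can be sketched in userspace C as below. The function names are hypothetical; the patch does this inline in ocfs2_fill_super() and dlm_alloc_ctxt(). The sketch keeps everything in 64-bit arithmetic, whereas the patch stores the result in an unsigned int.

```c
#include <assert.h>
#include <stdint.h>

/* Round a bucket count up to a whole number of pages. */
uint64_t round_up_buckets(uint64_t buckets, uint64_t per_page)
{
	assert(per_page > 0);
	return (buckets + per_page - 1) / per_page * per_page;
}

/*
 * Estimate the bucket count at mount time:
 *   size estimate = device_bytes / max(64 KB, 4 * cluster_size)
 *   RAM cap       = buckets that fit in 1/2048 of total RAM pages
 * and return the smaller of the two.
 */
uint64_t estimate_buckets(uint64_t device_bytes, unsigned int cluster_bits,
			  uint64_t total_ram_pages, uint64_t per_page)
{
	/* 4 clusters per file, but never less than 64 KB (shift of 16). */
	unsigned int shift = cluster_bits + 2 > 16 ? cluster_bits + 2 : 16;
	uint64_t by_size = device_bytes >> shift;
	uint64_t by_ram = (total_ram_pages >> 11) * per_page;

	return by_size < by_ram ? by_size : by_ram;
}
```

For example, assuming 4 KB pages and 512 buckets per page, a 1 TiB device with 4 KB clusters on a machine with 4 GiB of RAM yields a size estimate of 16777216 buckets, which the RAM cap reduces to 262144 buckets (512 pages, 2 MiB).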

* [Ocfs2-devel] Re: [PATCH] Dynamic lockres hash table
  2008-03-04 18:33 ` [Ocfs2-devel] Re: [PATCH] Dynamic lockres hash table Sunil Mushran
@ 2008-03-05 10:40   ` Mark Fasheh
  2008-03-05 10:47     ` Sunil Mushran
  2008-03-05 11:28     ` Joel Becker
  2008-03-05 18:26   ` [Ocfs2-devel] " Jan Kara
  1 sibling, 2 replies; 9+ messages in thread
From: Mark Fasheh @ 2008-03-05 10:40 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Mar 04, 2008 at 06:33:03PM -0800, Sunil Mushran wrote:
> My main problem with a mount option is that it is not dynamic.
>
> I was thinking along the lines of having a sysfs param that will
> allow users to dynamically resize the number of pages allotted
> to the hash. This will definitely require us running tests to see
> how long it takes to rehash with 500K lockres under the
> dlm_spinlock.

I like the idea of being able to change it on the fly, but I'm wondering
about how useful that ability will be for customers versus just being able
to set it at mount time.


> I guess as a first step, we should add an average lookup time stat.
>
> But all this will take time.
>
> How about we increase the defaults in 1.4 from 4 pages to 16 or
> even 32 pages. This will be for Enterprise Kernels only and we
> should be able to assume that they will have 128K per mount to
> spare.

Please, can we solve this everywhere instead of having some ocfs2 1.4
specific hack.
	--Mark

--
Mark Fasheh
Principal Software Developer, Oracle
mark.fasheh@oracle.com


* [Ocfs2-devel] Re: [PATCH] Dynamic lockres hash table
  2008-03-05 10:40   ` Mark Fasheh
@ 2008-03-05 10:47     ` Sunil Mushran
  2008-03-05 11:28     ` Joel Becker
  1 sibling, 0 replies; 9+ messages in thread
From: Sunil Mushran @ 2008-03-05 10:47 UTC (permalink / raw)
  To: ocfs2-devel

Look at it as a short-term fix, considering SLES is shipping next
month and we won't have the time to do the dynamic resizing.

As far as dynamic resizing goes, if we provide info indicating that
the lookups are slow and that they can be sped up by increasing
the hash... we can not only let the user do it manually but at some
point resize it automatically too.

Mark Fasheh wrote:
> On Tue, Mar 04, 2008 at 06:33:03PM -0800, Sunil Mushran wrote:
>   
>> My main problem with a mount option is that it is not dynamic.
>>
>> I was thinking along the lines of having a sysfs param that will
>> allow users to dynamically resize the number of pages allotted
>> to the hash. This will definitely require us running tests to see
>> how long it takes to rehash with 500K lockres under the
>> dlm_spinlock.
>>     
>
> I like the idea of being able to change it on the fly, but I'm wondering
> about how useful that ability will be for customers versus just being able
> to set it at mount time.
>
>
>   
>> I guess as a first step, we should add an average lookup time stat.
>>
>> But all this will take time.
>>
>> How about we increase the defaults in 1.4 from 4 pages to 16 or
>> even 32 pages. This will be for Enterprise Kernels only and we
>> should be able to assume that they will have 128K per mount to
>> spare.
>>     
>
> Please, can we solve this everywhere instead of having some ocfs2 1.4
> specific hack.
> 	--Mark
>
> --
> Mark Fasheh
> Principal Software Developer, Oracle
> mark.fasheh@oracle.com
>   


* [Ocfs2-devel] Re: [PATCH] Dynamic lockres hash table
  2008-03-05 10:40   ` Mark Fasheh
  2008-03-05 10:47     ` Sunil Mushran
@ 2008-03-05 11:28     ` Joel Becker
  2008-03-05 12:39       ` Mark Fasheh
  1 sibling, 1 reply; 9+ messages in thread
From: Joel Becker @ 2008-03-05 11:28 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Mar 05, 2008 at 10:40:21AM -0800, Mark Fasheh wrote:
> On Tue, Mar 04, 2008 at 06:33:03PM -0800, Sunil Mushran wrote:
> > My main problem with a mount option is that it is not dynamic.
> >
> > I was thinking along the lines of having a sysfs param that will
> > allow users to dynamically resize the number of pages allotted
> > to the hash. This will definitely require us running tests to see
> > how long it takes to rehash with 500K lockres under the
> > dlm_spinlock.
> 
> I like the idea of being able to change it on the fly, but I'm wondering
> about how useful that ability will be for customers versus just being able
> to set it at mount time.

<snip>

> Please, can we solve this everywhere instead of having some ocfs2 1.4
> specific hack.

[Warning, a long email]

	Sunil and I discussed this a bit yesterday, and our basic
thought was that a mount-time option was a hack.  A customer doesn't
want to have to stop everything and remount to get this to work, they
have a live filesystem with live problems they'd like to alleviate.  In
the short-term world of stopgaps, a mount option works sure, but then we
have to support it for a long time, whereas a default size we don't even
have to tell people about is hidden and changeable.  But either way both
are interim solutions.
	We discussed some approaches of varying complexity.  Sunil
suggested hanging an rbtree off of each hash bucket - if you have long
chains, the lookup is now logN.  But that's complex.  I wondered if
maybe we should just remove the hash and do a single rbtree.  Sure, for
small amounts of locks you might degrade the best case, but the worst
case is now ameliorated.  Full disclosure: I suspect that a hash+rbtree
will be faster than the full rbtree - the question is whether the
complexity trade-off is worth it.  
	In the end, though, we really need numbers.  It'd be awesome to
get latencies of lookups and be able to know how each scheme handles
10K locks and 500K locks (current hash, larger (32 page?) hash,
hash + rbtree, single rbtree).  We may be surprised.
	Once we have instrumentation of latencies, though, we can just
go ahead and automate it.  We'll probably find that the current hash is
best for 10K locks and a larger hash is best for 500K locks.  So we
could easily have tunables in sysfs for "min_lock_hash_size",
"max_lock_hash_size", and "latency_threshold_to_grow_hash".  Those are
tunables we could live with for a long time.
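
A latency-driven policy built on those three tunables might look like the following. This is only a sketch of one possible policy: the struct, the function name, the units (pages and nanoseconds), and the choice to double the table on each step are all assumptions, not something settled in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the three proposed sysfs tunables. */
struct hash_tunables {
	size_t min_pages;			/* "min_lock_hash_size" */
	size_t max_pages;			/* "max_lock_hash_size" */
	unsigned long latency_threshold_ns;	/* "latency_threshold_to_grow_hash" */
};

/*
 * Pick the table size (in pages) to use next: double it when the measured
 * average lookup latency exceeds the threshold, then clamp the result to
 * the [min_pages, max_pages] range.
 */
size_t next_hash_pages(const struct hash_tunables *t, size_t cur_pages,
		       unsigned long avg_lookup_ns)
{
	size_t next = cur_pages;

	assert(t->min_pages > 0 && t->min_pages <= t->max_pages);
	if (avg_lookup_ns > t->latency_threshold_ns)
		next = cur_pages * 2;
	if (next < t->min_pages)
		next = t->min_pages;
	if (next > t->max_pages)
		next = t->max_pages;
	return next;
}
```

The function never shrinks the table below its current size unless the maximum forces it to; whether shrinking is worthwhile is a separate question that would need the same latency numbers.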

Joel

-- 

"What do you take me for, an idiot?"  
        - General Charles de Gaulle, when a journalist asked him
          if he was happy.

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* [Ocfs2-devel] Re: [PATCH] Dynamic lockres hash table
  2008-03-05 11:28     ` Joel Becker
@ 2008-03-05 12:39       ` Mark Fasheh
  2008-03-05 13:52         ` Joel Becker
  0 siblings, 1 reply; 9+ messages in thread
From: Mark Fasheh @ 2008-03-05 12:39 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Mar 05, 2008 at 11:27:53AM -0800, Joel Becker wrote:
> On Wed, Mar 05, 2008 at 10:40:21AM -0800, Mark Fasheh wrote:
> > On Tue, Mar 04, 2008 at 06:33:03PM -0800, Sunil Mushran wrote:
> > > My main problem with a mount option is that it is not dynamic.
> > >
> > > I was thinking along the lines of having a sysfs param that will
> > > allow users to dynamically resize the number of pages allotted
> > > to the hash. This will definitely require us running tests to see
> > > how long it takes to rehash with 500K lockres under the
> > > dlm_spinlock.
> > 
> > I like the idea of being able to change it on the fly, but I'm wondering
> > about how useful that ability will be for customers versus just being able
> > to set it at mount time.
> 
> <snip>
> 
> > Please, can we solve this everywhere instead of having some ocfs2 1.4
> > specific hack.
> 
> [Warning, a long email]
> 
> 	Sunil and I discussed this a bit yesterday, and our basic
> thought was that a mount-time option was a hack.  A customer doesn't
> want to have to stop everything and remount to get this to work,

If the dlm supports dynamic resizing, neither approach requires the user to
"stop everything". "mount -oremount" is just as unobtrusive to the running
file system as echoing to a sysfs file.

Btw, if this is the direction we all want to go, can I revert the
"localalloc=" mount option patches before 2.6.25 gets released? It strikes
me as similarly hacky (seriously) and we already have a plan for dynamic
local alloc sizing.


> they have a live filesystem with live problems they'd like to alleviate.

Can you be more specific? How often do we think people will expect to change
this on a live file system? What sort of situations have we run into, or
expect to run into where the user needs to change the hash size, and can't
do it without unmounting the file system first, couldn't have reasonably
anticipated a hash size to begin with, or a dynamically picked default
might fail?

Btw, just so we're all clear - any hash sizing scheme would pretty much
involve information known only to the local node. So we're not talking about
offlining the cluster here - just unmounting a node.


> In the short-term world of stopgaps, a mount option works sure, but then
> we have to support it for a long time, whereas a default size we don't
> even have to tell people about is hidden and changeable. But either way
> both are interim solutions.

In the sense of supporting "ABI", a sysfs file and a mount option are
equally inflexible - look at the business with /sys/o2cb as an example. Once
it's "published", we'll have a hard time taking it away from folks.


> 	We discussed some approaches of varying complexity.  Sunil
> suggested hanging an rbtree off of each hash bucket - if you have long
> chains, the lookup is now logN.  But that's complex.  I wondered if
> maybe we should just remove the hash and do a single rbtree.  Sure, for
> small amounts of locks you might degrade the best case, but the worst
> case is now ameliorated.  Full disclosure: I suspect that a hash+rbtree
> will be faster than the full rbtree - the question is whether the
> complexity trade-off is worth it.  
> 	In the end, though, we really need numbers.  It'd be awesome to
> get latencies of lookups and be able to know how each scheme handles
> 10K locks and 500K locks (current hash, larger (32 page?) hash,
> hash + rbtree, single rbtree).  We may be surprised.
> 	Once we have instrumentation of latencies, though, we can just
> go ahead and automate it.  We'll probably find that the current hash is
> best for 10K locks and a larger hash is best for 500K locks.  So we
> could easily have tunables in sysfs for "min_lock_hash_size",
> "max_lock_hash_size", and "latency_threshold_to_grow_hash".  Those are
> tunables we could live with for a long time.

Ok, ignoring how it's done and who does it, I completely agree that it'd be
neat for the dlm to respond automatically to demand. There's a question in
my mind of how much of our time such a feature is worth though, and honestly
- when we'd actually get around to doing it. o2dlm is a pretty known
quantity at this point, and our direction has moved more towards fsdlm.


Hmm, maybe this would be a nice patch to give them :) They're actually worse
off with respect to hash sizing than any of what we've discussed now -
there's a global sysfs file which governs the default for lockspaces.


To recap, and make sure we're on the same page - AFAICT, the questions being
raised are:

1) Should the default dlm hash size be updated somehow?

2) Should the user be allowed to change the (default?) hash size?
  - If so, how would the user change this?

3) Should the dlm be changed to allow dynamic resizing of the hash?
  - Should it resize automatically depending on workload, or should the user
    initiate such resizing via whichever method is picked in (2)?


Make sense?
	--Mark

--
Mark Fasheh
Principal Software Developer, Oracle
mark.fasheh@oracle.com


* [Ocfs2-devel] Re: [PATCH] Dynamic lockres hash table
  2008-03-05 12:39       ` Mark Fasheh
@ 2008-03-05 13:52         ` Joel Becker
  0 siblings, 0 replies; 9+ messages in thread
From: Joel Becker @ 2008-03-05 13:52 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Mar 05, 2008 at 12:38:37PM -0800, Mark Fasheh wrote:
> On Wed, Mar 05, 2008 at 11:27:53AM -0800, Joel Becker wrote:
> If the dlm supports dynamic resizing, neither approach requires the user to
> "stop everything". "mount -oremount" is just as unobtrusive to the running
> file system as echoing to a sysfs file.

	Oh, sure.  And your point about sysfs files being ABI is also
true.  That's why I wanted to look at a "what would we want this to
look like if we had to keep it a long time" perspective.  To be clear,
I'm not advocating that we have to have some super-awesome solution.  I
was just detailing the discussion Sunil and I had.

> Btw, if this is the direction we all want to go, can I revert the
> "localalloc=" mount option patches before 2.6.25 gets released? It strikes
> me as similarly hacky (seriously) and we already have a plan for dynamic
> local alloc sizing.

	I vote yes.

> Can you be more specific? How often do we think people will expect to change
> this on a live file system?

	I'm coming from Sunil's bug report - I don't have data myself.
Really, anything that allows an online change (remount,sysfs,automatic)
satisfies this.  And maybe it's not necessary to be all fancy.  More
below.

> Btw, just so we're all clear - any hash sizing scheme would pretty much
> involve information known only to the local node. So we're not talking about
> offlining the cluster here - just unmounting a node.

	Yup.

> To recap, and make sure we're on the same page - AFAICT, the questions being
> raised are:
> 
> 1) Should the default dlm hash size be updated somehow?
> 
> 2) Should the user be allowed to change the (default?) hash size?
>   - If so, how would the user change this?
> 
> 3) Should the dlm be changed to allow dynamic resizing of the hash?
>   - Should it resize automatically depending on workload, or should the user
>     initiate such resizing via whichever method is picked in (2)?

	Basically.  I'd say that we clearly have to say 'yes' to (1) -
the 500K lock problem dictates it.  My thoughts to Sunil were:

1) If we can hide the issue from the user, we win.  I don't care if
   that's automatic resizing based on latencies, an rbtree instead of a
   hash, whatever.
2) Otherwise, we need a method for the user to make this modification.
   It would be great if it could be live.  This ABI would become
   something we carry for a while, so we want to think about it a
   little.

Joel

-- 

Life's Little Instruction Book #80

	"Slow dance"

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* [Ocfs2-devel] [PATCH] Dynamic lockres hash table
  2008-03-04 18:33 ` [Ocfs2-devel] Re: [PATCH] Dynamic lockres hash table Sunil Mushran
  2008-03-05 10:40   ` Mark Fasheh
@ 2008-03-05 18:26   ` Jan Kara
  2008-03-05 10:42     ` [Ocfs2-devel] " Sunil Mushran
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Kara @ 2008-03-05 18:26 UTC (permalink / raw)
  To: ocfs2-devel

On Tue 04-03-08 18:33:03, Sunil Mushran wrote:
> My main problem with a mount option is that it is not dynamic.
>
> I was thinking along the lines of having a sysfs param that will
> allow users to dynamically resize the number of pages allotted
> to the hash. This will definitely require us running tests to see
> how long it takes to rehash with 500K lockres under the
> dlm_spinlock.
   I see. I didn't know you intended to do this dynamically. But yes, it's
better than what I have if rehashing will be fast enough.

> I guess as a first step, we should add an average lookup time stat.
  Actually, it's non-trivial to measure (other than by profiling).
You cannot use standard time functions because their resolution is too low
- we are speaking about microseconds here...

> But all this will take time.
>
> How about we increase the defaults in 1.4 from 4 pages to 16 or
> even 32 pages. This will be for Enterprise Kernels only and we
> should be able to assume that they will have 128K per mount to
> spare.
  Definitely, they should have 16 or even 32 pages per mount.  With 500K
lockres, which is not so extreme on a bigger FS, 16 pages means hash chains of
average length 61 on x86_64. It is not ideal but I guess it should be
sufficient.
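
The chain-length figure follows from dividing the number of lock resources by the number of buckets. A small sketch of that arithmetic, with a hypothetical helper name and assuming 512 buckets per page (4 KB pages, 8-byte bucket heads on x86_64):

```c
#include <assert.h>
#include <stddef.h>

/* Average chain length, rounded down, for a table of `pages` pages. */
size_t avg_chain_len(size_t lockres, size_t pages, size_t buckets_per_page)
{
	size_t buckets = pages * buckets_per_page;

	assert(buckets > 0);
	return lockres / buckets;
}
```

With 500000 lock resources this gives 244 for the current 4 pages, 61 for 16 pages, and 30 for 32 pages, assuming the hash spreads names evenly.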

								Honza

> Jan Kara wrote:
>>   Hello,
>>
>>   because SLES10 SP2 is closer than I thought, I've written the patch to
>> dynamically size the hash table with locks in DLM. First, there's a new mount
>> option hash_buckets which allows you to set the number of hash buckets
>> explicitly. Then there is also code which tries to estimate a reasonable
>> hash size when mounting the filesystem - what I put there is:
>>  1) we estimate the number of possible files as device_size / max(64KB,
>> 4*cluster_size) - this is used as the number of buckets (number of locks
>> we need to store in memory is roughly twice the number of cached files in
>> memory).
>>  2) we never take more than 1/2048 of total ram
>>
>>   If you think the estimates should be different, please speak up.
>>
>> 									Honza
>>
>>   ------------------------------------------------------------------------
>>
>> From: Jan Kara <jack@suse.cz>
>> Subject: Allow setting of size of lockres hash
>>
>> Hash table with cluster locks had a fixed size of 2048 entries on 64-bit archs.
>> This is too few when used for a larger filesystem. Add the possibility to set
>> the size of the hash table as a mount option and also introduce some better
>> estimation on the needed table size.
>>
>> Signed-off-by: Jan Kara <jack@suse.cz>
>>
>> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmapi.h
>> ===================================================================
>> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/dlm/dlmapi.h
>> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmapi.h
>> @@ -193,7 +193,8 @@ enum dlm_status dlmunlock(struct dlm_ctx
>>  			  dlm_astunlockfunc_t *unlockast,
>>  			  void *data);
>>  
>> -struct dlm_ctxt * dlm_register_domain(const char *domain, u32 key);
>> +struct dlm_ctxt * dlm_register_domain(const char *domain, u32 key,
>> +	unsigned int buckets);
>>  
>>  void dlm_unregister_domain(struct dlm_ctxt *dlm);
>>  
>> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmcommon.h
>> ===================================================================
>> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/dlm/dlmcommon.h
>> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmcommon.h
>> @@ -37,14 +37,8 @@
>>  #define DLM_THREAD_SHUFFLE_INTERVAL    5     // flush everything every 5 passes
>>  #define DLM_THREAD_MS                  200   // flush at least every 200 ms
>>  
>> -#define DLM_HASH_SIZE_DEFAULT	(1 << 14)
>> -#if DLM_HASH_SIZE_DEFAULT < PAGE_SIZE
>> -# define DLM_HASH_PAGES		1
>> -#else
>> -# define DLM_HASH_PAGES		(DLM_HASH_SIZE_DEFAULT / PAGE_SIZE)
>> -#endif
>> +#define DLM_DEFAULT_HASH_BUCKETS (1 << 14)
>>  #define DLM_BUCKETS_PER_PAGE	(PAGE_SIZE / sizeof(struct hlist_head))
>> -#define DLM_HASH_BUCKETS	(DLM_HASH_PAGES * DLM_BUCKETS_PER_PAGE)
>>  
>>  /* Intended to make it easier for us to switch out hash functions */
>>  #define dlm_lockid_hash(_n, _l) full_name_hash(_n, _l)
>> @@ -96,6 +90,7 @@ enum dlm_ctxt_state {
>>  struct dlm_ctxt
>>  {
>>  	struct list_head list;
>> +	unsigned int lockres_hash_buckets;
>>  	struct hlist_head **lockres_hash;
>>  	struct list_head dirty_list;
>>  	struct list_head purge_list;
>> @@ -148,7 +143,7 @@ struct dlm_ctxt
>>  
>>  static inline struct hlist_head *dlm_lockres_hash(struct dlm_ctxt *dlm, unsigned i)
>>  {
>> -	return dlm->lockres_hash[(i / DLM_BUCKETS_PER_PAGE) % DLM_HASH_PAGES] + (i % DLM_BUCKETS_PER_PAGE);
>> +	return dlm->lockres_hash[(i % dlm->lockres_hash_buckets) / DLM_BUCKETS_PER_PAGE] + (i % DLM_BUCKETS_PER_PAGE);
>>  }
>>  
>>  /* these keventd work queue items are for less-frequently
>> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmdebug.c
>> ===================================================================
>> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/dlm/dlmdebug.c
>> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmdebug.c
>> @@ -381,7 +381,7 @@ void dlm_dump_lock_resources(struct dlm_
>>  	}
>>  
>>  	spin_lock(&dlm->spinlock);
>> -	for (i=0; i<DLM_HASH_BUCKETS; i++) {
>> +	for (i=0; i<dlm->lockres_hash_buckets; i++) {
>>  		bucket = dlm_lockres_hash(dlm, i);
>>  		hlist_for_each_entry(res, iter, bucket, hash_node)
>>  			dlm_print_one_lock_resource(res);
>> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmdomain.c
>> ===================================================================
>> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/dlm/dlmdomain.c
>> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmdomain.c
>> @@ -98,9 +98,8 @@ static void **dlm_alloc_pagevec(int page
>>  		if (!(vec[i] = (void *)__get_free_page(GFP_KERNEL)))
>>  			goto out_free;
>>  
>> -	mlog(0, "Allocated DLM hash pagevec; %d pages (%lu expected), %lu buckets per page\n",
>> -	     pages, (unsigned long)DLM_HASH_PAGES,
>> -	     (unsigned long)DLM_BUCKETS_PER_PAGE);
>> +	mlog(0, "Allocated DLM hash pagevec; %d pages, %lu buckets per page\n",
>> +	     pages, (unsigned long)DLM_BUCKETS_PER_PAGE);
>>  	return vec;
>>  out_free:
>>  	dlm_free_pagevec(vec, i);
>> @@ -289,7 +288,8 @@ static void dlm_free_ctxt_mem(struct dlm
>>  	dlm_proc_del_domain(dlm);
>>  
>>  	if (dlm->lockres_hash)
>> -		dlm_free_pagevec((void **)dlm->lockres_hash, DLM_HASH_PAGES);
>> +		dlm_free_pagevec((void **)dlm->lockres_hash,
>> +			dlm->lockres_hash_buckets / DLM_BUCKETS_PER_PAGE);
>>  
>>  	if (dlm->name)
>>  		kfree(dlm->name);
>> @@ -412,7 +412,7 @@ static int dlm_migrate_all_locks(struct
>>  
>>  	num = 0;
>>  	spin_lock(&dlm->spinlock);
>> -	for (i = 0; i < DLM_HASH_BUCKETS; i++) {
>> +	for (i = 0; i < dlm->lockres_hash_buckets; i++) {
>>  redo_bucket:
>>  		n = 0;
>>  		bucket = dlm_lockres_hash(dlm, i);
>> @@ -1360,8 +1360,8 @@ bail:
>>  	return status;
>>  }
>>  
>> -static struct dlm_ctxt *dlm_alloc_ctxt(const char *domain,
>> -				u32 key)
>> +static struct dlm_ctxt *dlm_alloc_ctxt(const char *domain, u32 key,
>> +				unsigned int buckets)
>>  {
>>  	int i;
>>  	struct dlm_ctxt *dlm = NULL;
>> @@ -1380,7 +1380,14 @@ static struct dlm_ctxt *dlm_alloc_ctxt(c
>>  		goto leave;
>>  	}
>>  
>> -	dlm->lockres_hash = (struct hlist_head **)dlm_alloc_pagevec(DLM_HASH_PAGES);
>> +	if (!buckets)
>> +		buckets = DLM_DEFAULT_HASH_BUCKETS;
>> +	buckets = (buckets + DLM_BUCKETS_PER_PAGE - 1) / DLM_BUCKETS_PER_PAGE
>> +		  * DLM_BUCKETS_PER_PAGE;
>> +	dlm->lockres_hash_buckets = buckets;
>> +
>> +	dlm->lockres_hash = (struct hlist_head **)dlm_alloc_pagevec(buckets
>> +				/ DLM_BUCKETS_PER_PAGE);
>>  	if (!dlm->lockres_hash) {
>>  		mlog_errno(-ENOMEM);
>>  		kfree(dlm->name);
>> @@ -1389,7 +1396,7 @@ static struct dlm_ctxt *dlm_alloc_ctxt(c
>>  		goto leave;
>>  	}
>>  
>> -	for (i = 0; i < DLM_HASH_BUCKETS; i++)
>> +	for (i = 0; i < dlm->lockres_hash_buckets; i++)
>>  		INIT_HLIST_HEAD(dlm_lockres_hash(dlm, i));
>>  
>>  	strcpy(dlm->name, domain);
>> @@ -1458,8 +1465,8 @@ leave:
>>  /*
>>   * dlm_register_domain: one-time setup per "domain"
>>   */
>> -struct dlm_ctxt * dlm_register_domain(const char *domain,
>> -			       u32 key)
>> +struct dlm_ctxt * dlm_register_domain(const char *domain, u32 key,
>> +			unsigned int buckets)
>>  {
>>  	int ret;
>>  	struct dlm_ctxt *dlm = NULL;
>> @@ -1515,7 +1522,7 @@ retry:
>>  	if (!new_ctxt) {
>>  		spin_unlock(&dlm_domain_lock);
>>  
>> -		new_ctxt = dlm_alloc_ctxt(domain, key);
>> +		new_ctxt = dlm_alloc_ctxt(domain, key, buckets);
>>  		if (new_ctxt)
>>  			goto retry;
>>  
>> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmrecovery.c
>> ===================================================================
>> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/dlm/dlmrecovery.c
>> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/dlmrecovery.c
>> @@ -2020,7 +2020,7 @@ static void dlm_finish_local_lockres_rec
>>  	 * for now we need to run the whole hash, clear
>>  	 * the RECOVERING state and set the owner
>>  	 * if necessary */
>> -	for (i = 0; i < DLM_HASH_BUCKETS; i++) {
>> +	for (i = 0; i < dlm->lockres_hash_buckets; i++) {
>>  		bucket = dlm_lockres_hash(dlm, i);
>>  		hlist_for_each_entry(res, hash_iter, bucket, hash_node) {
>>  			if (res->state & DLM_LOCK_RES_RECOVERING) {
>> @@ -2201,7 +2201,7 @@ static void dlm_do_local_recovery_cleanu
>>  	 *    can be kicked again to see if any ASTs or BASTs
>>  	 *    need to be fired as a result.
>>  	 */
>> -	for (i = 0; i < DLM_HASH_BUCKETS; i++) {
>> +	for (i = 0; i < dlm->lockres_hash_buckets; i++) {
>>  		bucket = dlm_lockres_hash(dlm, i);
>>  		hlist_for_each_entry(res, iter, bucket, hash_node) {
>>   			/* always prune any $RECOVERY entries for dead nodes,
>> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/userdlm.c
>> ===================================================================
>> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/dlm/userdlm.c
>> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlm/userdlm.c
>> @@ -661,7 +661,7 @@ struct dlm_ctxt *user_dlm_register_conte
>>  
>>  	snprintf(domain, name->len + 1, "%.*s", name->len, name->name);
>>  
>> -	dlm = dlm_register_domain(domain, dlm_key);
>> +	dlm = dlm_register_domain(domain, dlm_key, 0);
>>  	if (IS_ERR(dlm))
>>  		mlog_errno(PTR_ERR(dlm));
>>  
>> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlmglue.c
>> ===================================================================
>> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/dlmglue.c
>> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/dlmglue.c
>> @@ -2514,7 +2514,8 @@ int ocfs2_dlm_init(struct ocfs2_super *o
>>  	dlm_key = crc32_le(0, osb->uuid_str, strlen(osb->uuid_str));
>>  
>>  	/* for now, uuid == domain */
>> -	dlm = dlm_register_domain(osb->uuid_str, dlm_key);
>> +	dlm = dlm_register_domain(osb->uuid_str, dlm_key,
>> +			osb->dlm_hash_buckets);
>>  	if (IS_ERR(dlm)) {
>>  		status = PTR_ERR(dlm);
>>  		mlog_errno(status);
>> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/ocfs2.h
>> ===================================================================
>> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/ocfs2.h
>> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/ocfs2.h
>> @@ -218,6 +218,7 @@ struct ocfs2_super
>>  
>>  	unsigned long s_mount_opt;
>>  	unsigned int s_atime_quantum;
>> +	unsigned int dlm_hash_buckets;
>>  
>>  	u16 max_slots;
>>  	s16 node_num;
>> Index: linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/super.c
>> ===================================================================
>> --- linux-2.6.16-SLES10_SP2_BRANCH.orig/fs/ocfs2/super.c
>> +++ linux-2.6.16-SLES10_SP2_BRANCH/fs/ocfs2/super.c
>> @@ -40,6 +40,7 @@
>>  #include <linux/crc32.h>
>>  #include <linux/debugfs.h>
>>  #include <linux/mount.h>
>> +#include <linux/mm.h>
>>  
>>  #include <cluster/nodemanager.h>
>>  
>> @@ -88,6 +89,7 @@ struct mount_options
>>  	unsigned int	atime_quantum;
>>  	signed short	slot;
>>  	unsigned int	localalloc_opt;
>> +	unsigned int	dlm_hash_buckets;
>>  };
>>  
>>  static int ocfs2_parse_options(struct super_block *sb, char *options,
>> @@ -169,6 +171,7 @@ enum {
>>  	Opt_commit,
>>  	Opt_localalloc,
>>  	Opt_localflocks,
>> +	Opt_dlm_hash_buckets,
>>  #ifdef OCFS2_ORACORE_WORKAROUNDS
>>  	Opt_datavolume,
>>  #endif
>> @@ -190,6 +193,7 @@ static match_table_t tokens = {
>>  	{Opt_commit, "commit=%u"},
>>  	{Opt_localalloc, "localalloc=%d"},
>>  	{Opt_localflocks, "localflocks"},
>> +	{Opt_dlm_hash_buckets, "hash_buckets=%u"},
>>  #ifdef OCFS2_ORACORE_WORKAROUNDS
>>  	{Opt_datavolume, "datavolume"},
>>  #endif
>> @@ -633,6 +637,22 @@ static int ocfs2_fill_super(struct super
>>  	osb->preferred_slot = parsed_options.slot;
>>  	osb->osb_commit_interval = parsed_options.commit_interval;
>>  	osb->local_alloc_size = parsed_options.localalloc_opt;
>> +	if (parsed_options.dlm_hash_buckets)
>> +		osb->dlm_hash_buckets = parsed_options.dlm_hash_buckets;
>> +	else {
>> +		/* Let's count 4 clusters per file, 64 KB at least */
>> +		unsigned int exp_file_size_shift =
>> +				max(16, osb->s_clustersize_bits + 2);
>> +		struct sysinfo i;
>> +
>> +		si_meminfo(&i);
>> +		/* Estimate number of files on FS and limit space used by
>> +		 * hash table by 1/2048 of kernel memory */
>> +		osb->dlm_hash_buckets = min_t(unsigned long long,
>> +			sb->s_bdev->bd_inode->i_size >> exp_file_size_shift,
>> +			(i.totalram >> 11) * (PAGE_SIZE /
>> +					sizeof(struct hlist_head)));
>> +	}
>>  
>>  #ifdef OCFS2_ORACORE_WORKAROUNDS
>>  	if (osb->s_mount_opt & OCFS2_MOUNT_COMPAT_OCFS)
>> @@ -807,6 +827,7 @@ static int ocfs2_parse_options(struct su
>>  	mopt->atime_quantum = OCFS2_DEFAULT_ATIME_QUANTUM;
>>  	mopt->slot = OCFS2_INVALID_SLOT;
>>  	mopt->localalloc_opt = OCFS2_DEFAULT_LOCAL_ALLOC_SIZE;
>> +	mopt->dlm_hash_buckets = 0;
>>  
>>  	if (!options) {
>>  		status = 1;
>> @@ -919,6 +940,19 @@ static int ocfs2_parse_options(struct su
>>  			if (!is_remount)
>>  				mopt->mount_opt |= OCFS2_MOUNT_LOCALFLOCKS;
>>  			break;
>> +		case Opt_dlm_hash_buckets:
>> +			if (is_remount) {
>> +				mlog(ML_ERROR, "Changing number of hash buckets"
>> +					" during remount is not supported.\n");
>> +				status = 0;
>> +				goto bail;
>> +			}
>> +			if (match_int(&args[0], &option) || option <= 0) {
>> +				status = 0;
>> +				goto bail;
>> +			}
>> +			mopt->dlm_hash_buckets = option;
>> +			break;
>>  		default:
>>  			mlog(ML_ERROR,
>>  			     "Unrecognized mount option \"%s\" "
>>   
>
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* [Ocfs2-devel] Re: [PATCH] Dynamic lockres hash table
  2008-03-05 18:26   ` [Ocfs2-devel] " Jan Kara
@ 2008-03-05 10:42     ` Sunil Mushran
  2008-03-06 11:51       ` [Ocfs2-devel] " Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Sunil Mushran @ 2008-03-05 10:42 UTC (permalink / raw)
  To: ocfs2-devel

Jan Kara wrote:
>   Actually, it's non-trivial to measure (differently than by profiling).
> You cannot use standard time functions because they have too low resolution
> - we are speaking about microseconds here...
>   

jiffies should have enough resolution.

>   Definitely, they should have 16 or even 32 pages per mount.  With 500K
> lockres, which is not so extreme on bigger FS, 16 pages mean hash chains of
> average length 61 on x86_64. It is not ideal but I guess it should be
> sufficient.
>   

Any objections to increasing the default to 32 pages for 1.4?

* [Ocfs2-devel] [PATCH] Dynamic lockres hash table
  2008-03-05 10:42     ` [Ocfs2-devel] " Sunil Mushran
@ 2008-03-06 11:51       ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2008-03-06 11:51 UTC (permalink / raw)
  To: ocfs2-devel

On Wed 05-03-08 10:41:34, Sunil Mushran wrote:
> Jan Kara wrote:
>>   Actually, it's non-trivial to measure (differently than by profiling).
>> You cannot use standard time functions because they have too low
>> resolution - we are speaking about microseconds here...
>>   
> jiffies should have enough resolution.
  Not really. On SLES kernels HZ=250 and thus one jiffy is 4 ms. That's
quite a lot - actually, my lockres statistics show that quite a lot of lock
operations take less than 4 ms (i.e., the time for that acquisition was
counted as 0). I guess they are operations that don't require any network
communication...

>>   Definitely, they should have 16 or even 32 pages per mount.  With 500K
>> lockres, which is not so extreme on bigger FS, 16 pages mean hash chains
>> of average length 61 on x86_64. It is not ideal but I guess it should be
>> sufficient.
>
> Any objections to increasing the default to 32 pages for 1.4?
  32 pages is fine with me.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

Thread overview: 9+ messages
     [not found] <20080304102448.GB24335@duck.suse.cz>
2008-03-04 18:33 ` [Ocfs2-devel] Re: [PATCH] Dynamic lockres hash table Sunil Mushran
2008-03-05 10:40   ` Mark Fasheh
2008-03-05 10:47     ` Sunil Mushran
2008-03-05 11:28     ` Joel Becker
2008-03-05 12:39       ` Mark Fasheh
2008-03-05 13:52         ` Joel Becker
2008-03-05 18:26   ` [Ocfs2-devel] " Jan Kara
2008-03-05 10:42     ` [Ocfs2-devel] " Sunil Mushran
2008-03-06 11:51       ` [Ocfs2-devel] " Jan Kara