[Ocfs2-devel] [SUGGESSTION 1/1] OCFS2: automatic dlm hash table size

* [Ocfs2-devel] [SUGGESSTION 1/1] OCFS2: automatic dlm hash table size
@ 2009-06-08  5:14 wengang wang
  2009-06-08  5:55 ` Tao Ma
  0 siblings, 1 reply; 7+ messages in thread
From: wengang wang @ 2009-06-08  5:14 UTC (permalink / raw)
  To: ocfs2-devel

background:
	ocfs2 dlm uses a hash table to store dlm_lock_resource objects. Lookups on this hash table are performed frequently.

problem:
	When an ocfs2 volume holds a huge number of inodes (and thus a huge number of dlm_lock_resource objects), lookup performance becomes a problem. The lookup holds a spinlock, which can leave all other CPUs spinning while they try to acquire it. If the lookup holds the lock long enough, a hardware watchdog could reboot the box because it is not fed in time (the feeder has no chance to be scheduled).

	Enlarging the hash table is the way to speed up the lookup, but we don't know what a good size is: too small and performance is bad; too large and memory is wasted.

suggestion:
	So I suggest a feature that automatically resizes the dlm_lock_resource hash table, i.e. it grows the hash table according to the number of dlm_lock_resource objects already in it.
	The default (smallest) size is 16 in shift bits. When the number of dlm_lock_resource objects reaches 2,500,000, auto-resizing is triggered and the target size is 17. At 5,000,000 it resizes to 18, at 10,000,000 to 19, and so on, though the exact numbers still need to be discussed.
	With this we use an appropriate amount of memory at runtime and keep lookup performance good enough.
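
The threshold scheme described above can be sketched as a small helper. This is only an illustration: the function name, the constants and the upper bound of 24 bits are assumptions made for the example, not existing ocfs2 code, and the proposal itself leaves the exact numbers open.

```c
#include <stdint.h>

/* Hypothetical constants; the proposal says the numbers are still open. */
#define HASH_SHIFT_MIN   16u          /* default (smallest) size, in shift bits */
#define HASH_SHIFT_MAX   24u          /* assumed upper bound, not from the proposal */
#define FIRST_THRESHOLD  2500000ull   /* lockres count that triggers the first resize */

/*
 * Return the table size (in shift bits) for nr_lockres objects:
 * 16 below 2,500,000, then one more bit each time the count doubles
 * (5,000,000 -> 18, 10,000,000 -> 19, ...), capped at HASH_SHIFT_MAX.
 */
static unsigned int hash_shift_for_count(uint64_t nr_lockres)
{
	unsigned int shift = HASH_SHIFT_MIN;
	uint64_t threshold = FIRST_THRESHOLD;

	while (nr_lockres >= threshold && shift < HASH_SHIFT_MAX) {
		shift++;
		threshold *= 2;
	}
	return shift;
}
```

An insertion path could compare the result with the current shift and start a resize only when the two differ.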

If this sounds good, I'm glad to do it.

thanks,
wengang.


* [Ocfs2-devel] [SUGGESSTION 1/1] OCFS2: automatic dlm hash table size
  2009-06-08  5:14 [Ocfs2-devel] [SUGGESSTION 1/1] OCFS2: automatic dlm hash table size wengang wang
@ 2009-06-08  5:55 ` Tao Ma
  2009-06-08  6:24   ` Wengang Wang
  0 siblings, 1 reply; 7+ messages in thread
From: Tao Ma @ 2009-06-08  5:55 UTC (permalink / raw)
  To: ocfs2-devel

Hi Wengang,

wengang wang wrote:
> background:
> 	ocfs2 dlm uses a hash table to store dlm_lock_resource objects. Lookups on this hash table are performed frequently.
> 
> problem:
> 	When an ocfs2 volume holds a huge number of inodes (and thus a huge number of dlm_lock_resource objects), lookup performance becomes a problem. The lookup holds a spinlock, which can leave all other CPUs spinning while they try to acquire it. If the lookup holds the lock long enough, a hardware watchdog could reboot the box because it is not fed in time (the feeder has no chance to be scheduled).

Why do you think a dlm res lookup can lock up a CPU for so long that it
leads to a hardware watchdog reboot?
I don't object to this. But do you have any test statistics that
demonstrate your suggestion? I think people are more easily convinced
if they see some exciting numbers.

> 
> 	Enlarging the hash table is the way to speed up the lookup, but we don't know what a good size is: too small and performance is bad; too large and memory is wasted.
> 
> suggestion:
> 	So I suggest a feature that automatically resizes the dlm_lock_resource hash table, i.e. it grows the hash table according to the number of dlm_lock_resource objects already in it.
> 	The default (smallest) size is 16 in shift bits. When the number of dlm_lock_resource objects reaches 2,500,000, auto-resizing is triggered and the target size is 17. At 5,000,000 it resizes to 18, at 10,000,000 to 19, and so on, though the exact numbers still need to be discussed.
> 	With this we use an appropriate amount of memory at runtime and keep lookup performance good enough.

So concerning the autosize, have you thought about the rehash process?

I think if you have reached 250,000 dlm entries, the rehash must hold 
the spin lock for quite a long time. And as you said above, if the 
hardware watchdog can even reboot for just one lock's lookup, it surely 
can't wait for your rehash.

Regards,
Tao


* [Ocfs2-devel] [SUGGESSTION 1/1] OCFS2: automatic dlm hash table size
  2009-06-08  5:55 ` Tao Ma
@ 2009-06-08  6:24   ` Wengang Wang
  2009-06-08  6:40     ` Tao Ma
  0 siblings, 1 reply; 7+ messages in thread
From: Wengang Wang @ 2009-06-08  6:24 UTC (permalink / raw)
  To: ocfs2-devel

Hi Tao,

pls check inline.

Tao Ma wrote:
> Hi Wengang,
> 
> Regards,
> Tao
> 
> wengang wang wrote:
>> background:
>>     ocfs2 dlm uses a hash table to store dlm_lock_resource objects.
>> Lookups on this hash table are performed frequently.
>>
>> problem:
>>     When an ocfs2 volume holds a huge number of inodes (and thus a
>> huge number of dlm_lock_resource objects), lookup performance becomes
>> a problem. The lookup holds a spinlock, which can leave all other CPUs
>> spinning while they try to acquire it. If the lookup holds the lock
>> long enough, a hardware watchdog could reboot the box because it is
>> not fed in time (the feeder has no chance to be scheduled).
>
> Why do you think a dlm res lookup can lock up a CPU for so long that it
> leads to a hardware watchdog reboot?
> I don't object to this. But do you have any test statistics that
> demonstrate your suggestion? I think people are more easily convinced
> if they see some exciting numbers.
> 

There is such a bug. There were more than 1,000,000 inodes in a single
ocfs2 volume, and the system suddenly rebooted. Fortunately we got the
vmcore. Checking the processes running on all CPUs at that time, they
were either in the hash lookup or trying to acquire the spinlock.
Srini and I suspect it was rebooted by the hardware watchdog.

It is ocfs2 1.2 and the hash table size is 14 shift bits. I backported
the patches that enlarge the hash table size to 17, and the customer
did not hit the same problem again.

However, I can't say I have statistics for this.

>>
>>     Enlarging the hash table is the way to speed up the lookup, but we
>> don't know what a good size is: too small and performance is bad; too
>> large and memory is wasted.
>>
>> suggestion:
>>     So I suggest a feature that automatically resizes the
>> dlm_lock_resource hash table, i.e. it grows the hash table according
>> to the number of dlm_lock_resource objects already in it.
>>     The default (smallest) size is 16 in shift bits. When the number
>> of dlm_lock_resource objects reaches 2,500,000, auto-resizing is
>> triggered and the target size is 17. At 5,000,000 it resizes to 18, at
>> 10,000,000 to 19, and so on, though the exact numbers still need to be
>> discussed.
>>     With this we use an appropriate amount of memory at runtime and
>> keep lookup performance good enough.
> So concerning the autosize, have you thought about the rehash process?
> 
> I think if you have reached 250,000 dlm entries, the rehash must hold 
> the spin lock for quite a long time. And as you said above, if the 
> hardware watchdog can even reboot for just one lock's lookup, it surely 
> can't wait for your rehash.
> 

Yes, I have a thought on it. Maybe we can do the rehash in several
cycles: in each cycle we take the spinlock, and between cycles we call
cond_resched() to release the CPU when needed (how many dlm entries to
handle per cycle needs to be discussed). With this, while the rehash is
in progress, a lookup has to check two hash tables: the old one and, if
the entry is not found there, the new one.

thanks,
wengang.

-- 
--just begin to learn, you are never too late...


* [Ocfs2-devel] [SUGGESSTION 1/1] OCFS2: automatic dlm hash table size
  2009-06-08  6:24   ` Wengang Wang
@ 2009-06-08  6:40     ` Tao Ma
  2009-06-08  6:49       ` Wengang Wang
  0 siblings, 1 reply; 7+ messages in thread
From: Tao Ma @ 2009-06-08  6:40 UTC (permalink / raw)
  To: ocfs2-devel

hi wengang,

Wengang Wang wrote:
> Hi Tao,
> 
> pls check inline.
> 
> Tao Ma wrote:
>> Hi Wengang,
>>
>> Regards,
>> Tao
>>
>> wengang wang wrote:
>>> background:
>>>     ocfs2 dlm uses a hash table to store dlm_lock_resource objects.
>>> Lookups on this hash table are performed frequently.
>>>
>>> problem:
>>>     When an ocfs2 volume holds a huge number of inodes (and thus a
>>> huge number of dlm_lock_resource objects), lookup performance becomes
>>> a problem. The lookup holds a spinlock, which can leave all other
>>> CPUs spinning while they try to acquire it. If the lookup holds the
>>> lock long enough, a hardware watchdog could reboot the box because it
>>> is not fed in time (the feeder has no chance to be scheduled).
>>
>> Why do you think a dlm res lookup can lock up a CPU for so long that
>> it leads to a hardware watchdog reboot?
>> I don't object to this. But do you have any test statistics that
>> demonstrate your suggestion? I think people are more easily convinced
>> if they see some exciting numbers.
>>
> 
> There is such a bug. There were more than 1,000,000 inodes in a single
> ocfs2 volume, and the system suddenly rebooted. Fortunately we got the
> vmcore. Checking the processes running on all CPUs at that time, they
> were either in the hash lookup or trying to acquire the spinlock.
> Srini and I suspect it was rebooted by the hardware watchdog.
> 
> It is ocfs2 1.2 and the hash table size is 14 shift bits. I backported
> the patches that enlarge the hash table size to 17, and the customer
> did not hit the same problem again.
> 
> However, I can't say I have statistics for this.

Got it. But I just checked 1.2: it uses PAGE_SIZE, so it should be 12?
And the mainline kernel uses 14. So is that a typo?
> 
>>>
>>>     Enlarging the hash table is the way to speed up the lookup, but
>>> we don't know what a good size is: too small and performance is bad;
>>> too large and memory is wasted.
>>>
>>> suggestion:
>>>     So I suggest a feature that automatically resizes the
>>> dlm_lock_resource hash table, i.e. it grows the hash table according
>>> to the number of dlm_lock_resource objects already in it.
>>>     The default (smallest) size is 16 in shift bits. When the number
>>> of dlm_lock_resource objects reaches 2,500,000, auto-resizing is
>>> triggered and the target size is 17. At 5,000,000 it resizes to 18,
>>> at 10,000,000 to 19, and so on, though the exact numbers still need
>>> to be discussed.
>>>     With this we use an appropriate amount of memory at runtime and
>>> keep lookup performance good enough.
>> So concerning the autosize, have you thought about the rehash process?
>>
>> I think if you have reached 250,000 dlm entries, the rehash must hold 
>> the spin lock for quite a long time. And as you said above, if the 
>> hardware watchdog can even reboot for just one lock's lookup, it 
>> surely can't wait for your rehash.
>>
> 
> Yes, I have a thought on it. Maybe we can do the rehash in several
> cycles: in each cycle we take the spinlock, and between cycles we call
> cond_resched() to release the CPU when needed (how many dlm entries to
> handle per cycle needs to be discussed). With this, while the rehash is
> in progress, a lookup has to check two hash tables: the old one and, if
> the entry is not found there, the new one.

Your description sounds a bit complicated. So why not just increase it,
as you did for the bug above? That is easier and more straightforward.
What's more, even with 18 the table is only 256K, and with the large
memory we have now, 256K is almost nothing. ;)

Regards,
Tao


* [Ocfs2-devel] [SUGGESSTION 1/1] OCFS2: automatic dlm hash table size
  2009-06-08  6:40     ` Tao Ma
@ 2009-06-08  6:49       ` Wengang Wang
  2009-06-08 19:07         ` Sunil Mushran
  0 siblings, 1 reply; 7+ messages in thread
From: Wengang Wang @ 2009-06-08  6:49 UTC (permalink / raw)
  To: ocfs2-devel

Hi Tao,

Tao Ma wrote:
> hi wengang,
> 
> Wengang Wang wrote:
>> Hi Tao,
>>
>> pls check inline.
>>
>> Tao Ma wrote:
>>> Hi Wengang,
>>>
>>> Regards,
>>> Tao
>>>
>>> wengang wang wrote:
>>>> background:
>>>>     ocfs2 dlm uses a hash table to store dlm_lock_resource objects.
>>>> Lookups on this hash table are performed frequently.
>>>>
>>>> problem:
>>>>     When an ocfs2 volume holds a huge number of inodes (and thus a
>>>> huge number of dlm_lock_resource objects), lookup performance
>>>> becomes a problem. The lookup holds a spinlock, which can leave all
>>>> other CPUs spinning while they try to acquire it. If the lookup
>>>> holds the lock long enough, a hardware watchdog could reboot the box
>>>> because it is not fed in time (the feeder has no chance to be
>>>> scheduled).
>>>
>>> Why do you think a dlm res lookup can lock up a CPU for so long that
>>> it leads to a hardware watchdog reboot?
>>> I don't object to this. But do you have any test statistics that
>>> demonstrate your suggestion? I think people are more easily convinced
>>> if they see some exciting numbers.
>>>
>>
>> There is such a bug. There were more than 1,000,000 inodes in a
>> single ocfs2 volume, and the system suddenly rebooted. Fortunately we
>> got the vmcore. Checking the processes running on all CPUs at that
>> time, they were either in the hash lookup or trying to acquire the
>> spinlock. Srini and I suspect it was rebooted by the hardware watchdog.
>>
>> It is ocfs2 1.2 and the hash table size is 14 shift bits. I backported
>> the patches that enlarge the hash table size to 17, and the customer
>> did not hit the same problem again.
>>
>> However, I can't say I have statistics for this.
> Got it. But I just checked 1.2: it uses PAGE_SIZE, so it should be 12?
> And the mainline kernel uses 14. So is that a typo?
>>

Yes, it should be 12: one page on x86.

>>>>
>>>>     Enlarging the hash table is the way to speed up the lookup, but
>>>> we don't know what a good size is: too small and performance is bad;
>>>> too large and memory is wasted.
>>>>
>>>> suggestion:
>>>>     So I suggest a feature that automatically resizes the
>>>> dlm_lock_resource hash table, i.e. it grows the hash table according
>>>> to the number of dlm_lock_resource objects already in it.
>>>>     The default (smallest) size is 16 in shift bits. When the number
>>>> of dlm_lock_resource objects reaches 2,500,000, auto-resizing is
>>>> triggered and the target size is 17. At 5,000,000 it resizes to 18,
>>>> at 10,000,000 to 19, and so on, though the exact numbers still need
>>>> to be discussed.
>>>>     With this we use an appropriate amount of memory at runtime and
>>>> keep lookup performance good enough.
>>> So concerning the autosize, have you thought about the rehash process?
>>>
>>> I think if you have reached 250,000 dlm entries, the rehash must hold 
>>> the spin lock for quite a long time. And as you said above, if the 
>>> hardware watchdog can even reboot for just one lock's lookup, it 
>>> surely can't wait for your rehash.
>>>
>>
>> Yes, I have a thought on it. Maybe we can do the rehash in several
>> cycles: in each cycle we take the spinlock, and between cycles we call
>> cond_resched() to release the CPU when needed (how many dlm entries to
>> handle per cycle needs to be discussed). With this, while the rehash
>> is in progress, a lookup has to check two hash tables: the old one
>> and, if the entry is not found there, the new one.
> Your description sounds a bit complicated. So why not just increase it,
> as you did for the bug above? That is easier and more straightforward.
> What's more, even with 18 the table is only 256K, and with the large
> memory we have now, 256K is almost nothing. ;)

Just increasing it works. I'm concerned about wasting memory in use
cases with few inodes, and I don't know how large the table is going to
be in the future. Even now, I would rather not waste memory, even
though the waste is small and memory is cheap... :)

-- 
--just begin to learn, you are never too late...


* [Ocfs2-devel] [SUGGESSTION 1/1] OCFS2: automatic dlm hash table size
  2009-06-08  6:49       ` Wengang Wang
@ 2009-06-08 19:07         ` Sunil Mushran
  2009-06-09  4:20           ` Wengang Wang
  0 siblings, 1 reply; 7+ messages in thread
From: Sunil Mushran @ 2009-06-08 19:07 UTC (permalink / raw)
  To: ocfs2-devel

Wengang Wang wrote:
> Just increasing it works. I'm concerned about wasting memory in use
> cases with few inodes, and I don't know how large the table is going to
> be in the future. Even now, I would rather not waste memory, even
> though the waste is small and memory is cheap... :)

So, we did discuss dynamic resizing of the lockres hash over a
year ago. At that time our hash was very small. 1 page in 1.2,
and 4 pages in 1.4-beta/mainline. At that time, we decided to
bump up the default in 1.4 to 64 pages.
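
For reference, the sizes mentioned in this thread line up if the "shift bits" figures are read as the table size in bytes (1 << shift), which matches "18 is 256K" and "64 pages". The sketch below assumes 4096-byte pages (x86) and, purely for illustration, 8-byte buckets (one pointer-sized list head on 64-bit). The helper names are hypothetical and not taken from ocfs2.

```c
#include <stdint.h>

#define ASSUMED_PAGE_SIZE    4096u  /* x86 page size, as mentioned in the thread */
#define ASSUMED_BUCKET_SIZE  8u     /* assumption: one pointer-sized list head */

/* Table size in bytes for a given number of shift bits. */
static uint64_t hash_bytes(unsigned int shift)
{
	return (uint64_t)1 << shift;
}

/* Number of pages the table occupies (0 if smaller than one page). */
static uint64_t hash_pages(unsigned int shift)
{
	return hash_bytes(shift) / ASSUMED_PAGE_SIZE;
}

/* Number of hash chains that fit into the table. */
static uint64_t hash_buckets(unsigned int shift)
{
	return hash_bytes(shift) / ASSUMED_BUCKET_SIZE;
}

/* Average chain length (rounded down) for nr_lockres evenly spread entries. */
static uint64_t avg_chain_len(uint64_t nr_lockres, unsigned int shift)
{
	uint64_t buckets = hash_buckets(shift);

	return buckets ? nr_lockres / buckets : 0;
}
```

Under these assumptions, shift 12 is 1 page, shift 14 is 4 pages and shift 18 is 64 pages. With 1,000,000 lockres', an evenly filled 1-page table averages about 1953 entries per chain, while a 64-page table averages about 30.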

Resizing requires a feedback loop. As in... lookup is taking too
much time. I am working on adding instrumentation that provides this
info. (The number of lockres' is too crude a stat.)

Once we have that, I would prefer we make the lock per chain instead
of a global. That will allow us to get more bang for the buck. Will
allow us to reduce the hashtable from 64 pages.
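
A per-chain lock, as opposed to one global lock, could look roughly like the user-space sketch below. It uses a C11 atomic_flag as a stand-in for a kernel spinlock. The structure names, the table size and the helpers are assumptions made for the example, not the ocfs2 implementation.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 1024u            /* hypothetical size; must be a power of two */

struct lockres {                  /* stand-in for dlm_lock_resource */
	uint32_t hash;
	struct lockres *next;
};

struct bucket {
	atomic_flag lock;         /* one lock per chain instead of one global lock */
	struct lockres *head;
};

static struct bucket lockres_hash[NBUCKETS];

static void lockres_hash_init(void)
{
	for (size_t i = 0; i < NBUCKETS; i++) {
		atomic_flag_clear(&lockres_hash[i].lock);
		lockres_hash[i].head = NULL;
	}
}

static void bucket_lock(struct bucket *b)
{
	/* Spin; this only contends with users of the same chain. */
	while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
		;
}

static void bucket_unlock(struct bucket *b)
{
	atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

static void lockres_insert(struct lockres *res)
{
	struct bucket *b = &lockres_hash[res->hash & (NBUCKETS - 1)];

	bucket_lock(b);
	res->next = b->head;
	b->head = res;
	bucket_unlock(b);
}

/* Real code would take a reference on the result before dropping the lock. */
static struct lockres *lockres_lookup(uint32_t hash)
{
	struct bucket *b = &lockres_hash[hash & (NBUCKETS - 1)];
	struct lockres *res;

	bucket_lock(b);
	for (res = b->head; res; res = res->next)
		if (res->hash == hash)
			break;
	bucket_unlock(b);
	return res;
}
```

In this layout the lock lives in the bucket, so each lockres stays the same size. The extra memory is one lock per chain.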

In the end, I am not yet sold on dynamic resizing. One data point is
that inode/dcache hashes are not dynamically resized.


* [Ocfs2-devel] [SUGGESSTION 1/1] OCFS2: automatic dlm hash table size
  2009-06-08 19:07         ` Sunil Mushran
@ 2009-06-09  4:20           ` Wengang Wang
  0 siblings, 0 replies; 7+ messages in thread
From: Wengang Wang @ 2009-06-09  4:20 UTC (permalink / raw)
  To: ocfs2-devel

Sunil,

Sunil Mushran wrote:
> Wengang Wang wrote:
>> Just increasing it works. I'm concerned about wasting memory in use
>> cases with few inodes, and I don't know how large the table is going
>> to be in the future. Even now, I would rather not waste memory, even
>> though the waste is small and memory is cheap... :)
> 
> So, we did discuss dynamic resizing of the lockres hash over a
> year ago. At that time our hash was very small. 1 page in 1.2,
> and 4 pages in 1.4-beta/mainline. At that time, we decided to
> bump up the default in 1.4 to 64 pages.
> 

sorry I missed that.

> Resizing requires a feedback loop. As in... lookup is taking too
> much time. I am working on adding instrumentation that provides this
> info. (The number of lockres' is too crude a stat.)
> 

I don't understand why the lookup would take too much time.
My idea of resizing can follow these steps:

1) An insertion comes in and inserts the lockres into the "current" hash
table. After the insertion, it checks the number of lockres' in the
table. If resizing is needed, it kicks off an ASYNC resize.
2) The ASYNC resize can be handled in a different process (a kernel
thread, e.g. in dlm_thread). The resizing process, in turn:
2.1) allocates pages without holding the spinlock;
2.2) does the actual moving work:
2.2.1) takes the spinlock;
2.2.2) moves a fixed number of lockres' from the "current" hash table
to the new table;
2.2.3) if no lockres' are left in the "current" table, lets "current"
point to the new table;
2.2.4) releases the spinlock;
2.2.5) frees the pages of the table that was "current" before step
2.2.3;
2.2.6) releases the CPU if needed.
3) A lookup comes in. After taking the spinlock, it looks in the
"current" hash table; if the lockres is not found there, it looks in
the new hash table. Then it releases the spinlock.
I can't see where the lookup would take much more time than it did
before resizing. Or did I miss something?

I didn't cover all the details of the resizing, such as flags marking a
resize in progress, when the new table becomes available for use,
recalculating hash values for the new hash table, and so on.
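
The steps above can be sketched as a single-threaded user-space model. This is an illustration under stated assumptions, not dlm code: all names are hypothetical, the points where the spinlock would be held are only marked in comments, and sending insertions to the new table while a resize is running is an assumption the steps do not spell out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct node { uint32_t key; struct node *next; };
struct table { unsigned int shift; struct node **buckets; }; /* 1 << shift buckets */

struct resizable_hash {
	struct table cur;     /* the "current" table */
	struct table next;    /* the new table; valid only while resizing != 0 */
	int resizing;
	size_t migrate_pos;   /* next bucket of cur to drain */
};

/* Recomputed per table, since the two tables differ in size (1 <= shift <= 31). */
static size_t bucket_of(const struct table *t, uint32_t key)
{
	return (size_t)((key * 2654435761u) >> (32 - t->shift));
}

static int table_init(struct table *t, unsigned int shift)
{
	t->shift = shift;
	t->buckets = calloc((size_t)1 << shift, sizeof(*t->buckets));
	return t->buckets ? 0 : -1;
}

static void table_push(struct table *t, struct node *n)
{
	size_t b = bucket_of(t, n->key);

	n->next = t->buckets[b];
	t->buckets[b] = n;
}

static struct node *table_find(const struct table *t, uint32_t key)
{
	struct node *n = t->buckets[bucket_of(t, key)];

	while (n && n->key != key)
		n = n->next;
	return n;
}

/* Step 1. Assumption: during a resize, new entries go straight to the new table. */
static void hash_insert(struct resizable_hash *h, struct node *n)
{
	table_push(h->resizing ? &h->next : &h->cur, n);
}

/* Step 3: look in "current" first, then in the new table if one exists. */
static struct node *hash_lookup(const struct resizable_hash *h, uint32_t key)
{
	struct node *n = table_find(&h->cur, key);

	return (n || !h->resizing) ? n : table_find(&h->next, key);
}

/* Step 2.1: allocate the bigger table; in the kernel, without the spinlock held. */
static int hash_resize_start(struct resizable_hash *h)
{
	if (h->resizing)
		return 0;             /* a resize is already in progress */
	if (table_init(&h->next, h->cur.shift + 1))
		return -1;
	h->migrate_pos = 0;
	h->resizing = 1;
	return 0;
}

/*
 * Steps 2.2.1-2.2.5: move at most `batch` entries (batch > 0); the spinlock
 * would be held for one call. Returns 1 while entries remain in the old
 * table, 0 once "current" points to the new table.
 */
static int hash_resize_step(struct resizable_hash *h, size_t batch)
{
	if (!h->resizing)
		return 0;
	while (h->migrate_pos < ((size_t)1 << h->cur.shift)) {
		struct node *n = h->cur.buckets[h->migrate_pos];

		if (!n) {
			h->migrate_pos++;     /* bucket drained, go to the next one */
			continue;
		}
		if (batch == 0)
			return 1;             /* budget used up; caller yields the CPU */
		batch--;
		h->cur.buckets[h->migrate_pos] = n->next;
		table_push(&h->next, n);
	}
	free(h->cur.buckets);         /* 2.2.5: free the old table */
	h->cur = h->next;             /* 2.2.3: "current" becomes the new table */
	h->resizing = 0;
	return 0;
}
```

A caller would loop on hash_resize_step() and yield between calls (step 2.2.6). Note that walking empty buckets is not counted against the batch in this sketch, so the time spent per call is only roughly bounded.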

> Once we have that, I would prefer we make the lock per chain instead
> of a global. That will allow us to get more bang for the buck. Will
> allow us to reduce the hashtable from 64 pages.
> 

That is a smart way to go :).
However, I think we shouldn't add too much to the lockres structure.
If we do, the memory used by the newly added fields will be much more
than the memory used for enlarging the hash table.

> In the end, I am not yet sold on dynamic resizing. One data point is
> that inode/dcache hashes are not dynamically resized.

:), but the discussion is interesting!

thanks,
wengang.
-- 
--just begin to learn, you are never too late...
