* [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1
@ 2008-02-22 0:42 Tao Ma
2008-02-22 0:48 ` [Ocfs2-devel] [PATCH 1/3] Add a new parameter for ocfs2_reserve_suballoc_bits.V1 Tao Ma
` (4 more replies)
0 siblings, 5 replies; 11+ messages in thread
From: Tao Ma @ 2008-02-22 0:42 UTC (permalink / raw)
To: ocfs2-devel
Hi all,
This patch set improves the method for inode allocation. It is currently
divided into 3 small patches, but I think they could be merged
together into one. Any comments are welcome.
In OCFS2, we allocate inodes from a slot-specific inode_alloc to avoid
inode creation congestion. The local alloc file grows in large contiguous
chunks. With a 4K block size, it grows by 4M each time, so 1024 inodes are
allocated at a time.
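As a quick check of that arithmetic, here is a minimal sketch. It assumes
one inode per block, which is what the 4M / 4K = 1024 figures imply;
inodes_per_growth() is a hypothetical helper name, not an OCFS2 function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of inodes gained by one growth step,
 * assuming each inode occupies exactly one block.
 */
static uint64_t inodes_per_growth(uint64_t chunk_bytes, uint64_t block_bytes)
{
	assert(block_bytes > 0);
	return chunk_bytes / block_bytes;
}
```

With a 4M chunk and a 4K block size this gives 1024, matching the number
quoted above.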
Over time, if the fs gets fragmented enough (e.g., the user has created many
small files and then deleted some of them), we can end up in a situation
where we cannot extend the inode_alloc because there is no large free chunk
in the global_bitmap, even if df shows a few gigs free. More annoying is
that this situation invariably means that inodes cannot be created on one
node but can be from another node. Still more annoying is that an unused
slot may have space for plenty of inodes, but that space is unusable because
the user may no longer be mounting as many nodes.
This patch series implements a solution, which is to steal inodes from another
slot. Now the whole inode allocation process looks like this:
1. Allocate from its own inode_alloc:000X
   1) If we can reserve, OK.
   2) If that fails, try to allocate a large chunk and reserve once again.
2. If 1 fails, try to allocate from the last node's inode_alloc. This time,
   just try to reserve; we don't go to the global_bitmap if this inode_alloc
   also can't allocate the inode.
3. If 2 fails, try the node before it, and so on until we reach
   inode_alloc:0000. In the process, we skip our own inode_alloc.
4. If 3 fails, try to allocate from its own inode_alloc:000X once again. There
   is a chance that the global_bitmap has gained a large enough chunk during
   the inode_alloc iteration process.
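
For illustration only, the order in steps 1-4 can be modelled in user space
as below. This is a sketch under stated assumptions, not the kernel code:
struct fs_model, model_reserve() and model_reserve_new_inode() are
hypothetical names, the global_bitmap is reduced to a counter of available
large chunks, and each growth is assumed to add 1024 inodes (the 4K block
size case).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_MAX_SLOTS 8      /* arbitrary cap for this model */
#define CHUNK_INODES    1024u  /* inodes gained per growth, 4K block size case */

/* Hypothetical user-space model, not a kernel structure. */
struct fs_model {
	int max_slots;
	unsigned int free_inodes[MODEL_MAX_SLOTS]; /* per inode_alloc:000X */
	unsigned int global_chunks; /* large free chunks left in global_bitmap */
};

/* Reserve one inode from a slot; grow it from global_bitmap only if allowed. */
static int model_reserve(struct fs_model *fs, int slot, bool alloc_new_group)
{
	if (fs->free_inodes[slot] == 0) {
		if (!alloc_new_group || fs->global_chunks == 0)
			return -ENOSPC;
		fs->global_chunks--;
		fs->free_inodes[slot] += CHUNK_INODES;
	}
	fs->free_inodes[slot]--;
	return 0;
}

/* Returns 0 and sets *used_slot on success, -ENOSPC if every attempt fails. */
static int model_reserve_new_inode(struct fs_model *fs, int own_slot,
				   int *used_slot)
{
	int i;

	assert(fs->max_slots > 0 && fs->max_slots <= MODEL_MAX_SLOTS);
	assert(own_slot >= 0 && own_slot < fs->max_slots);

	/* Step 1: own slot, allowed to grow from global_bitmap. */
	if (model_reserve(fs, own_slot, true) == 0) {
		*used_slot = own_slot;
		return 0;
	}

	/* Steps 2 and 3: last slot down to slot 0, skipping our own; no growth. */
	for (i = fs->max_slots - 1; i >= 0; i--) {
		if (i == own_slot)
			continue;
		if (model_reserve(fs, i, false) == 0) {
			*used_slot = i;
			return 0;
		}
	}

	/* Step 4: own slot once more, in case global_bitmap has a chunk now. */
	if (model_reserve(fs, own_slot, true) == 0) {
		*used_slot = own_slot;
		return 0;
	}
	return -ENOSPC;
}
```

Note that in this single-threaded model nothing can free space between
step 1 and step 4, so step 4 only matters when other nodes change the
global_bitmap concurrently.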
* [Ocfs2-devel] [PATCH 1/3] Add a new parameter for ocfs2_reserve_suballoc_bits.V1
2008-02-22 0:42 [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1 Tao Ma
@ 2008-02-22 0:48 ` Tao Ma
2008-02-22 0:49 ` [Ocfs2-devel] [PATCH 2/3] Add ac_alloc_slot in ocfs2_alloc_context.V1 Tao Ma
` (3 subsequent siblings)
4 siblings, 0 replies; 11+ messages in thread
From: Tao Ma @ 2008-02-22 0:48 UTC (permalink / raw)
To: ocfs2-devel
In some cases (e.g. inode stealing from other nodes), we may not want
ocfs2_reserve_suballoc_bits to allocate new groups from the
global_bitmap since it may already be full. So add a new parameter
to control this.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
---
fs/ocfs2/suballoc.c | 22 ++++++++++++++++++----
1 files changed, 18 insertions(+), 4 deletions(-)
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 72c198a..3be4e73 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -46,6 +46,9 @@
#include "buffer_head_io.h"
+#define NOT_ALLOC_NEW_GROUP 0
+#define ALLOC_NEW_GROUP 1
+
static inline void ocfs2_debug_bg(struct ocfs2_group_desc *bg);
static inline void ocfs2_debug_suballoc_inode(struct ocfs2_dinode *fe);
static inline u16 ocfs2_find_victim_chain(struct ocfs2_chain_list *cl);
@@ -391,7 +394,8 @@ bail:
static int ocfs2_reserve_suballoc_bits(struct ocfs2_super *osb,
struct ocfs2_alloc_context *ac,
int type,
- u32 slot)
+ u32 slot,
+ int alloc_new_group)
{
int status;
u32 bits_wanted = ac->ac_bits_wanted;
@@ -446,6 +450,14 @@ static int ocfs2_reserve_suballoc_bits(struct ocfs2_super *osb,
goto bail;
}
+ if (alloc_new_group != ALLOC_NEW_GROUP) {
+ mlog(0, "Alloc File %u Full: wanted=%u, free_bits=%u, "
+ "and we don't alloc a new group for it.\n",
+ slot, bits_wanted, free_bits);
+ status = -ENOSPC;
+ goto bail;
+ }
+
status = ocfs2_block_group_alloc(osb, alloc_inode, bh);
if (status < 0) {
if (status != -ENOSPC)
@@ -490,7 +502,8 @@ int ocfs2_reserve_new_metadata(struct ocfs2_super *osb,
(*ac)->ac_group_search = ocfs2_block_group_search;
status = ocfs2_reserve_suballoc_bits(osb, (*ac),
- EXTENT_ALLOC_SYSTEM_INODE, slot);
+ EXTENT_ALLOC_SYSTEM_INODE,
+ slot, ALLOC_NEW_GROUP);
if (status < 0) {
if (status != -ENOSPC)
mlog_errno(status);
@@ -527,7 +540,7 @@ int ocfs2_reserve_new_inode(struct ocfs2_super *osb,
status = ocfs2_reserve_suballoc_bits(osb, *ac,
INODE_ALLOC_SYSTEM_INODE,
- osb->slot_num);
+ osb->slot_num, ALLOC_NEW_GROUP);
if (status < 0) {
if (status != -ENOSPC)
mlog_errno(status);
@@ -557,7 +570,8 @@ int ocfs2_reserve_cluster_bitmap_bits(struct ocfs2_super *osb,
status = ocfs2_reserve_suballoc_bits(osb, ac,
GLOBAL_BITMAP_SYSTEM_INODE,
- OCFS2_INVALID_SLOT);
+ OCFS2_INVALID_SLOT,
+ ALLOC_NEW_GROUP);
if (status < 0 && status != -ENOSPC) {
mlog_errno(status);
goto bail;
--
1.5.3.GIT
* [Ocfs2-devel] [PATCH 2/3] Add ac_alloc_slot in ocfs2_alloc_context.V1
2008-02-22 0:42 [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1 Tao Ma
2008-02-22 0:48 ` [Ocfs2-devel] [PATCH 1/3] Add a new parameter for ocfs2_reserve_suballoc_bits.V1 Tao Ma
@ 2008-02-22 0:49 ` Tao Ma
2008-02-22 0:49 ` [Ocfs2-devel] [PATCH 3/3] Add inode stealing for ocfs2_reserve_new_inode.V1 Tao Ma
` (2 subsequent siblings)
4 siblings, 0 replies; 11+ messages in thread
From: Tao Ma @ 2008-02-22 0:49 UTC (permalink / raw)
To: ocfs2-devel
In inode stealing, we no longer restrict the allocation to
the local node. So it is necessary to add a new member to
ocfs2_alloc_context to indicate which slot we are using for
allocation. We also modify the local alloc path so that this
member is set there as well.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
---
fs/ocfs2/localalloc.c | 1 +
fs/ocfs2/suballoc.c | 1 +
fs/ocfs2/suballoc.h | 1 +
3 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index add1ffd..250b4bc 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -526,6 +526,7 @@ int ocfs2_reserve_local_alloc_bits(struct ocfs2_super *osb,
}
ac->ac_inode = local_alloc_inode;
+ ac->ac_alloc_slot = osb->slot_num;
ac->ac_which = OCFS2_AC_USE_LOCAL;
get_bh(osb->local_alloc_bh);
ac->ac_bh = osb->local_alloc_bh;
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 3be4e73..33d5573 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -424,6 +424,7 @@ static int ocfs2_reserve_suballoc_bits(struct ocfs2_super *osb,
}
ac->ac_inode = alloc_inode;
+ ac->ac_alloc_slot = slot;
fe = (struct ocfs2_dinode *) bh->b_data;
if (!OCFS2_IS_VALID_DINODE(fe)) {
diff --git a/fs/ocfs2/suballoc.h b/fs/ocfs2/suballoc.h
index 8799033..544c600 100644
--- a/fs/ocfs2/suballoc.h
+++ b/fs/ocfs2/suballoc.h
@@ -36,6 +36,7 @@ typedef int (group_search_t)(struct inode *,
struct ocfs2_alloc_context {
struct inode *ac_inode; /* which bitmap are we allocating from? */
struct buffer_head *ac_bh; /* file entry bh */
+ u32 ac_alloc_slot; /* which slot are we allocating from? */
u32 ac_bits_wanted;
u32 ac_bits_given;
#define OCFS2_AC_USE_LOCAL 1
--
1.5.3.GIT
* [Ocfs2-devel] [PATCH 3/3] Add inode stealing for ocfs2_reserve_new_inode.V1
2008-02-22 0:42 [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1 Tao Ma
2008-02-22 0:48 ` [Ocfs2-devel] [PATCH 1/3] Add a new parameter for ocfs2_reserve_suballoc_bits.V1 Tao Ma
2008-02-22 0:49 ` [Ocfs2-devel] [PATCH 2/3] Add ac_alloc_slot in ocfs2_alloc_context.V1 Tao Ma
@ 2008-02-22 0:49 ` Tao Ma
2008-02-22 0:57 ` [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1 wengang wang
2008-02-22 15:09 ` Mark Fasheh
4 siblings, 0 replies; 11+ messages in thread
From: Tao Ma @ 2008-02-22 0:49 UTC (permalink / raw)
To: ocfs2-devel
Add inode stealing for ocfs2_reserve_new_inode. Now the whole process is:
1. Allocate from its own inode_alloc:000X
   1) If we can reserve, OK.
   2) If that fails, try to allocate a large chunk and reserve once again.
2. If 1 fails, try to allocate from the last node's inode_alloc. This time,
   just try to reserve; we don't go to the global_bitmap if this inode_alloc
   also can't allocate the inode.
3. If 2 fails, try the node before it, and so on until we reach
   inode_alloc:0000. In the process, we skip our own inode_alloc.
4. If 3 fails, try to allocate from its own inode_alloc:000X once again. There
   is a chance that the global_bitmap has gained a large enough chunk during
   the inode_alloc iteration process.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
---
fs/ocfs2/namei.c | 2 +-
fs/ocfs2/suballoc.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 56 insertions(+), 3 deletions(-)
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index ae9ad95..ab5a227 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -424,7 +424,7 @@ static int ocfs2_mknod_locked(struct ocfs2_super *osb,
fe->i_fs_generation = cpu_to_le32(osb->fs_generation);
fe->i_blkno = cpu_to_le64(fe_blkno);
fe->i_suballoc_bit = cpu_to_le16(suballoc_bit);
- fe->i_suballoc_slot = cpu_to_le16(osb->slot_num);
+ fe->i_suballoc_slot = cpu_to_le16(inode_ac->ac_alloc_slot);
fe->i_uid = cpu_to_le32(current->fsuid);
if (dir->i_mode & S_ISGID) {
fe->i_gid = cpu_to_le32(dir->i_gid);
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 33d5573..cf89ce3 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -109,7 +109,7 @@ static inline void ocfs2_block_to_cluster_group(struct inode *inode,
u64 *bg_blkno,
u16 *bg_bit_off);
-void ocfs2_free_alloc_context(struct ocfs2_alloc_context *ac)
+static void ocfs2_free_ac_resource(struct ocfs2_alloc_context *ac)
{
struct inode *inode = ac->ac_inode;
@@ -120,9 +120,17 @@ void ocfs2_free_alloc_context(struct ocfs2_alloc_context *ac)
mutex_unlock(&inode->i_mutex);
iput(inode);
+ ac->ac_inode = NULL;
}
- if (ac->ac_bh)
+ if (ac->ac_bh) {
brelse(ac->ac_bh);
+ ac->ac_bh = NULL;
+ }
+}
+
+void ocfs2_free_alloc_context(struct ocfs2_alloc_context *ac)
+{
+ ocfs2_free_ac_resource(ac);
kfree(ac);
}
@@ -522,6 +530,28 @@ bail:
return status;
}
+static int ocfs2_steal_inode_from_other_nodes(struct ocfs2_super *osb,
+ struct ocfs2_alloc_context *ac,
+ int slot)
+{
+ int status = -ENOSPC, i;
+
+ for (i = osb->max_slots - 1; i >= 0; i--) {
+ if (i == slot)
+ continue;
+
+ status = ocfs2_reserve_suballoc_bits(osb, ac,
+ INODE_ALLOC_SYSTEM_INODE,
+ i, NOT_ALLOC_NEW_GROUP);
+ if (status >= 0)
+ break;
+
+ ocfs2_free_ac_resource(ac);
+ }
+
+ return status;
+}
+
int ocfs2_reserve_new_inode(struct ocfs2_super *osb,
struct ocfs2_alloc_context **ac)
{
@@ -542,6 +572,29 @@ int ocfs2_reserve_new_inode(struct ocfs2_super *osb,
status = ocfs2_reserve_suballoc_bits(osb, *ac,
INODE_ALLOC_SYSTEM_INODE,
osb->slot_num, ALLOC_NEW_GROUP);
+ if (status >= 0) {
+ status = 0;
+ goto bail;
+ } else if (status < 0 && status != -ENOSPC) {
+ mlog_errno(status);
+ goto bail;
+ }
+
+ ocfs2_free_ac_resource(*ac);
+
+ status = ocfs2_steal_inode_from_other_nodes(osb, *ac, osb->slot_num);
+ if (status >= 0) {
+ status = 0;
+ goto bail;
+ }
+
+ /*
+ * We can't steal inode from other nodes, so try to allocate it from
+ * our own once again.
+ */
+ status = ocfs2_reserve_suballoc_bits(osb, *ac,
+ INODE_ALLOC_SYSTEM_INODE,
+ osb->slot_num, ALLOC_NEW_GROUP);
if (status < 0) {
if (status != -ENOSPC)
mlog_errno(status);
--
1.5.3.GIT
* [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1
2008-02-22 0:42 [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1 Tao Ma
` (2 preceding siblings ...)
2008-02-22 0:49 ` [Ocfs2-devel] [PATCH 3/3] Add inode stealing for ocfs2_reserve_new_inode.V1 Tao Ma
@ 2008-02-22 0:57 ` wengang wang
2008-02-22 1:03 ` tao.ma
2008-02-22 10:30 ` Sunil Mushran
2008-02-22 15:09 ` Mark Fasheh
4 siblings, 2 replies; 11+ messages in thread
From: wengang wang @ 2008-02-22 0:57 UTC (permalink / raw)
To: ocfs2-devel
I'm not certain, but I remember that when extending a file, metadata is
allocated from extent_alloc rather than inode_alloc if necessary (correct me
if I am wrong).
If so, do we need to take extent_alloc into consideration as well?
thanks,
wengang.
Tao Ma wrote:
> Hi all,
> This patch set improve the method for inode allocation. Now they
> are divided into 3 small patches, but I think maybe they can be merged
> together as one. Any comments are welcomed.
>
> In OCFS2, we allocate the inodes from slot specific inode_alloc to avoid
> inode creation congestion. The local alloc file grows in a large contiguous
> chunk. As for a 4K bs, it grows 4M every time. So 1024 inodes will be
> allocated at a time.
>
> Over time, if the fs gets fragmented enough(e.g, the user has created many
> small files and also delete some of them), we can end up in a situation,
> whereby we cannot extend the inode_alloc as we don't have a large chunk
> free in the global_bitmap even if df shows few gigs free. More annoying is
> that this situation will invariably mean that while one cannot create inodes
> on one node but can from another node. Still more annoying is that an unused
> slot may have space for plenty of inodes but is unusable as the user may not
> be mounting as many nodes anymore.
>
> This patch series implement a solution which is to steal inodes from another
> slot. Now the whole inode allocation process looks like this:
> 1. Allocate from its own inode_alloc:000X
> 1) If we can reserve, OK.
> 2) If fails, try to allocate a large chunk and reserve once again.
> 2. If 1 fails, try to allocate from the last node's inode_alloc. This time,
> Just try to reserve, we don't go for global_bitmap if this inode also
> can't allocate the inode.
> 3. If 2 fails, try the node before it until we reach inode_alloc:0000.
> In the process, we will skip its own inode_alloc.
> 4. If 3 fails, try to allocate from its own inode_alloc:000X once again. Here
> is a chance that the global_bitmap may has a large enough chunk now during
> the inode iteration process.
>
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-devel
>
--
Wengang Wang
Member of Technical Staff
Oracle Asia R&D Center
Open Source Technologies Development
Tel: +86 10 8278 6265
Mobile: +86 13381078925
* [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1
2008-02-22 0:57 ` [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1 wengang wang
@ 2008-02-22 1:03 ` tao.ma
2008-02-22 1:17 ` wengang wang
2008-02-22 10:30 ` Sunil Mushran
1 sibling, 1 reply; 11+ messages in thread
From: tao.ma @ 2008-02-22 1:03 UTC (permalink / raw)
To: ocfs2-devel
wengang wang wrote:
> not know it clearly, but I remember when extending a file, meta is
> allocated in extent_alloc instead of inode_alloc if necessary(correct me
> if i am wrong).
Yes, extending a file uses extent_alloc. But I'm stealing inodes, not
extent blocks, so there is no need to touch extent_alloc.
> if so, do we need to take extent_alloc into consideration as well?
This solution is just for inodes, not extents.
>
> thanks,
> wengang.
>
* [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1
2008-02-22 1:03 ` tao.ma
@ 2008-02-22 1:17 ` wengang wang
2008-02-22 1:26 ` tao.ma
0 siblings, 1 reply; 11+ messages in thread
From: wengang wang @ 2008-02-22 1:17 UTC (permalink / raw)
To: ocfs2-devel
tao.ma wrote:
>
>
> wengang wang wrote:
>> not know it clearly, but I remember when extending a file, meta is
>> allocated in extent_alloc instead of inode_alloc if necessary(correct
>> me if i am wrong).
> yes, extending a file using extent_alloc. But I'm stealing inode not
> extent block. So don't need to touch extent_alloc.
>
Extent blocks don't need to be stolen? ;)
>> if so, do we need to take extent_alloc into consideration as well?
> This solution is just for inodes, not extents.
>>
>> thanks,
>> wengang.
>>
* [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1
2008-02-22 1:17 ` wengang wang
@ 2008-02-22 1:26 ` tao.ma
0 siblings, 0 replies; 11+ messages in thread
From: tao.ma @ 2008-02-22 1:26 UTC (permalink / raw)
To: ocfs2-devel
wengang wang wrote:
> tao.ma wrote:
>>
>>
>> wengang wang wrote:
>>> not know it clearly, but I remember when extending a file, meta is
>>> allocated in extent_alloc instead of inode_alloc if necessary(correct
>>> me if i am wrong).
>> yes, extending a file using extent_alloc. But I'm stealing inode not
>> extent block. So don't need to touch extent_alloc.
>>
> extent block needs not being stolen? ;)
This patch series is only for inode stealing (see the patch title), so no
extent block or extent alloc is touched.
>>> if so, do we need to take extent_alloc into consideration as well?
>> This solution is just for inodes, not extents.
>>>
>>> thanks,
>>> wengang.
>>>
>
* [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1
2008-02-22 0:57 ` [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1 wengang wang
2008-02-22 1:03 ` tao.ma
@ 2008-02-22 10:30 ` Sunil Mushran
1 sibling, 0 replies; 11+ messages in thread
From: Sunil Mushran @ 2008-02-22 10:30 UTC (permalink / raw)
To: ocfs2-devel
True... however, in 1.2 (or anything before 2.6.23), only extent_alloc:0000
is used by all nodes. This was done to avoid deadlocks during truncate.
From 2.6.23 or 2.6.24 onwards, Mark added code to allow use of all
extent_alloc files, after first adding code to prevent deadlocks during
truncate.
In general, allocations from extent_alloc are not that common, as we have
fairly flat trees. If this does become an issue, we will handle it
similarly.
wengang wang wrote:
> not know it clearly, but I remember when extending a file, meta is
> allocated in extent_alloc instead of inode_alloc if necessary(correct
> me if i am wrong).
> if so, do we need to take extent_alloc into consideration as well?
>
> thanks,
> wengang.
>
> Tao Ma wrote:
>> Hi all,
>> This patch set improve the method for inode allocation. Now they
>> are divided into 3 small patches, but I think maybe they can be merged
>> together as one. Any comments are welcomed.
>>
>> In OCFS2, we allocate the inodes from slot specific inode_alloc to avoid
>> inode creation congestion. The local alloc file grows in a large
>> contiguous
>> chunk. As for a 4K bs, it grows 4M every time. So 1024 inodes will be
>> allocated at a time.
>>
>> Over time, if the fs gets fragmented enough(e.g, the user has created
>> many
>> small files and also delete some of them), we can end up in a situation,
>> whereby we cannot extend the inode_alloc as we don't have a large chunk
>> free in the global_bitmap even if df shows few gigs free. More
>> annoying is
>> that this situation will invariably mean that while one cannot create
>> inodes
>> on one node but can from another node. Still more annoying is that an
>> unused
>> slot may have space for plenty of inodes but is unusable as the user
>> may not
>> be mounting as many nodes anymore.
>>
>> This patch series implement a solution which is to steal inodes from
>> another
>> slot. Now the whole inode allocation process looks like this:
>> 1. Allocate from its own inode_alloc:000X
>> 1) If we can reserve, OK.
>> 2) If fails, try to allocate a large chunk and reserve once again.
>> 2. If 1 fails, try to allocate from the last node's inode_alloc. This
>> time,
>> Just try to reserve, we don't go for global_bitmap if this inode also
>> can't allocate the inode.
>> 3. If 2 fails, try the node before it until we reach inode_alloc:0000.
>> In the process, we will skip its own inode_alloc.
>> 4. If 3 fails, try to allocate from its own inode_alloc:000X once
>> again. Here
>> is a chance that the global_bitmap may has a large enough chunk
>> now during
>> the inode iteration process.
>>
>
* [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1
2008-02-22 0:42 [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1 Tao Ma
` (3 preceding siblings ...)
2008-02-22 0:57 ` [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1 wengang wang
@ 2008-02-22 15:09 ` Mark Fasheh
2008-02-22 16:11 ` Tao Ma
4 siblings, 1 reply; 11+ messages in thread
From: Mark Fasheh @ 2008-02-22 15:09 UTC (permalink / raw)
To: ocfs2-devel
On Fri, Feb 22, 2008 at 04:41:49PM +0800, tao.ma wrote:
> This patch set improve the method for inode allocation. Now they
> are divided into 3 small patches, but I think maybe they can be merged
> together as one. Any comments are welcomed.
Thank you for the thorough description. One thing that was left out - could
you give me a short description of how these changes were tested?
> In OCFS2, we allocate the inodes from slot specific inode_alloc to avoid
> inode creation congestion. The local alloc file grows in a large contiguous
> chunk. As for a 4K bs, it grows 4M every time. So 1024 inodes will be
> allocated at a time.
>
> Over time, if the fs gets fragmented enough(e.g, the user has created many
> small files and also delete some of them), we can end up in a situation,
> whereby we cannot extend the inode_alloc as we don't have a large chunk
> free in the global_bitmap even if df shows few gigs free. More annoying is
> that this situation will invariably mean that while one cannot create inodes
> on one node but can from another node. Still more annoying is that an unused
> slot may have space for plenty of inodes but is unusable as the user may not
> be mounting as many nodes anymore.
>
> This patch series implement a solution which is to steal inodes from another
> slot. Now the whole inode allocation process looks like this:
> 1. Allocate from its own inode_alloc:000X
> 1) If we can reserve, OK.
> 2) If fails, try to allocate a large chunk and reserve once again.
Do you have a mechanism in place to remember which inode alloc file you were
last able to successfully allocate from? If you did that, then we could avoid
needlessly searching our own slot every time.
You could even reset your "last inode alloc slot" pointer to the local slot
when space is freed from the local allocator.
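
A minimal sketch of what such a hint could look like, purely as an
assumption for discussion: struct steal_hint and the hint_*() helpers are
hypothetical names and are not part of the posted patches. It only models
the bookkeeping (which slot to try first), not locking or the allocation
itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-mount hint; not an existing OCFS2 structure. */
struct steal_hint {
	int own_slot;   /* this node's slot */
	int last_slot;  /* slot we last allocated from successfully */
};

/* Start by preferring our own slot. */
static void hint_init(struct steal_hint *h, int own_slot)
{
	assert(h != NULL && own_slot >= 0);
	h->own_slot = own_slot;
	h->last_slot = own_slot;
}

/* Slot to try first on the next inode allocation. */
static int hint_first_slot(const struct steal_hint *h)
{
	assert(h != NULL);
	return h->last_slot;
}

/* Remember where the last successful allocation came from. */
static void hint_record_success(struct steal_hint *h, int slot)
{
	assert(h != NULL && slot >= 0);
	h->last_slot = slot;
}

/* Space was freed in the local allocator: go back to our own slot. */
static void hint_local_space_freed(struct steal_hint *h)
{
	assert(h != NULL);
	h->last_slot = h->own_slot;
}
```

The idea is that a node whose own inode_alloc is full stops paying for a
failed local search on every create, and returns to its own slot as soon
as a local free makes that worthwhile.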
> 2. If 1 fails, try to allocate from the last node's inode_alloc. This time,
> Just try to reserve, we don't go for global_bitmap if this inode also
> can't allocate the inode.
Does every node go to the same inode allocator after its own? Wouldn't this
create a lot of traffic in one slot?
Why not search inode alloc in the next slot and loop back until you reach
yours again? So, if the node's slot is '3' and max slots is 6, it'd search
4, 5, 0, 1, 2 before giving up.
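
A small sketch of that wrap-around order, for illustration only.
steal_order_round_robin() is a hypothetical helper, not a function from
the patches; it just lists the slots in the order they would be tried.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: fill 'order' with the slots to try, starting at
 * the slot after own_slot and wrapping around, never including own_slot.
 * 'order' must have room for max_slots - 1 entries.
 * Returns the number of entries written.
 */
static size_t steal_order_round_robin(int own_slot, int max_slots, int *order)
{
	size_t n = 0;
	int step;

	assert(order != NULL);
	assert(max_slots > 0 && own_slot >= 0 && own_slot < max_slots);

	for (step = 1; step < max_slots; step++)
		order[n++] = (own_slot + step) % max_slots;

	return n;
}
```

For own_slot 3 and max_slots 6 this yields 4, 5, 0, 1, 2, so each node
starts stealing from a different neighbour instead of all nodes hitting
the same slot first.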
> 3. If 2 fails, try the node before it until we reach inode_alloc:0000.
> In the process, we will skip its own inode_alloc.
> 4. If 3 fails, try to allocate from its own inode_alloc:000X once again. Here
> is a chance that the global_bitmap may has a large enough chunk now during
> the inode iteration process.
What are the chances that the global bitmap emptied enough in the time it
took us to search the other allocators? It doesn't seem like that would
happen very much, so I wouldn't bother with this last step unless we had
evidence that it would make a real difference.
--Mark
--
Mark Fasheh
Principal Software Developer, Oracle
mark.fasheh@oracle.com
* [Ocfs2-devel] [PATCH 0/3] Add inode stealing for ocfs2.V1
2008-02-22 15:09 ` Mark Fasheh
@ 2008-02-22 16:11 ` Tao Ma
0 siblings, 0 replies; 11+ messages in thread
From: Tao Ma @ 2008-02-22 16:11 UTC (permalink / raw)
To: ocfs2-devel
Mark Fasheh Wrote:
> On Fri, Feb 22, 2008 at 04:41:49PM +0800, tao.ma wrote:
>
>> This patch set improve the method for inode allocation. Now they
>> are divided into 3 small patches, but I think maybe they can be merged
>> together as one. Any comments are welcomed.
>>
>
> Thank you for the thorough description. One thing that was left out - could
> you give me a short description of how these changes were tested?
>
I have created a test script. It creates some inodes from other nodes,
then uses up all the space in the volume and all the inodes in this
node's local inode_alloc. Then it tries to allocate from the other nodes.
I use debugfs.ocfs2 to check that "i_suballoc_slot" of the newly created
inode points to the right slot, and then delete the inode to be sure the
kernel handles it correctly. At the end, the volume is unmounted and
fscked to catch any possible errors.
Since this patch is only V1, I'm ready for any comments and will modify
the test script to match any changes.
>
>
>> In OCFS2, we allocate the inodes from slot specific inode_alloc to avoid
>> inode creation congestion. The local alloc file grows in a large contiguous
>> chunk. As for a 4K bs, it grows 4M every time. So 1024 inodes will be
>> allocated at a time.
>>
>> Over time, if the fs gets fragmented enough(e.g, the user has created many
>> small files and also delete some of them), we can end up in a situation,
>> whereby we cannot extend the inode_alloc as we don't have a large chunk
>> free in the global_bitmap even if df shows few gigs free. More annoying is
>> that this situation will invariably mean that while one cannot create inodes
>> on one node but can from another node. Still more annoying is that an unused
>> slot may have space for plenty of inodes but is unusable as the user may not
>> be mounting as many nodes anymore.
>>
>> This patch series implement a solution which is to steal inodes from another
>> slot. Now the whole inode allocation process looks like this:
>> 1. Allocate from its own inode_alloc:000X
>> 1) If we can reserve, OK.
>> 2) If fails, try to allocate a large chunk and reserve once again.
>>
>
> Do you have a mechanism in place to remember which inode alloc file you were
> last able to sucessfully allocate from? If you did that, then we could avoid
> needlessly searching our own slot every time.
>
> You could even reset your "last inode alloc slot" pointer to the local slot
> when space is freed from the local allocator.
>
You are right. I don't have this mechanism. I will investigate it and
see how it could work. Thanks.
>
>
>> 2. If 1 fails, try to allocate from the last node's inode_alloc. This time,
>> Just try to reserve, we don't go for global_bitmap if this inode also
>> can't allocate the inode.
>>
>
> Does every node go to the same inode allocator after it's own? Wouldn't this
> create a lot of traffic in one slot?
>
> Why not search inode alloc in the next slot and loop back until you reach
> yours again? So, if the nodes slot is '3' and max slots is 6, it'd search
> 4, 5, 0, 1, 2 before giving up.
>
I'm not sure whether your suggestion is reasonable. I start from the last
node because:
1. It is not used as often as the others.
2. If there is only one node whose inode alloc is full, it will only
contact the last node, so the congestion will be mainly between this
one and the last one.
3. If there are more nodes whose inode allocs are full, there is a very
large chance that all the mounted ones are full, so the first few
inode alloc attempts may just fail until we reach a really empty node. And I
think the node with the largest chance of "being empty" is the last
node.
Does that make sense?
So I think that if I add the mechanism of recording the "last inode alloc
slot", it should work OK.
Comments?
>
>
>> 3. If 2 fails, try the node before it until we reach inode_alloc:0000.
>> In the process, we will skip its own inode_alloc.
>>
>
>
>> 4. If 3 fails, try to allocate from its own inode_alloc:000X once again. Here
>> is a chance that the global_bitmap may has a large enough chunk now during
>> the inode iteration process.
>>
>
> What are the chances that the global bitmap emptied enough in the time it
> took us to search the other allocators? It doesn't seem like that would
> happen very much, so I wouldn't bother with this last step unless we had
> evidence that it would make a real difference.
>
OK, I will try to find out whether there is a real scenario. If there is
none, I will remove this.
Regards,
Tao