* [RFC, PATCH] Reservation based ext3 preallocation
[not found] ` <20040321015746.14b3c0dc.akpm@osdl.org>
@ 2004-03-30 8:55 ` Mingming Cao
2004-03-30 9:45 ` Andrew Morton
0 siblings, 1 reply; 28+ messages in thread
From: Mingming Cao @ 2004-03-30 8:55 UTC (permalink / raw)
To: Andrew Morton, tytso; +Cc: Badari Pulavarty, ext2-devel, linux-kernel, cmm
[-- Attachment #1: Type: text/plain, Size: 5778 bytes --]
Hi, Andrew, Ted, All,
Ext3 preallocation is currently missing. This is the first cut of the
prototype for the reservation based ext3 preallocation based on the
ideas suggested by Andrew and Ted. The implementation is incomplete, but
I want to hear your valuable opinion about the current design.
What I have done in this version of the prototype:
1) basic reservation structure and operations
2) reservation based ext3 block allocation
3) reservation window allocation
4) block allocation when fs reservation is turned off
For 1) Use a sorted doubly linked list for the per-filesystem
reservation list, like vm_region does. The operations on the doubly
linked list are abstracted, so later, if necessary, we could easily
replace it with a more sophisticated tree.
Each inode has a reservation structure inside its ext3_inode_info
structure. Each reservation structure contains (start, end, list_head,
goal_window_size).
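The structure in 1) can be sketched in user space roughly as follows. This is only an illustration under stated assumptions, not the code in the attached patch: struct rsv_window, rsv_list_init(), rsv_insert_sorted() and rsv_remove() are hypothetical names, and a hand-rolled circular list stands in for the kernel's list_head.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the per-inode reservation window. */
struct rsv_window {
    struct rsv_window *prev, *next; /* links in the per-filesystem list */
    uint32_t start;                 /* first reserved block (fs-wide number) */
    uint32_t end;                   /* last reserved block, inclusive */
    int goal_size;                  /* preferred window size in blocks */
};

/* Initialise the sentinel head of the circular per-filesystem list. */
static void rsv_list_init(struct rsv_window *head)
{
    head->prev = head->next = head;
    head->start = head->end = 0;
    head->goal_size = 0;
}

/*
 * Insert w so that the list stays sorted by start block.
 * Returns 0 on success, -1 if w would overlap an existing window.
 */
static int rsv_insert_sorted(struct rsv_window *head, struct rsv_window *w)
{
    struct rsv_window *pos = head->next;

    assert(w->start <= w->end);
    /* Skip every window that lies entirely before w. */
    while (pos != head && pos->end < w->start)
        pos = pos->next;
    /* pos is now the first window ending at or after w->start, or head. */
    if (pos != head && pos->start <= w->end)
        return -1;                  /* overlap with an existing window */
    w->prev = pos->prev;
    w->next = pos;
    pos->prev->next = w;
    pos->prev = w;
    return 0;
}

/* Unlink w and mark it empty (end == 0), mirroring the convention above. */
static void rsv_remove(struct rsv_window *w)
{
    w->prev->next = w->next;
    w->next->prev = w->prev;
    w->prev = w->next = w;
    w->start = w->end = 0;
}
```

Keeping the list sorted is what lets a later search walk only the windows near a goal block instead of the whole filesystem's list.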
For 2) The basic idea is: when we try to allocate a new block for an
inode that has a reservation window, we try to allocate from
that window.
If the inode does not have a reservation window, we allocate a block and
make a reservation window for it. Instead of doing the block allocation
first and the reservation window allocation second, we make the
reservation window first, then allocate a block within the window. The
new reservation window has at least one free block and does not overlap
with other reservation windows. This way we avoid looking up the
reservation list again and again whenever we find a free bit in the bitmap
and are not sure whether it belongs to anybody's reservation window.
For 3) To allocate a new reservation window, we search the part of the
filesystem reservation list that falls into the group from which we are
trying to allocate a block. A goal block guides where we
want the new reservation window to start. If we already have an old
reservation, we discard it first, then search the part of the list
after the old reservation window. Otherwise the sub-list starts from the
beginning of the group. The new reservation window could cross the group
boundary. The reservation window contains at least one free block.
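The search in 3) can be sketched in user space as below. This is a simplified illustration, not the patch's find_next_reservable_window(): a sorted array stands in for the linked list, struct win and find_reservable_gap() are hypothetical names, and the bitmap check for a free block is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reserved range, inclusive on both ends. */
struct win { uint32_t start, end; };

/*
 * Find the first unreserved gap of `size` blocks starting at or after
 * `goal`. `wins` must be sorted by start and non-overlapping. The gap must
 * start at or before `last_block`, but it may extend past it, so a window
 * is allowed to cross the group boundary.
 * Returns 0 and stores the gap start in *out_start, or -1 on failure.
 */
static int find_reservable_gap(const struct win *wins, size_t n,
                               uint32_t goal, uint32_t size,
                               uint32_t last_block, uint32_t *out_start)
{
    uint32_t cur = goal;
    size_t i;

    assert(out_start != NULL);
    if (cur > last_block)
        return -1;
    for (i = 0; i < n; i++) {
        if (wins[i].end < cur)
            continue;               /* window lies entirely before cur */
        if ((uint64_t)cur + size <= wins[i].start)
            break;                  /* gap before this window is big enough */
        cur = wins[i].end + 1;      /* skip past this window */
        if (cur > last_block)
            return -1;              /* next candidate starts out of range */
    }
    *out_start = cur;
    return 0;
}
```

In the real allocator the caller would still have to confirm that the gap holds at least one free block in the bitmap, and retry from the first free block if it does not.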
For 4) If the filesystem has reservation turned off, the code path
for new block allocation is the same as the current code: just call
ext3_try_to_allocate() with a NULL reservation window pointer.
The above logic has been verified with a user-level simulation program.
The attached prototype patch (against the 2.6.4 kernel) compiles and boots. I
have done initial testing of the patch on a 2-way PIII 700MHz box.
Below is the debugfs output after running a simple test. The test has 8
threads, each sequentially writing 20M to a different file at the same time, in
the same directory, on a freshly created ext3 filesystem. Basically, after
applying the patch, the filesystem is much less fragmented:
before (2.6.4 kernel):
Inode: 12 Type: regular Mode: 0644 Flags: 0x0 Generation: 3375236196
.....
BLOCKS:
(0):8716, (1):8718, (2):8720, (3):8722, (4):8724, (5):8726, (6):8728,
(7):8730, (8):8732, (9):8734, (10):8736, (11):8738, (IND):8741,
(12):8742, (13):8744, (14):8746, (15):8748, (16):8750, (17):8752,
(18):8754, (19):8756, (20):8758, (21):8760, (22):8762, (23):8764,
(24):8766, (25):8768, (26):8770, (27):8772, (28):8774, (29):8776,
(30):8778, (31):8780, (32):8782, (33):8784, (34):8786, (35):8788,
(36):8790, (37):8792, (38):8794, (39):8796, (40):8798, (41):8800,
(42):8802, (43):8804, (44):8806, (45):8808, (46):8810, (47):8812,
(48):8814, (49):8816, (50):8818, (51):8820, (52):8822, (53):8824,
(54):8826, (55):8828, (56):8830, (57):8832, (58):8834, (59):8836,
(60):8838, (61):8840, (62):8842, (63):8844, (64):8846, (65):8848,
(66):8850, (67):8852, (68):8854, (69):8856, (70):8858, (71):8860,
(72):8862, (73):8864, (74):8866, (75):8868, (76):8870, (77):8872,
(78):8874, (79):8876, (80):8878, (81):8880,
......
......
......
after applying the patch (reservation window size is 128 blocks):
Inode: 15 Type: regular Mode: 0644 Flags: 0x0 Generation: 2351221293
......
BLOCKS:
(0-11):24576-24587, (IND):24588, (12-1035):24592-25615, (DIND):25616,
(IND):25624, (1036-2027):25632-26623, (2028-2059):37116-37147,
(IND):37148, (2060-2151):37152-37243, (2152-2279):37372-37499,
(2280-2407):37756-37883, (2408-2535):38012-38139,
(2536-2663):38268-38395, (2664-2791):38524-38651,
(2792-2919):38780-38907, (2920-3083):43132-43295, (IND):43296,
(3084-3167):43304-43387, (3168-3551):43516-43899,
(3552-3679):44028-44155, (3680-3807):44284-44411,
(3808-3935):44540-44667, (3936-4063):44924-45051,
(4064-4107):45180-45223, (IND):45224, (4108-4183):45232-45307,
(4184-4567):45436-45819, (4568-4695):45948-46075, (4696-4823):46204-46331,
(4824-4828):46875-46879, (4829-4956):48380-48507,
(4957-4999):48636-48678
TOTAL: 5006
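One rough way to put a number on the difference between the two listings is to count runs of physically consecutive blocks. The helper below is a hypothetical sketch (count_extents() is not part of the patch or of debugfs) and it ignores indirect blocks; it only looks at a file's data blocks in logical order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count runs of physically consecutive blocks in a file's block list,
 * given in logical order. Fewer runs means less fragmentation.
 */
static size_t count_extents(const uint32_t *blocks, size_t n)
{
    size_t extents = 0, i;

    assert(blocks != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (i == 0 || blocks[i] != blocks[i - 1] + 1)
            extents++;              /* a new run starts here */
    return extents;
}
```

Fed the first four data blocks of the "before" listing (8716, 8718, 8720, 8722) this reports four runs, one per block, while the first four blocks of the "after" listing (24576 to 24579) form a single run.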
Things to do:
1) Dynamically increase the reservation window for individual files.
2) Prevent bogus early ENOSPC errors when the filesystem is fully
reserved.
3) Preserve the reservation window across file close for files that
are frequently appended to.
4) Experiment with other tree structures to replace the sorted doubly
linked list for the reservation tree/list if necessary.
Before working on the above todos, I would like to hear your valuable
comments, suggestions, and ideas on the current design of this reservation based
ext3 preallocation. The patch is attached and is against the 2.6.4 kernel.
Thanks!
Mingming
diffstat ext3_reservation-7.diff
fs/ext3/balloc.c | 485 +++++++++++++++++++++++++++++++++++++++++++--
fs/ext3/ialloc.c | 6
fs/ext3/inode.c | 9
fs/ext3/super.c | 7
include/linux/ext3_fs.h | 4
include/linux/ext3_fs_i.h | 12 +
include/linux/ext3_fs_sb.h | 4
7 files changed, 511 insertions(+), 16 deletions(-)
[-- Attachment #2: ext3_reservation-7.diff --]
[-- Type: text/x-patch, Size: 23988 bytes --]
diff -urNp linux-2.6.4/fs/ext3/balloc.c 264-rsv-no_debug/fs/ext3/balloc.c
--- linux-2.6.4/fs/ext3/balloc.c 2004-03-10 18:55:21.000000000 -0800
+++ 264-rsv-no_debug/fs/ext3/balloc.c 2004-03-30 07:29:58.559179056 -0800
@@ -96,6 +96,332 @@ read_block_bitmap(struct super_block *sb
error_out:
return bh;
}
+/*
+ * The reservation window structure operations
+ * --------------------------------------------
+ * Operations include:
+ * dump, find, add, remove, is_empty, find_next_reservable_window, etc.
+ *
+ * We use sorted double linked list for the per-filesystem reservation
+ * window list. (like in vm_region).
+ *
+ * Initially, we keep those small operations in the abstract functions,
+ * so later if we need a better searching tree than double linked-list,
+ * we could easily switch to that without changing too much
+ * code.
+ */
+#ifdef LATER
+static inline void rsv_window_dump(struct reserve_window *head, char *fn)
+{
+ struct reserve_window *rsv;
+ printk("Block Allocation Reservation Windows Map (%s):\n", fn);
+ list_for_each_entry(rsv, &head->rsv_list, rsv_list) {
+ printk("reservation window 0x%x start: %ld, end: %ld(%d)\n",
+ rsv, rsv->rsv_start, rsv->rsv_end,
+ rsv->rsv_end - rsv->rsv_start);
+ }
+}
+#endif
+
+static inline int goal_in_my_reservation(struct reserve_window *rsv, int goal)
+{
+ return ((goal >= rsv->rsv_start) && (goal <= rsv->rsv_end));
+}
+
+/*
+ * find if the given block is within any reservation on the list
+ */
+static inline int rsv_window_find(struct reserve_window *start, int block, int last_block)
+{
+ struct reserve_window *rsv;
+
+ list_for_each_entry(rsv, &start->rsv_list, rsv_list) {
+ if (goal_in_my_reservation(rsv, block))
+ return 1; /* found it*/
+ if (rsv->rsv_start > last_block)
+ return 0;
+ }
+ return 0;
+}
+
+static inline void rsv_window_add(struct reserve_window *rsv, struct reserve_window *prev)
+{
+ /* insert the new reservation window after prev */
+ list_add(&rsv->rsv_list, &prev->rsv_list);
+}
+
+static inline void rsv_window_remove(struct reserve_window *rsv)
+{
+ rsv->rsv_start = 0;
+ rsv->rsv_end = 0;
+ list_del(&rsv->rsv_list);
+}
+static inline int rsv_is_empty(struct reserve_window *rsv)
+{
+ /* a valid reservation end block could not be 0 */
+ return (rsv->rsv_end == 0);
+}
+
+void ext3_discard_reservation(struct inode *inode)
+{
+ struct ext3_inode_info *ei = EXT3_I(inode);
+ struct reserve_window *rsv = &ei->i_rsv_window;
+ spinlock_t * rsv_lock = &EXT3_SB(inode->i_sb)->s_rsv_window_lock;
+
+ if (!rsv_is_empty(rsv)) {
+ spin_lock(rsv_lock);
+ rsv_window_remove(rsv);
+ spin_unlock(rsv_lock);
+ }
+}
+/**
+ * find_next_reservable_window():
+ * find a reservable space within the given range.
+ * It does not allocate the reservation window for now;
+ * alloc_new_reservation() will do the work later.
+ *
+ * @search_head: the head of the searching list;
+ * This is not necessarily the list head of the whole filesystem.
+ *
+ * We have both head and start_block to assist the search
+ * for the reservable space. The list starts from head,
+ * but we will shift to the place where start_block is,
+ * then, starting from there, we look for a reservable space.
+ * @fs_rsv_head: per-filesystem reservation list head
+ *
+ * @size: the target new reservation window size
+ * @start_block: the first block we consider to start the real search from
+ *
+ * @last_block:
+ * the maximum block number that our goal reservable space
+ * could start from. This is normally the last block in this
+ * group. The search will end when we find that the start of the next
+ * possible reservable space is out of this boundary.
+ * This could handle the cross-boundary reservation window request.
+ *
+ * Basically we search the given range (start_block, last_block),
+ * rather than the whole reservation double linked list,
+ * to find a free region that is of my size and has not
+ * been reserved.
+ *
+ * On success, it returns the reservation window to append after.
+ * On failure, it returns NULL.
+ */
+static inline
+struct reserve_window* find_next_reservable_window(
+ struct reserve_window *search_head,
+ struct reserve_window *fs_rsv_head,
+ int size, int *start_block, int last_block)
+{
+ struct reserve_window *rsv;
+ int cur;
+
+ /* TODO: make the start of the reservation window byte aligned */
+ cur = *start_block;
+ rsv = list_entry(search_head->rsv_list.next, struct reserve_window, rsv_list);
+ while (rsv != fs_rsv_head) {
+ if (cur + size <= rsv->rsv_start) {
+ /*
+ * found a reservable space big enough
+ * we could have a reservation cross
+ * the group boundary here
+ */
+ goto found;
+ }
+ if (cur <= rsv->rsv_end)
+ cur = rsv->rsv_end + 1;
+
+ /* TODO?
+ * in the case we could not find a reservable space
+ * of the expected size, during the search, we could
+ * remember the largest reservable space we have seen
+ * and return that instead.
+ *
+ * for now it will fail if we could not find the reservable
+ * space with expected-size (or more)...
+ */
+ rsv = list_entry(rsv->rsv_list.next, struct reserve_window, rsv_list);
+ if (cur > last_block)
+ goto out;
+ }
+found:
+ /*
+ * we come here either:
+ * when we reach the end of the whole list,
+ * and there is empty reservable space after last entry in the list.
+ * append it to the end of the list.
+ *
+ * or we found one reservable space in the middle of the list,
+ * return the reservation window that we could append to.
+ * succeed.
+ */
+ *start_block = cur;
+ return list_entry(rsv->rsv_list.prev, struct reserve_window, rsv_list);
+out:
+ return NULL; /* failed */
+}
+
+/**
+ * alloc_new_reservation()--allocate a new reservation window
+ * if there is an existing reservation, discard it first
+ * then allocate the new one from there
+ * otherwise allocate the new reservation from the given
+ * start block, or the beginning of the group, if a goal
+ * is not given.
+ *
+ * To make a new reservation, we search part of the filesystem
+ * reservation list(the list that inside the group).
+ *
+ * If we have an old reservation, the search goal is the end of
+ * the last reservation. If we do not have an old reservation, then we
+ * start from a given goal, or the first block of the group, if
+ * the goal is not given.
+ *
+ * We first find a reservable space after the goal, then from
+ * there, we check the bitmap for the first free block after
+ * it. If there is no free block until the end of group, then the
+ * whole group is full, and we fail. Otherwise, check if the free block
+ * is inside the expected reservable space; if so, we succeed.
+ * If the first free block is outside the reservable space, then
+ * starting from the first free block, we search for the next available
+ * space, and go on.
+ *
+ * On success, a new reservation will be found and inserted into the list.
+ * It contains at least one free block, and it does not overlap with other
+ * reservation windows.
+ *
+ * On failure, we failed to find a reservation window in this group.
+ *
+ * @my_rsv: the reservation
+ *
+ * @start_block: The goal. It is where the search for a
+ * free reservable space should start from.
+ * If we have an old reservation, start_block is the end of
+ * the old reservation. Otherwise,
+ * if we have a goal (start_block > 0), then start from there;
+ * with no goal (start_block = -1), we start from the first block
+ * of the group.
+ *
+ * @sb: the super block
+ * @group: the group we are trying to do allocate in
+ * @bitmap_bh: the block group block bitmap
+ */
+static int alloc_new_reservation(struct reserve_window *my_rsv, int start_block,
+ struct super_block *sb, int group, struct buffer_head *bitmap_bh)
+{
+ struct reserve_window *search_head;
+ int group_first_block, group_end_block, first_free_block;
+ int reservable_space_start;
+ struct reserve_window *prev_rsv;
+ struct reserve_window *fs_rsv_head = &EXT3_SB(sb)->s_rsv_window_head;
+ int size;
+
+ group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
+ group * EXT3_BLOCKS_PER_GROUP(sb);
+ group_end_block = group_first_block + EXT3_BLOCKS_PER_GROUP(sb) - 1;
+
+ if (start_block < 0)
+ start_block = 0;
+ start_block += group_first_block;
+
+ /* if we have an old reservation, discard it first */
+ if (!rsv_is_empty(my_rsv)) {
+ /*
+ * if the old reservation is cross group boundary
+ * we will come here when we just failed to allocate from
+ * the first part of the window. We still have another part
+ * that belongs to the next group. In this case, there is no
+ * point to discard our window and try to allocate a new one
+ * in this group(which will fail). we should
+ * keep the reservation window, just simply move on.
+ *
+ * Maybe we could shift the start block of the reservation window
+ * to the first block of next group...
+ */
+ if (my_rsv->rsv_end >= group_end_block)
+ return -1;
+
+ /* remember where we are before we discard the old one */
+ if (my_rsv->rsv_end + 1 > start_block)
+ start_block = my_rsv->rsv_end + 1;
+ search_head = list_entry(my_rsv->rsv_list.prev,
+ struct reserve_window, rsv_list);
+
+ rsv_window_remove(my_rsv);
+ }
+ else {
+ /*
+ * we don't have a reservation,
+ * we set our goal(start_block) and
+ * the list head for the search
+ */
+ search_head = fs_rsv_head;
+ }
+
+ /*
+ * find_next_reservable_window() simply find a reservable window
+ * inside the given range(start_block, group_end_block).
+ *
+ * To make sure the reservation window has a free bit inside it, we need
+ * to check the bitmap after we found a reservable window.
+ */
+ size = my_rsv->rsv_goal_size;
+retry:
+ prev_rsv = find_next_reservable_window(search_head, fs_rsv_head, size,
+ &start_block, group_end_block);
+ if (prev_rsv == NULL)
+ goto failed;
+
+ reservable_space_start = start_block;
+ /*
+ * on succeed, find_next_reservable_window() returns the
+ * reservation window where there is a reservable space after it.
+ * Before we reserve this reservable space, we need
+ * to make sure there is at least a free block inside this region.
+ *
+ * searching the first free bit on the block bitmap, start from
+ * the start block of the reservable space we just found.
+ */
+ first_free_block = ext3_find_next_zero_bit(bitmap_bh->b_data,
+ group_end_block - group_first_block + 1,
+ reservable_space_start - group_first_block);
+ if (first_free_block > group_end_block - group_first_block)
+ /*
+ * no free block left on the bitmap, no point
+ * to reserve the space. return failed.
+ */
+ goto failed;
+
+ start_block = first_free_block + group_first_block;
+ /*
+ * check if the first free block is within the
+ * free space we just found
+ */
+ if ((start_block >= reservable_space_start) &&
+ (start_block < reservable_space_start + size))
+ goto found_rsv_window;
+ /*
+ * if the first free bit we found is out of the reservable space
+ * this means there is no free block on the reservable space
+ * we should continue search for next reservable space,
+ * start from where the free block is,
+ * we also shift the list head to where we stopped last time
+ */
+ search_head = prev_rsv;
+ goto retry;
+
+found_rsv_window:
+ /*
+ * great! the reservable space contains some free blocks.
+ * Insert it to the list.
+ */
+ rsv_window_add(my_rsv, prev_rsv);
+ my_rsv->rsv_start = reservable_space_start;
+ my_rsv->rsv_end = my_rsv->rsv_start + size - 1;
+ return 0; /* succeed */
+failed:
+ return -1; /* failed */
+}
/* Free given blocks, update quota and i_blocks field */
void ext3_free_blocks (handle_t *handle, struct inode * inode,
@@ -407,11 +733,12 @@ claim_block(spinlock_t *lock, int block,
*/
static int
ext3_try_to_allocate(struct super_block *sb, handle_t *handle, int group,
- struct buffer_head *bitmap_bh, int goal, int *errp)
+ struct buffer_head *bitmap_bh, int goal,
+ struct reserve_window * my_rsv, int *errp)
{
- int i;
int fatal;
int credits = 0;
+ int group_first_block, start, end;
*errp = 0;
@@ -426,26 +753,49 @@ ext3_try_to_allocate(struct super_block
*errp = fatal;
goto fail;
}
+ /* if EXT3_RESERVATION */
+ if (my_rsv) {
+ group_first_block =
+ le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
+ group * EXT3_BLOCKS_PER_GROUP(sb);
+ if (my_rsv->rsv_start >= group_first_block)
+ start = my_rsv->rsv_start - group_first_block;
+ else
+ /* reservation window cross group boundary */
+ start = 0;
+ end = my_rsv->rsv_end - group_first_block;
+ if (end > EXT3_BLOCKS_PER_GROUP(sb))
+ /* reservation window cross group boundary */
+ end = EXT3_BLOCKS_PER_GROUP(sb);
+ }
+ else {
+ if (goal > 0)
+ start = goal;
+ else
+ start = 0;
+ end = EXT3_BLOCKS_PER_GROUP(sb);
+ }
repeat:
if (goal < 0 || !ext3_test_allocatable(goal, bitmap_bh)) {
- goal = find_next_usable_block(goal, bitmap_bh,
- EXT3_BLOCKS_PER_GROUP(sb));
+ goal = find_next_usable_block(start, bitmap_bh, end);
if (goal < 0)
goto fail_access;
- for (i = 0; i < 7 && goal > 0 &&
+ /*for (i = 0; i < 7 && goal > 0 &&
ext3_test_allocatable(goal - 1, bitmap_bh);
i++, goal--);
+ */
}
+ start = goal;
if (!claim_block(sb_bgl_lock(EXT3_SB(sb), group), goal, bitmap_bh)) {
/*
* The block was allocated by another thread, or it was
* allocated and then freed by another thread
*/
- goal++;
- if (goal >= EXT3_BLOCKS_PER_GROUP(sb))
+ start++;
+ if (start >= end)
goto fail_access;
goto repeat;
}
@@ -466,6 +816,118 @@ fail:
}
/*
+ * This is the main function used to allocate a new block and
+ * its reservation window.
+ * Each time a new block allocation is needed, first try to allocate
+ * from the inode's own reservation.
+ * If it does not have a reservation window, instead of looking
+ * for a free bit on bitmap first, then look up the reservation list to see if
+ * it is inside somebody else's reservation window,
+ * we try to allocate a reservation window for it start from the goal first.
+ * Then do the block allocation within the reservation window.
+ *
+ * This will avoid searching the reservation list again and again
+ * when somebody is looking for a free block (without a reservation),
+ * and there are lots of free blocks, but they are all reserved.
+ *
+ * We use a sorted double linked list for the per-filesystem reservation list.
+ * The insert, remove and find a free space(non-reserved) operations for the
+ * sorted double linked list should be fast.
+ *
+ */
+static int
+ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
+ int group, struct buffer_head *bitmap_bh,
+ int goal, struct reserve_window * my_rsv,int *errp)
+{
+ spinlock_t *rsv_lock;
+ int group_first_block;
+ int ret;
+#ifdef EXT3_RESERVATION
+ rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
+ /*
+ * goal is a group relative block number (if there is a goal)
+ * 0 < goal < EXT3_BLOCKS_PER_GROUP(sb)
+ * first block is a filesystem wide block number
+ * first block is the block number of the first block in this group
+ */
+ group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
+ group * EXT3_BLOCKS_PER_GROUP(sb);
+ /*
+ * if we don't have a reservation, get a reservation first
+ * then try to allocation from the reservation
+ */
+ if (rsv_is_empty(my_rsv)) {
+ spin_lock(rsv_lock);
+ ret = alloc_new_reservation(my_rsv, goal, sb, group, bitmap_bh);
+ spin_unlock(rsv_lock);
+ if (ret < 0)
+ /*
+ * alloc_new_reservation() failed when there is
+ * no "free" window for reservation in this group
+ * from the start point.
+ * In that case, we should move onto next group
+ */
+ return -1;
+ if (!goal_in_my_reservation(my_rsv, goal + group_first_block))
+ goal = -1;
+ goto alloc_from_rsv;
+ }
+ /*
+ * we already have a reservation
+ * if we come here with a goal, and the goal is inside the reservation
+ * then try to allocate from the reservation, start from the goal
+ * if we do not have a goal here,
+ * then try to allocate from reservation, start from the beginning
+ */
+ if ((goal < 0 ) || (goal_in_my_reservation(my_rsv, goal+group_first_block)))
+ goto alloc_from_rsv;
+
+ /*
+ * here, we have a reservation, but our goal is outside of the
+ * reservation, we should discard the reservation,
+ * get a new reservation based on the goal,
+ * then try to allocate from the new reservation.
+ *
+ * there is another way to deal with this case,
+ * we could keep the old reservation, give up the goal,
+ * try to allocate from the old reservation
+ * in the normal case, we already allocated all blocks
+ * in our reservation, going back may just waste of time
+ * and also, the goal already indicate that the reservation
+ * window is out-dated. so...
+ */
+ spin_lock(rsv_lock);
+ ret = alloc_new_reservation(my_rsv, goal, sb, group, bitmap_bh);
+ spin_unlock(rsv_lock);
+ if (ret < 0)
+ return -1;
+ goal = -1;
+
+alloc_from_rsv:
+ ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal,
+ my_rsv, errp);
+ if (ret < 0) {
+ spin_lock(rsv_lock);
+ ret = alloc_new_reservation(my_rsv, goal, sb, group, bitmap_bh);
+ spin_unlock(rsv_lock);
+
+ if (ret < 0 )
+ return -1;
+
+ goal = -1;
+ goto alloc_from_rsv;
+ }
+ /*
+ * okay, we successfully allocated a block from my reservation
+ */
+ return ret; /* succeed */
+#else
+ return ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL, errp);
+#endif
+}
+
+/*
* ext3_new_block uses a goal block to assist allocation. If the goal is
* free, or there is a free block within 32 blocks of the goal, that block
* is allocated. Otherwise a forward search is made for a free block; within
@@ -490,6 +952,7 @@ ext3_new_block(handle_t *handle, struct
struct ext3_group_desc *gdp;
struct ext3_super_block *es;
struct ext3_sb_info *sbi;
+ struct reserve_window *my_rsv = &EXT3_I(inode)->i_rsv_window;
#ifdef EXT3FS_DEBUG
static int goal_hits, goal_attempts;
#endif
@@ -540,8 +1003,8 @@ ext3_new_block(handle_t *handle, struct
bitmap_bh = read_block_bitmap(sb, group_no);
if (!bitmap_bh)
goto io_error;
- ret_block = ext3_try_to_allocate(sb, handle, group_no,
- bitmap_bh, ret_block, &fatal);
+ ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
+ bitmap_bh, ret_block, my_rsv, &fatal);
if (fatal)
goto out;
if (ret_block >= 0)
@@ -569,8 +1032,8 @@ ext3_new_block(handle_t *handle, struct
bitmap_bh = read_block_bitmap(sb, group_no);
if (!bitmap_bh)
goto io_error;
- ret_block = ext3_try_to_allocate(sb, handle, group_no,
- bitmap_bh, -1, &fatal);
+ ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
+ bitmap_bh, -1, my_rsv, &fatal);
if (fatal)
goto out;
if (ret_block >= 0)
diff -urNp linux-2.6.4/fs/ext3/ialloc.c 264-rsv-no_debug/fs/ext3/ialloc.c
--- linux-2.6.4/fs/ext3/ialloc.c 2004-03-10 18:55:27.000000000 -0800
+++ 264-rsv-no_debug/fs/ext3/ialloc.c 2004-03-29 19:42:06.664888408 -0800
@@ -585,6 +585,12 @@ got:
ei->i_prealloc_block = 0;
ei->i_prealloc_count = 0;
#endif
+#ifdef EXT3_RESERVATION
+ ei->i_rsv_window.rsv_start = 0;
+ ei->i_rsv_window.rsv_end= 0;
+ ei->i_rsv_window.rsv_goal_size = EXT3_DEFAULT_RESERVE_BLOCKS;
+ INIT_LIST_HEAD(&ei->i_rsv_window.rsv_list);
+#endif
ei->i_block_group = group;
ext3_set_inode_flags(inode);
diff -urNp linux-2.6.4/fs/ext3/inode.c 264-rsv-no_debug/fs/ext3/inode.c
--- linux-2.6.4/fs/ext3/inode.c 2004-03-10 18:55:35.000000000 -0800
+++ 264-rsv-no_debug/fs/ext3/inode.c 2004-03-29 19:42:06.672887192 -0800
@@ -186,7 +186,9 @@ static int ext3_journal_test_restart(han
void ext3_put_inode(struct inode *inode)
{
if (!is_bad_inode(inode))
- ext3_discard_prealloc(inode);
+
+ ext3_discard_reservation(inode);
+ /* ext3_discard_prealloc(inode); */
}
/*
@@ -2137,8 +2139,9 @@ void ext3_truncate(struct inode * inode)
return;
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
return;
-
- ext3_discard_prealloc(inode);
+
+ ext3_discard_reservation(inode);
+ /* ext3_discard_prealloc(inode); */
/*
* We have to lock the EOF page here, because lock_page() nests
diff -urNp linux-2.6.4/fs/ext3/super.c 264-rsv-no_debug/fs/ext3/super.c
--- linux-2.6.4/fs/ext3/super.c 2004-03-10 18:55:44.000000000 -0800
+++ 264-rsv-no_debug/fs/ext3/super.c 2004-03-29 19:42:06.677886432 -0800
@@ -1291,6 +1291,13 @@ static int ext3_fill_super (struct super
sbi->s_gdb_count = db_count;
get_random_bytes(&sbi->s_next_generation, sizeof(u32));
spin_lock_init(&sbi->s_next_gen_lock);
+ /* per filesystem reservation list head & lock */
+ spin_lock_init(&sbi->s_rsv_window_lock);
+ INIT_LIST_HEAD(&sbi->s_rsv_window_head.rsv_list);
+ sbi->s_rsv_window_head.rsv_start = 0;
+ sbi->s_rsv_window_head.rsv_end = 0;
+ sbi->s_rsv_window_head.rsv_goal_size = 0;
+
/*
* set up enough so that it can read an inode
*/
diff -urNp linux-2.6.4/include/linux/ext3_fs.h 264-rsv-no_debug/include/linux/ext3_fs.h
--- linux-2.6.4/include/linux/ext3_fs.h 2004-03-10 18:55:33.000000000 -0800
+++ 264-rsv-no_debug/include/linux/ext3_fs.h 2004-03-30 07:42:26.299505248 -0800
@@ -37,7 +37,8 @@ struct statfs;
*/
#undef EXT3_PREALLOCATE /* @@@ Fix this! */
#define EXT3_DEFAULT_PREALLOC_BLOCKS 8
-
+#define EXT3_RESERVATION
+#define EXT3_DEFAULT_RESERVE_BLOCKS 8
/*
* Always enable hashed directories
*/
@@ -728,6 +729,7 @@ extern void ext3_put_inode (struct inode
extern void ext3_delete_inode (struct inode *);
extern int ext3_sync_inode (handle_t *, struct inode *);
extern void ext3_discard_prealloc (struct inode *);
+extern void ext3_discard_reservation (struct inode *);
extern void ext3_dirty_inode(struct inode *);
extern int ext3_change_inode_journal_flag(struct inode *, int);
extern void ext3_truncate (struct inode *);
diff -urNp linux-2.6.4/include/linux/ext3_fs_i.h 264-rsv-no_debug/include/linux/ext3_fs_i.h
--- linux-2.6.4/include/linux/ext3_fs_i.h 2004-03-10 18:55:21.000000000 -0800
+++ 264-rsv-no_debug/include/linux/ext3_fs_i.h 2004-03-29 19:42:06.681885824 -0800
@@ -18,8 +18,15 @@
#include <linux/rwsem.h>
+struct reserve_window{
+ struct list_head rsv_list;
+ __u32 rsv_start;
+ __u32 rsv_end;
+ int rsv_goal_size;
+};
+
/*
- * second extended file system inode data in memory
+ * third extended file system inode data in memory
*/
struct ext3_inode_info {
__u32 i_data[15];
@@ -61,6 +68,9 @@ struct ext3_inode_info {
__u32 i_prealloc_block;
__u32 i_prealloc_count;
#endif
+ /* block reservation window */
+ struct reserve_window i_rsv_window;
+
__u32 i_dir_start_lookup;
#ifdef CONFIG_EXT3_FS_XATTR
/*
diff -urNp linux-2.6.4/include/linux/ext3_fs_sb.h 264-rsv-no_debug/include/linux/ext3_fs_sb.h
--- linux-2.6.4/include/linux/ext3_fs_sb.h 2004-03-10 18:55:44.000000000 -0800
+++ 264-rsv-no_debug/include/linux/ext3_fs_sb.h 2004-03-29 19:42:06.682885672 -0800
@@ -59,6 +59,10 @@ struct ext3_sb_info {
struct percpu_counter s_dirs_counter;
struct blockgroup_lock s_blockgroup_lock;
+ /* head of the per fs reservation window tree */
+ spinlock_t s_rsv_window_lock;
+ struct reserve_window s_rsv_window_head;
+
/* Journaling */
struct inode * s_journal_inode;
struct journal_s * s_journal;
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC, PATCH] Reservation based ext3 preallocation
2004-03-30 8:55 ` [RFC, PATCH] Reservation based ext3 preallocation Mingming Cao
@ 2004-03-30 9:45 ` Andrew Morton
2004-03-30 17:07 ` Badari Pulavarty
2004-04-03 1:45 ` [Ext2-devel] " Mingming Cao
0 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2004-03-30 9:45 UTC (permalink / raw)
To: Mingming Cao; +Cc: tytso, pbadari, ext2-devel, linux-kernel, cmm
Mingming Cao <cmm@us.ibm.com> wrote:
>
> Ext3 preallocation is currently missing.
I think this is heading the right way.
- Please use u32 for block numbers everywhere. In a number of places you
are using int, and that may go wrong if the block numbers wrap negative
(I'm not sure that ext3 supports 8TB, but it's the right thing to do).
- Using ext3_find_next_zero_bit(bitmap_bh->b_data in
alloc_new_reservation() is risky. There are some circumstances when you
have a huge number of "free" blocks in ->b_data, but they are all unfree
in ->b_committed_data. You could end up with astronomical search
complexity in there. You should search both bitmaps to find a block
which really is allocatable. Otherwise you'll have
ext3_try_to_allocate() failing 20,000 times in succession and much CPU
will be burnt.
- I suspect ext3_try_to_allocate_with_rsv() could be reorganised a bit to
reduce the goto spaghetti?
- Please provide a mount option which enables the feature, defaulting to
"off".
- Make sure that you have a many-small-file test. Say, untar a kernel
tree onto a clean filesystem and make sure that reading all the files in
the tree is nice and fast.
This is to check that the reservation is being discarded appropriately
on file close, and that those small files are contiguous on-disk. If we
accidentally leave gaps in between them the many-small-file bandwidth
takes a dive.
- There's a little program called `bmap' in
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz which
can be used to dump out a file's block allocation map, to check
fragmentation.
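The "search both bitmaps" point above can be sketched in user space as follows. This is only an illustration, not ext3 code: bit_is_set() and find_next_allocatable() are hypothetical names, plain byte arrays stand in for ->b_data and ->b_committed_data, and a NULL committed pointer models a buffer that has no committed copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Test bit nr in a byte-array bitmap, least significant bit first. */
static int bit_is_set(const uint8_t *map, size_t nr)
{
    return (map[nr / 8] >> (nr % 8)) & 1;
}

/*
 * Return the first bit at or after `from` that is clear in `data` and,
 * when a committed copy exists, clear in `committed` as well.
 * Returns nbits if no such bit exists.
 */
static size_t find_next_allocatable(const uint8_t *data,
                                    const uint8_t *committed,
                                    size_t nbits, size_t from)
{
    size_t i;

    assert(data != NULL);
    for (i = from; i < nbits; i++) {
        if (bit_is_set(data, i))
            continue;               /* in use in the current bitmap */
        if (committed && bit_is_set(committed, i))
            continue;               /* free now, but not yet committed */
        return i;
    }
    return nbits;
}
```

Searching this way yields a block that can actually be claimed, so the caller does not loop over blocks that look free in one bitmap but are still unavailable in the other.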
Apart from that, looking good. Where are the benchmarks? ;)
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC, PATCH] Reservation based ext3 preallocation
2004-03-30 9:45 ` Andrew Morton
@ 2004-03-30 17:07 ` Badari Pulavarty
2004-03-30 17:12 ` [Ext2-devel] " Alex Tomas
` (2 more replies)
2004-04-03 1:45 ` [Ext2-devel] " Mingming Cao
1 sibling, 3 replies; 28+ messages in thread
From: Badari Pulavarty @ 2004-03-30 17:07 UTC (permalink / raw)
To: Andrew Morton, Mingming Cao; +Cc: tytso, ext2-devel, linux-kernel, cmm
On Tuesday 30 March 2004 01:45 am, Andrew Morton wrote:
Andrew,
> - Using ext3_find_next_zero_bit(bitmap_bh->b_data in
> alloc_new_reservation() is risky. There are some circumstances when you
> have a huge number of "free" blocks in ->b_data, but they are all unfree
> in ->b_committed_data. You could end up with astronomical search
> complexity in there. You should search both bitmaps to find a block
> which really is allocatable. Otherwise you'll have
> ext3_try_to_allocate() failing 20,000 times in succession and much CPU
> will be burnt.
Can you explain this a little more? What do ->b_data and ->b_committed_data
represent? We are assuming that ->b_data will always be up to date.
Maybe we should use ext3_test_allocatable() as well.
Mingming, what was the reason for using only ext3_find_next_zero_bit()?
We had this discussion earlier, but I forgot :(
> - I suspect ext3_try_to_allocate_with_rsv() could be reorganised a bit to
> reduce the goto spaghetti?
will do :)
>
> - Please provide a mount option which enables the feature, defaulting to
> "off".
Sure.
>
> - Make sure that you have a many-small-file test. Say, untar a kernel
> tree onto a clean filesystem and make sure that reading all the files in
> the tree is nice and fast.
>
> This is to check that the reservation is being discarded appropriately
> on file close, and that those small files are contiguous on-disk. If we
> accidentally leave gaps in between them the many-small-file bandwidth
> takes a dive.
Hmm. Ted proposed that we should keep the reservation after file close.
We weren't sure about this either. It's on our TODO list.
>
> - There's a little program called `bmap' in
> http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz which
> can be used to dump out a file's block allocation map, to check
> fragmentation.
Thanks, will use that. We are using debugfs for now. Do you have any tools
to dump out what's in the journal? I want to understand the log format etc.
Just curious.
>
> Apart from that, looking good. Where are the benchmarks? ;)
We are first concentrating on the tiobench regression. We see a clear
degradation with tiobench on ext3, since it creates lots of files in the
same directory. Once we are happy with tiobench, we will move on to the
others: kernel untars, rawiobench, etc.
Thanks,
Badari
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation
2004-03-30 17:07 ` Badari Pulavarty
@ 2004-03-30 17:12 ` Alex Tomas
2004-03-30 18:07 ` Badari Pulavarty
2004-03-30 18:23 ` Mingming Cao
2004-03-30 18:36 ` Andrew Morton
2 siblings, 1 reply; 28+ messages in thread
From: Alex Tomas @ 2004-03-30 17:12 UTC (permalink / raw)
To: Badari Pulavarty
Cc: Andrew Morton, Mingming Cao, tytso, ext2-devel, linux-kernel
On Tue, 2004-03-30 at 21:07, Badari Pulavarty wrote:
> Can you explain this a little more? What do ->b_data and ->b_committed_data
> represent? We are assuming that ->b_data will always be up to date.
>
b_data holds the current information about used/free blocks.
b_committed_data is the last-committed copy of the bitmap; it still
marks as in use the blocks that were freed during the current
transaction. These blocks must not be allocated. There is a good
note about this just before ext3_test_allocatable() in balloc.c.
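A minimal userspace sketch of that rule, for illustration only. It is not
the kernel code: block_allocatable() and bit_is_set() are made-up names,
and plain byte arrays stand in for the two bitmap buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Return nonzero if bit nr is set (block in use) in the bitmap. */
static int bit_is_set(const unsigned char *map, unsigned int nr)
{
	return (map[nr / 8] >> (nr % 8)) & 1;
}

/*
 * A block may be handed out only if it is free in the live bitmap AND
 * free in the last-committed copy. committed may be NULL when no such
 * copy exists, in which case only the live bitmap is consulted.
 */
static int block_allocatable(const unsigned char *data,
			     const unsigned char *committed,
			     unsigned int nr)
{
	assert(data != NULL);
	if (bit_is_set(data, nr))
		return 0;	/* in use right now */
	if (committed && bit_is_set(committed, nr))
		return 0;	/* freed, but the free is not committed yet */
	return 1;
}
```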
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation
2004-03-30 17:12 ` [Ext2-devel] " Alex Tomas
@ 2004-03-30 18:07 ` Badari Pulavarty
0 siblings, 0 replies; 28+ messages in thread
From: Badari Pulavarty @ 2004-03-30 18:07 UTC (permalink / raw)
To: alex; +Cc: Andrew Morton, Mingming Cao, tytso, ext2-devel, linux-kernel
On Tuesday 30 March 2004 09:12 am, Alex Tomas wrote:
> On Tue, 2004-03-30 at 21:07, Badari Pulavarty wrote:
> > Can you explain this a little more? What do ->b_data and
> > ->b_committed_data represent? We are assuming that ->b_data will always
> > be up to date.
>
> b_data holds the current information about used/free blocks.
> b_committed_data is the last-committed copy of the bitmap; it still
> marks as in use the blocks that were freed during the current
> transaction. These blocks must not be allocated. There is a good
> note about this just before ext3_test_allocatable() in balloc.c.
Yes. I read the note after sending the mail.
Thanks,
Badari
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation
2004-03-30 17:07 ` Badari Pulavarty
2004-03-30 17:12 ` [Ext2-devel] " Alex Tomas
@ 2004-03-30 18:23 ` Mingming Cao
2004-03-30 18:36 ` Andrew Morton
2 siblings, 0 replies; 28+ messages in thread
From: Mingming Cao @ 2004-03-30 18:23 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: Andrew Morton, tytso, ext2-devel, linux-kernel
On Tue, 2004-03-30 at 09:07, Badari Pulavarty wrote:
> On Tuesday 30 March 2004 01:45 am, Andrew Morton wrote:
>
> > I think this is heading the right way.
Andrew, thanks for your comments and for responding so quickly! I will make
the changes you suggested.
> Andrew,
>
> > - Using ext3_find_next_zero_bit(bitmap_bh->b_data in
> > alloc_new_reservation() is risky.
> Can you explain this a little more? What do ->b_data and ->b_committed_data
> represent? We are assuming that ->b_data will always be up to date.
>
> Maybe we should use ext3_test_allocatable() as well.
> Mingming, what was the reason for using only ext3_find_next_zero_bit()?
> We had this discussion earlier, but I forgot :(
I thought that just using ext3_find_next_zero_bit() would probably be okay
since, once we get a reservation window that has a possibly free block,
ext3_try_to_allocate() will check both the block group bitmap and the
copy of the last committed bitmap inside that window range anyway before
doing the real allocation. If no block is really free in both bitmaps,
ext3_try_to_allocate() will fail and we will look for a new reservation
window.
But as Andrew said, this may cause ext3_try_to_allocate() to be called
unnecessarily many, many times...
We could do the same thing as in find_next_usable_block():
	/*
	 * The bitmap search --- search forward alternately through the actual
	 * bitmap and the last-committed copy until we find a bit free in
	 * both
	 */
	while (here < maxblocks) {
		next = ext3_find_next_zero_bit(bh->b_data, maxblocks, here);
		if (next >= maxblocks)
			return -1;
		if (ext3_test_allocatable(next, bh))
			return next;
		jbd_lock_bh_state(bh);
		if (jh->b_committed_data)
			here = ext3_find_next_zero_bit(jh->b_committed_data,
					maxblocks, next);
		jbd_unlock_bh_state(bh);
	}
Maybe make this an inline function... ext2 does not need to care about
the journalling stuff.
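To see why the alternating search avoids repeated failed allocation
attempts, here is a self-contained userspace model of the same loop. It
is only a sketch: byte arrays replace the buffer heads, there is no
locking, and search_usable() and find_next_zero() are made-up names.

```c
#include <assert.h>
#include <stddef.h>

/* Return nonzero if bit nr is set (block in use) in the bitmap. */
static int bit_is_set(const unsigned char *map, long nr)
{
	return (map[nr / 8] >> (nr % 8)) & 1;
}

/* First zero bit in [start, max), or max if there is none. */
static long find_next_zero(const unsigned char *map, long max, long start)
{
	long i;

	for (i = start; i < max; i++)
		if (!bit_is_set(map, i))
			return i;
	return max;
}

/*
 * Return the first bit at or after start that is free in both bitmaps,
 * or -1 if there is none. committed may be NULL (no committed copy).
 */
static long search_usable(const unsigned char *data,
			  const unsigned char *committed,
			  long max, long start)
{
	long here = start, next;

	assert(data != NULL && start >= 0);
	while (here < max) {
		next = find_next_zero(data, max, here);
		if (next >= max)
			return -1;
		if (!committed || !bit_is_set(committed, next))
			return next;
		/*
		 * Free in the live bitmap but still in use in the committed
		 * copy: jump straight to the next free bit in the committed
		 * copy instead of stepping one bit at a time.
		 */
		here = find_next_zero(committed, max, next);
	}
	return -1;
}
```

Each pass moves "here" strictly forward, so a long run of blocks that are
free only in the live bitmap is skipped in one step.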
> > - Make sure that you have a many-small-file test. Say, untar a kernel
> > tree onto a clean filesystem and make sure that reading all the files in
> > the tree is nice and fast.
Haven't got a chance to verify that but good point. Will do.
> >
> > This is to check that the reservation is being discarded appropriately
> > on file close, and that those small files are contiguous on-disk. If we
> > accidentally leave gaps in between them the many-small-file bandwidth
> > takes a dive.
>
> Hmm. Ted proposed that we should keep the reservation after file close.
> We weren't sure about this either. It's on our TODO list.
Only some files need to keep their reservation across file open/close:
those opened with the append flag, or files like the ones in /var/log
that multiple processes frequently open/write/close. Maybe a file attribute
or an open flag could be used for this purpose. For regular files, I
think the reservation should be discarded at close. Untarring a kernel
tree just opens files for write, then writes and closes them.
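As a rough sketch of that policy (nothing like this is in the patch yet):
keep_reservation_on_close() is a hypothetical helper, and keep_hint
stands in for the file attribute or open flag suggested above, which
does not exist.

```c
#include <fcntl.h>

/*
 * Hypothetical policy helper: decide at close time whether the file's
 * reservation window should survive. open_flags are the flags the file
 * was opened with; keep_hint models a per-file "keep reservation"
 * attribute that is only an idea at this point.
 */
static int keep_reservation_on_close(int open_flags, int keep_hint)
{
	if (keep_hint)
		return 1;	/* explicitly requested, e.g. a shared log file */
	/* append-mode writers are likely to come back and extend the file */
	return (open_flags & O_APPEND) != 0;
}
```

With this rule, the files created by a kernel untar (opened for write
without O_APPEND) would drop their reservation at close.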
>
> >
> > - There's a little program called `bmap' in
> > http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz which
> > can be used to dump out a file's block allocation map, to check
> > fragmentation.
>
Thanks again.
Mingming
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC, PATCH] Reservation based ext3 preallocation
2004-03-30 17:07 ` Badari Pulavarty
2004-03-30 17:12 ` [Ext2-devel] " Alex Tomas
2004-03-30 18:23 ` Mingming Cao
@ 2004-03-30 18:36 ` Andrew Morton
2 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2004-03-30 18:36 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: cmm, tytso, ext2-devel, linux-kernel
Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> On Tuesday 30 March 2004 01:45 am, Andrew Morton wrote:
>
> Andrew,
>
> > - Using ext3_find_next_zero_bit(bitmap_bh->b_data in
> > alloc_new_reservation() is risky. There are some circumstances when you
> > have a huge number of "free" blocks in ->b_data, but they are all unfree
> > in ->b_committed_data. You could end up with astronomical search
> > complexity in there. You should search both bitmaps to find a block
> > which really is allocatable. Otherwise you'll have
> > ext3_try_to_allocate() failing 20,000 times in succession and much CPU
> > will be burnt.
>
> Can you explain this a little more? What do ->b_data and ->b_committed_data
> represent? We are assuming that ->b_data will always be up to date.
The comment Alex pointed to is splendid ;)
> Maybe we should use ext3_test_allocatable() as well.
I think so.
> > - There's a little program called `bmap' in
> > http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz which
> > can be used to dump out a file's block allocation map, to check
> > fragmentation.
>
> Thanks, will use that. We are using debugfs for now. Do you have any tools
> to dump out what's in the journal? I want to understand the log format etc.
> Just curious.
I cannot think of any. It wouldn't surprise me if e2fsck had a debug mode
which printed out this info, but I have not looked.
> >
> > Apart from that, looking good. Where are the benchmarks? ;)
>
> We are first concentrating on the tiobench regression. We see a clear
> degradation with tiobench on ext3, since it creates lots of files in the
> same directory. Once we are happy with tiobench, we will move on to the
> others: kernel untars, rawiobench, etc.
OK.. dbench on SMP hardware shows poor layout also.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation
2004-03-30 9:45 ` Andrew Morton
2004-03-30 17:07 ` Badari Pulavarty
@ 2004-04-03 1:45 ` Mingming Cao
2004-04-03 1:50 ` Andrew Morton
1 sibling, 1 reply; 28+ messages in thread
From: Mingming Cao @ 2004-04-03 1:45 UTC (permalink / raw)
To: Andrew Morton; +Cc: tytso, pbadari, linux-kernel, cmm, ext2-devel
[-- Attachment #1: Type: text/plain, Size: 4153 bytes --]
Hi Andrew,
Here is the second version of the ext3 reservation patch: mostly bug
fixes, plus the changes you suggested last time.
It's a stable version now. We ran overnight fsx tests on a 2-CPU
PIII 700MHz box and did not see any issues. We also ran tiobench and untar
tests.
> - Please use u32 for block numbers everywhere.
I tried to make the change as you suggested: use u32 for block numbers
everywhere. Then I hit a problem: some functions, especially those doing
block allocation or finding a free block in the bitmap, return the block
number on success and -1 on failure. I found we can't use u32 in those
cases. ext3_new_block() and ext3_free_block() do the same thing. So I
changed block numbers from int to u32 wherever I could, and kept int
where the value also has to signal failure. Hmm... Is there a good way to
do this?
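One possible shape, shown only as a sketch and not what the patch does:
return the status separately and pass the block number back through a
pointer, so the block number can be a full 32-bit value with no reserved
"failure" value. find_free_block() is a made-up name, and uint32_t
stands in for the kernel's u32.

```c
#include <stdint.h>

/*
 * Hypothetical helper: scan a bitmap of nblocks bits for the first free
 * (zero) bit. Returns 0 and stores the bit number in *blockp on
 * success; returns -1 and leaves *blockp untouched if nothing is free.
 */
static int find_free_block(const uint8_t *bitmap, uint32_t nblocks,
			   uint32_t *blockp)
{
	uint32_t i;

	for (i = 0; i < nblocks; i++) {
		if (!((bitmap[i / 8] >> (i % 8)) & 1)) {
			*blockp = i;
			return 0;
		}
	}
	return -1;
}
```

The cost is an extra argument at every call site, which is why it may
only be worth doing for the helpers that currently overload -1.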
> You should search both bitmaps to find a block
> which really is allocatable. Otherwise you'll have
> ext3_try_to_allocate() failing 20,000 times in succession and much CPU
> will be burnt.
Done.
> - I suspect ext3_try_to_allocate_with_rsv() could be reorganised a bit to
> reduce the goto spaghetti?
Re-organized. Merged the conditions together and put them into a single
loop. Now it does not contain any goto :-) See if the code looks clean
now...
> - Please provide a mount option which enables the feature, defaulting to
> "off".
Planning to do that right after this version. Besides enabling/disabling
the reservation feature, I am thinking of letting the same option set
the default reservation window size (in blocks) when the fs is
mounted. Just one single mount option: "prealloc_window=n". When n=0,
reservation is turned off; when n>0, it is on, and the ext3 default
reservation window size for each file is n blocks (or 8 blocks, if
0 < n < 8).
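A small userspace sketch of the intended semantics, assuming the option
string arrives as "prealloc_window=n". parse_prealloc_window() and
RSV_MIN_WINDOW are made-up names; the real mount option parsing in
super.c would look different.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define RSV_MIN_WINDOW 8	/* smallest window size, per the rule above */

/*
 * Returns the reservation window size in blocks (0 means reservation
 * is off), or -1 if the option string is malformed.
 */
static long parse_prealloc_window(const char *opt)
{
	static const char key[] = "prealloc_window=";
	unsigned long n;
	char *end;

	if (strncmp(opt, key, sizeof(key) - 1) != 0)
		return -1;
	opt += sizeof(key) - 1;
	if (*opt < '0' || *opt > '9')
		return -1;	/* no digits, or a sign character */
	n = strtoul(opt, &end, 10);
	if (*end != '\0' || n > (unsigned long)LONG_MAX)
		return -1;	/* trailing junk or out of range */
	if (n == 0)
		return 0;	/* feature turned off */
	return n < RSV_MIN_WINDOW ? RSV_MIN_WINDOW : (long)n;
}
```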
> - Make sure that you have a many-small-file test. Say, untar a kernel
> tree onto a clean filesystem and make sure that reading all the files in
> the tree is nice and fast.
Yes, have done this: untar linux-2.6.4 kernel tree to a clean ext3
filesystem. Verified the reservation is being discarded and the small
files are contiguous on-disk.
> Apart from that, looking good. Where are the benchmarks? ;)
In one simple dd test, it is about 4 times faster with the patch on a
2-way PIII 700MHz box:
# cat /tmp/dd-large.sh
dd if=/dev/zero of=x1 bs=4k count=5000 &
dd if=/dev/zero of=x2 bs=4k count=5000 &
dd if=/dev/zero of=x3 bs=4k count=5000 &
dd if=/dev/zero of=x4 bs=4k count=5000 &
dd if=/dev/zero of=x5 bs=4k count=5000 &
dd if=/dev/zero of=x6 bs=4k count=5000 &
dd if=/dev/zero of=x7 bs=4k count=5000 &
dd if=/dev/zero of=x8 bs=4k count=5000 &
elm3b92:/mnt # time /tmp/dd-large.sh
# time /tmp/dd-large.sh
real 0m0.431s
user 0m0.001s
sys 0m0.009s
linux-2.6.4 kernel + ext3 reservation patch, window size 128 blocks:
# time /tmp/dd-large.sh
real 0m0.098s
user 0m0.001s
sys 0m0.009s
We also did tiobench sequential write tests on ext2, jfs and ext3 with
different numbers of threads (1, 4, 8, 16, 32 and 64); block size is 4k,
file size is 4000k. For ext3, we tried different reservation sizes from 8
to 128 blocks, as well as no reservation at all. The test was done on an
8-CPU PIII 700 i386 machine with 4G memory, on the linux-2.6.4 kernel+patch
and the linux-2.6.4-mm1 kernel.
Attached is a graphic file showing the tiobench results. It shows that
before the patch, the sequential write throughput on ext3 is pretty bad
compared with ext2 and jfs; with the patch, it's much better. And the
block allocation on disk for the files created is much less fragmented.
Planning to do dbench and other regression tests later. Just want to
share the current status with you.
Patch attached below.
Thanks!
Mingming
fs/ext3/balloc.c | 581 +++++++++++++++++++++++++++++++++++++++++----
fs/ext3/file.c | 3
fs/ext3/ialloc.c | 6
fs/ext3/inode.c | 14 -
fs/ext3/super.c | 7
include/linux/ext3_fs.h | 3
include/linux/ext3_fs_i.h | 12
include/linux/ext3_fs_sb.h | 4
8 files changed, 578 insertions(+), 52 deletions(-)
[-- Attachment #2: ext3_reservation_9.diff --]
[-- Type: text/x-patch, Size: 28406 bytes --]
diff -urNp linux-2.6.4/fs/ext3/balloc.c 264-rsv/fs/ext3/balloc.c
--- linux-2.6.4/fs/ext3/balloc.c 2004-03-10 18:55:21.000000000 -0800
+++ 264-rsv/fs/ext3/balloc.c 2004-04-02 02:33:00.327572528 -0800
@@ -96,6 +96,96 @@ read_block_bitmap(struct super_block *sb
error_out:
return bh;
}
+#define EXT3_RESERVATION_DEBUG2
+/*
+ * The reservation window structure operations
+ * --------------------------------------------
+ * Operations include:
+ * dump, find, add, remove, is_empty, find_next_reservable_window, etc.
+ *
+ * We use sorted double linked list for the per-filesystem reservation
+ * window list. (like in vm_region).
+ *
+ * Initially, we keep those small operations in the abstract functions,
+ * so later if we need a better searching tree than double linked-list,
+ * we could easily switch to that without changing too much
+ * code.
+ */
+static inline void rsv_window_dump(struct reserve_window *head, char *fn)
+{
+ struct reserve_window *rsv;
+ printk("Block Allocation Reservation Windows Map (%s):\n", fn);
+ list_for_each_entry(rsv, &head->rsv_list, rsv_list) {
+ printk("reservation window 0x%p start: %d, end: %d\n",
+ rsv, rsv->rsv_start, rsv->rsv_end);
+ }
+}
+
+static inline int goal_in_my_reservation(struct reserve_window *rsv, unsigned long goal)
+{
+ return ((goal >= rsv->rsv_start) && (goal <= rsv->rsv_end));
+}
+
+/*
+ * find if the given block is within any reservation on the list
+ */
+static inline int rsv_window_find(struct reserve_window *start,
+ unsigned long block, unsigned long last_block)
+{
+ struct reserve_window *rsv;
+
+ list_for_each_entry(rsv, &start->rsv_list, rsv_list) {
+ if (goal_in_my_reservation(rsv, block))
+ return 1; /* found it*/
+ if (rsv->rsv_start > last_block)
+ return 0;
+ }
+ return 0;
+}
+
+static inline void rsv_window_add(struct reserve_window *rsv,
+ struct reserve_window *prev)
+{
+ /* insert the new reservation window right after prev */
+ list_add(&rsv->rsv_list, &prev->rsv_list);
+}
+
+static inline void rsv_window_remove(struct reserve_window *rsv)
+{
+#ifdef EXT3_RESERVATION_DEBUG
+ printk(KERN_DEBUG "Reservation Window is removed:"
+ " 0x%p start: %d, end: %d \n", rsv, rsv->rsv_start,
+ rsv->rsv_end);
+#endif
+ rsv->rsv_start = 0;
+ rsv->rsv_end = 0;
+ list_del(&rsv->rsv_list);
+ INIT_LIST_HEAD(&rsv->rsv_list);
+}
+static inline int rsv_is_empty(struct reserve_window *rsv)
+{
+ /* a valid reservation end block could not be 0 */
+ return (rsv->rsv_end == 0);
+}
+
+void ext3_discard_reservation(struct inode *inode)
+{
+ struct ext3_inode_info *ei = EXT3_I(inode);
+ struct reserve_window *rsv = &ei->i_rsv_window;
+ spinlock_t *rsv_lock = &EXT3_SB(inode->i_sb)->s_rsv_window_lock;
+
+#ifdef EXT3_RESERVATION_DEBUG
+ printk(KERN_DEBUG "ext3_discard_reservation:"
+ " 0x%p start: %d, end: %d \n", rsv, rsv->rsv_start,
+ rsv->rsv_end);
+#endif
+
+ if (!rsv_is_empty(rsv)) {
+ spin_lock(rsv_lock);
+ rsv_window_remove(rsv);
+ spin_unlock(rsv_lock);
+ }
+}
/* Free given blocks, update quota and i_blocks field */
void ext3_free_blocks (handle_t *handle, struct inode * inode,
@@ -313,6 +403,33 @@ static inline int ext3_test_allocatable(
return ret;
}
+static inline int
+bitmap_search_next_usable_block(unsigned long start, struct buffer_head *bh,
+ unsigned long maxblocks)
+{
+ unsigned long next;
+ struct journal_head *jh = bh2jh(bh);
+
+ /*
+ * The bitmap search --- search forward alternately through the actual
+ * bitmap and the last-committed copy until we find a bit free in
+ * both
+ */
+ while (start < maxblocks) {
+ next = ext3_find_next_zero_bit(bh->b_data, maxblocks, start);
+ if (next >= maxblocks)
+ return -1;
+ if (ext3_test_allocatable(next, bh))
+ return next;
+ jbd_lock_bh_state(bh);
+ if (jh->b_committed_data)
+ start = ext3_find_next_zero_bit(jh->b_committed_data,
+ maxblocks, next);
+ jbd_unlock_bh_state(bh);
+ }
+ return -1;
+}
+
/*
* Find an allocatable block in a bitmap. We honour both the bitmap and
* its last-committed copy (if that exists), and perform the "most
@@ -325,7 +442,6 @@ find_next_usable_block(int start, struct
{
int here, next;
char *p, *r;
- struct journal_head *jh = bh2jh(bh);
if (start > 0) {
/*
@@ -359,19 +475,8 @@ find_next_usable_block(int start, struct
* bitmap and the last-committed copy until we find a bit free in
* both
*/
- while (here < maxblocks) {
- next = ext3_find_next_zero_bit(bh->b_data, maxblocks, here);
- if (next >= maxblocks)
- return -1;
- if (ext3_test_allocatable(next, bh))
- return next;
- jbd_lock_bh_state(bh);
- if (jh->b_committed_data)
- here = ext3_find_next_zero_bit(jh->b_committed_data,
- maxblocks, next);
- jbd_unlock_bh_state(bh);
- }
- return -1;
+ here = bitmap_search_next_usable_block(here, bh, maxblocks);
+ return here;
}
/*
@@ -407,62 +512,445 @@ claim_block(spinlock_t *lock, int block,
*/
static int
ext3_try_to_allocate(struct super_block *sb, handle_t *handle, int group,
- struct buffer_head *bitmap_bh, int goal, int *errp)
+ struct buffer_head *bitmap_bh, int goal, struct reserve_window * my_rsv)
{
- int i;
- int fatal;
- int credits = 0;
-
- *errp = 0;
+ unsigned long group_first_block, start, end;
- /*
- * Make sure we use undo access for the bitmap, because it is critical
- * that we do the frozen_data COW on bitmap buffers in all cases even
- * if the buffer is in BJ_Forget state in the committing transaction.
- */
- BUFFER_TRACE(bitmap_bh, "get undo access for new block");
- fatal = ext3_journal_get_undo_access(handle, bitmap_bh, &credits);
- if (fatal) {
- *errp = fatal;
- goto fail;
+ /* if EXT3_RESERVATION */
+ if (my_rsv) {
+ group_first_block =
+ le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
+ group * EXT3_BLOCKS_PER_GROUP(sb);
+ if (my_rsv->rsv_start >= group_first_block)
+ start = my_rsv->rsv_start - group_first_block;
+ else
+ /* reservation window cross group boundary */
+ start = 0;
+ end = my_rsv->rsv_end - group_first_block;
+ if (end > EXT3_BLOCKS_PER_GROUP(sb))
+ /* reservation window cross group boundary */
+ end = EXT3_BLOCKS_PER_GROUP(sb);
+ }
+ else {
+ if (goal > 0)
+ start = goal;
+ else
+ start = 0;
+ end = EXT3_BLOCKS_PER_GROUP(sb);
}
repeat:
if (goal < 0 || !ext3_test_allocatable(goal, bitmap_bh)) {
- goal = find_next_usable_block(goal, bitmap_bh,
- EXT3_BLOCKS_PER_GROUP(sb));
+ goal = find_next_usable_block(start, bitmap_bh, end);
if (goal < 0)
goto fail_access;
- for (i = 0; i < 7 && goal > 0 &&
+ /*for (i = 0; i < 7 && goal > 0 &&
ext3_test_allocatable(goal - 1, bitmap_bh);
i++, goal--);
+ */
}
+ start = goal;
if (!claim_block(sb_bgl_lock(EXT3_SB(sb), group), goal, bitmap_bh)) {
/*
* The block was allocated by another thread, or it was
* allocated and then freed by another thread
*/
- goal++;
- if (goal >= EXT3_BLOCKS_PER_GROUP(sb))
+ start++;
+ if (start >= end)
goto fail_access;
goto repeat;
}
- BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for bitmap block");
- fatal = ext3_journal_dirty_metadata(handle, bitmap_bh);
+ return goal;
+fail_access:
+ return -1;
+}
+/**
+ * find_next_reservable_window():
+ * find a reservable space within the given range
+ * It does not allocate the reservation window for now
+ * alloc_new_reservation() will do the work later.
+ *
+ * @search_head: the head of the list to search;
+ * this is not necessarily the list head of the whole filesystem
+ *
+ * we have both head and start_block to assist the search
+ * for the reservable space. The list starts from head,
+ * but we will shift to the place where start_block is,
+ * then, starting from there, we look for a reservable space.
+ *
+ * @fs_rsv_head: per-filesystem reservation list head.
+ *
+ * @size: the target new reservation window size
+ * @start_block: the first block we consider to start
+ * the real search from
+ *
+ * @last_block:
+ * the maximum block number that our goal reservable space
+ * could start from. This is normally the last block in this
+ * group. The search will end when we find that the start of the next
+ * possible reservable space is out of this boundary.
+ * This handles the cross-boundary reservation window request.
+ *
+ * basically we search the given range (start_block, last_block),
+ * rather than the whole reservation double linked list,
+ * to find a free region that is of my size and has not
+ * been reserved.
+ *
+ * on success, it returns the reservation window to append to.
+ * on failure, it returns NULL.
+ */
+static inline
+struct reserve_window* find_next_reservable_window(
+ struct reserve_window *search_head,
+ struct reserve_window *fs_rsv_head,
+ unsigned short size, unsigned long *start_block,
+ unsigned long last_block)
+{
+ struct reserve_window *rsv;
+ unsigned long cur;
+
+ /* TODO: make the start of the reservation window byte aligned */
+ /*cur = *start_block & 8;*/
+ cur = *start_block;
+ rsv = list_entry(search_head->rsv_list.next, struct reserve_window, rsv_list);
+ while (rsv != fs_rsv_head) {
+ if (cur + size <= rsv->rsv_start) {
+ /*
+ * found a reservable space big enough
+ * we could have a reservation cross
+ * the group boundary here
+ */
+ break;
+ }
+ if (cur <= rsv->rsv_end)
+ cur = rsv->rsv_end + 1;
+
+ /* TODO?
+ * in the case we could not find a reservable space
+ * that is what is expected, during the search, we could
+ * remember what's the largest reservable space we could have
+ * and return that on.
+ *
+ * for now it will fail if we could not find the reservable
+ * space with expected-size (or more)...
+ */
+ rsv = list_entry(rsv->rsv_list.next, struct reserve_window, rsv_list);
+ if (cur > last_block)
+ return NULL; /* fail */
+ }
+ /*
+ * we come here either :
+ * when we reach the end of the whole list,
+ * and there is empty reservable space after last entry in the list.
+ * append it to the end of the list.
+ *
+ * or we found one reservable space in the middle of the list,
+ * return the reservation window that we could append to.
+ * succeed.
+ */
+ *start_block = cur;
+ return list_entry(rsv->rsv_list.prev, struct reserve_window, rsv_list);
+}
+
+/**
+ * alloc_new_reservation()--allocate a new reservation window
+ * if there is an existing reservation, discard it first
+ * then allocate the new one from there
+ * otherwise allocate the new reservation from the given
+ * start block, or the beginning of the group, if a goal
+ * is not given.
+ *
+ * To make a new reservation, we search part of the filesystem
+ * reservation list(the list that inside the group).
+ *
+ * If we have an old reservation, the search goal is the end of
+ * the last reservation. If we do not have an old reservation, then we
+ * start from a given goal, or the first block of the group, if
+ * the goal is not given.
+ *
+ * We first find a reservable space after the goal, then from
+ * there, we check the bitmap for the first free block after
+ * it. If there is no free block until the end of group, then the
+ * whole group is full and we fail. Otherwise, check if the free block
+ * is inside the expected reservable space; if so, we succeed.
+ * If the first free block is outside the reservable space, then
+ * starting from the first free block, we search for the next available
+ * space, and go on.
+ *
+ * on success, a new reservation will be found and inserted into the list.
+ * It contains at least one free block, and it does not overlap with other
+ * reservation windows.
+ *
+ * on failure: we failed to find a reservation window in this group
+ *
+ * @rsv: the reservation
+ *
+ * @goal: The goal. It is where the search for a
+ * free reservable space should start from.
+ * if we have a old reservation, start_block is the end of
+ * old reservation. Otherwise,
+ * if we have a goal(goal >0 ), then start from there,
+ * no goal(goal = -1), we start from the first block
+ * of the group.
+ *
+ * @sb: the super block
+ * @group: the group we are trying to do allocate in
+ * @bitmap_bh: the block group block bitmap
+ */
+static int alloc_new_reservation(struct reserve_window *my_rsv,
+ int goal, struct super_block *sb,
+ unsigned int group, struct buffer_head *bitmap_bh)
+{
+ struct reserve_window *search_head;
+ unsigned long group_first_block, group_end_block, start_block;
+ int first_free_block;
+ int reservable_space_start;
+ struct reserve_window *prev_rsv;
+ struct reserve_window *fs_rsv_head = &EXT3_SB(sb)->s_rsv_window_head;
+ unsigned short size;
+
+ group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
+ group * EXT3_BLOCKS_PER_GROUP(sb);
+ group_end_block = group_first_block + EXT3_BLOCKS_PER_GROUP(sb) - 1;
+
+ if (goal < 0)
+ start_block = group_first_block;
+ else
+ start_block = goal + group_first_block;
+
+ /* if we have a old reservation, discard it first */
+ if (!rsv_is_empty(my_rsv)) {
+ /*
+ * if the old reservation is cross group boundary
+ * we will come here when we just failed to allocate from
+ * the first part of the window. We still have another part
+ * that belongs to the next group. In this case, there is no
+ * point to discard our window and try to allocate a new one
+ * in this group(which will fail). we should
+ * keep the reservation window, just simply move on.
+ *
+ * Maybe we could shift the start block of the reservation window
+ * to the first block of next group...
+ */
+ if (my_rsv->rsv_end >= group_end_block)
+ return -1;
+
+ /* remember where we are before we discard the old one */
+ if (my_rsv->rsv_end + 1 > start_block)
+ start_block = my_rsv->rsv_end + 1;
+ search_head = list_entry(my_rsv->rsv_list.prev,
+ struct reserve_window, rsv_list);
+
+#ifdef EXT3_RESERVATION_DEBUG
+ printk(KERN_DEBUG "Reservation Window is removed from alloc_new_"
+ " 0x%p start: %d, end: %d \n", my_rsv, my_rsv->rsv_start,
+ my_rsv->rsv_end);
+#endif
+ rsv_window_remove(my_rsv);
+ }
+ else {
+ /*
+ * we don't have a reservation,
+ * we set our goal(start_block) and
+ * the list head for the search
+ */
+ search_head = fs_rsv_head;
+ }
+
+ /*
+ * find_next_reservable_window() simply find a reservable window
+ * inside the given range(start_block, group_end_block).
+ *
+ * To make sure the reservation window has a free bit inside it, we need
+ * to check the bitmap after we found a reservable window.
+ */
+ size = my_rsv->rsv_goal_size;
+retry:
+ prev_rsv = find_next_reservable_window(search_head, fs_rsv_head, size,
+ &start_block, group_end_block);
+ if (prev_rsv == NULL){
+#ifdef EXT3_RESERVATION_DEBUG
+ printk(KERN_DEBUG "NO WINDOW. start_block %d, group_end_block %d,"
+ "search_head %p, fs_rsv_head %p\n",
+ start_block, group_end_block, search_head,
+ fs_rsv_head);
+#endif
+ goto failed;
+ }
+ reservable_space_start = start_block;
+ /*
+ * on succeed, find_next_reservable_window() returns the
+ * reservation window where there is a reservable space after it.
+ * Before we reserve this reservable space, we need
+ * to make sure there is at least a free block inside this region.
+ *
+ * searching the first free bit on the block bitmap and copy of
+ * last committed bitmap alternatively, until we found a allocatable
+ * block. Search start from the start block of the reservable space
+ * we just found.
+ */
+ first_free_block = bitmap_search_next_usable_block(
+ reservable_space_start - group_first_block,
+ bitmap_bh, group_end_block - group_first_block + 1);
+
+ if (first_free_block < 0) {
+#ifdef EXT3_RESERVATION_DEBUG
+ printk(KERN_DEBUG "group_first_block %d, group_end_block %d,"
+ " reservable_space_start %d\n", group_first_block,
+ group_end_block, reservable_space_start);
+#endif
+ /*
+ * no free block left on the bitmap, no point
+ * to reserve the space. return failed.
+ */
+ goto failed;
+ }
+ start_block = first_free_block + group_first_block;
+ /*
+ * check if the first free block is within the
+ * free space we just found
+ */
+ if ((start_block >= reservable_space_start) &&
+ (start_block < reservable_space_start + size))
+ goto found_rsv_window;
+ /*
+ * if the first free bit we found is out of the reservable space
+ * this means there is no free block on the reservable space
+ * we should continue search for next reservable space,
+ * start from where the free block is,
+ * we also shift the list head to where we stopped last time
+ */
+ search_head = prev_rsv;
+ goto retry;
+
+found_rsv_window:
+ /*
+ * great! the reservable space contains some free blocks.
+ * Insert it to the list.
+ */
+ rsv_window_add(my_rsv, prev_rsv);
+ my_rsv->rsv_start = reservable_space_start;
+ my_rsv->rsv_end = my_rsv->rsv_start + size - 1;
+#ifdef EXT3_RESERVATION_DEBUG
+ printk(KERN_DEBUG "New reservation window allocated:"
+ " 0x%p start: %d, end: %d \n", my_rsv, my_rsv->rsv_start,
+ my_rsv->rsv_end);
+ rsv_window_dump(fs_rsv_head, "alloc_new_reservation");
+#endif
+
+ return 0; /* succeed */
+failed:
+ return -1; /* failed */
+}
+/*
+ * This is the main function used to allocate a new block and
+ * its reservation window.
+ * Each time a new block allocation is needed, first try to allocate
+ * from the inode's own reservation.
+ * If it does not have a reservation window, instead of looking
+ * for a free bit on bitmap first, then looking up the reservation list to see if
+ * it is inside somebody else's reservation window,
+ * we first try to allocate a reservation window for it starting from the goal.
+ * Then do the block allocation within the reservation window.
+ *
+ * This avoids searching the reservation list again and again
+ * when somebody is looking for a free block (without reservation),
+ * and there are lots of free blocks, but they are all reserved.
+ *
+ * We use a sorted double linked list for the per-filesystem reservation list.
+ * The insert, remove and find a free space (non-reserved) operations for the
+ * sorted double linked list should be fast.
+ *
+ */
+static int
+ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
+ unsigned int group, struct buffer_head *bitmap_bh,
+ int goal, struct reserve_window * my_rsv,
+ int *errp)
+{
+ spinlock_t *rsv_lock;
+ unsigned long group_first_block;
+ int ret = 0;
+ int fatal;
+ int credits = 0;
+
+ *errp = 0;
+
+ /*
+ * Make sure we use undo access for the bitmap, because it is critical
+ * that we do the frozen_data COW on bitmap buffers in all cases even
+ * if the buffer is in BJ_Forget state in the committing transaction.
+ */
+ BUFFER_TRACE(bitmap_bh, "get undo access for new block");
+ fatal = ext3_journal_get_undo_access(handle, bitmap_bh, &credits);
if (fatal) {
*errp = fatal;
- goto fail;
+ return -1;
}
- return goal;
-fail_access:
+#ifdef EXT3_RESERVATION
+ rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
+ /*
+ * goal is a group relative block number (if there is a goal)
+ * 0 <= goal < EXT3_BLOCKS_PER_GROUP(sb)
+ * group_first_block is a filesystem wide block number:
+ * the block number of the first block in this group
+ */
+ group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
+ group * EXT3_BLOCKS_PER_GROUP(sb);
+
+ /*
+ * Basically we will allocate a new block from inode's reservation window.
+ *
+ * We need to allocate a new reservation window, if:
+ * a) inode does not have a reservation window; or
+ * b) last attempt to allocate a block from the existing reservation failed; or
+ * c) we come here with a goal that is outside the existing reservation window
+ *
+ * We do not need to allocate a new reservation window if
+ * we come here at the beginning with a goal and the goal is inside the window,
+ * or we don't have a goal but already have a reservation window.
+ * Then we can allocate from the reservation window directly.
+ */
+ while (1) {
+ if (rsv_is_empty(my_rsv) || (ret < 0) ||
+ ((goal >= 0) && !goal_in_my_reservation(my_rsv, goal+group_first_block))) {
+ spin_lock(rsv_lock);
+ ret = alloc_new_reservation(my_rsv, goal, sb, group, bitmap_bh);
+ spin_unlock(rsv_lock);
+ if (ret < 0)
+ break; /* failed */
+
+ if ((goal >= 0) && !goal_in_my_reservation(my_rsv, goal+group_first_block))
+ goal = -1;
+ }
+
+ ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal,
+ my_rsv);
+ if (ret >= 0)
+ break; /* succeed */
+ }
+#else
+ ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL);
+#endif
+
+ if (ret >= 0) {
+ BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for bitmap block");
+ fatal = ext3_journal_dirty_metadata(handle, bitmap_bh);
+ if (fatal) {
+ *errp = fatal;
+ return -1;
+ }
+ return ret;
+ }
+ /* allocation failed: fall through and release the bitmap buffer */
+
BUFFER_TRACE(bitmap_bh, "journal_release_buffer");
ext3_journal_release_buffer(handle, bitmap_bh, credits);
-fail:
- return -1;
+ return ret;
}
/*
@@ -490,6 +978,7 @@ ext3_new_block(handle_t *handle, struct
struct ext3_group_desc *gdp;
struct ext3_super_block *es;
struct ext3_sb_info *sbi;
+ struct reserve_window *my_rsv = &EXT3_I(inode)->i_rsv_window;
#ifdef EXT3FS_DEBUG
static int goal_hits, goal_attempts;
#endif
@@ -540,8 +1029,8 @@ ext3_new_block(handle_t *handle, struct
bitmap_bh = read_block_bitmap(sb, group_no);
if (!bitmap_bh)
goto io_error;
- ret_block = ext3_try_to_allocate(sb, handle, group_no,
- bitmap_bh, ret_block, &fatal);
+ ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
+ bitmap_bh, ret_block, my_rsv, &fatal);
if (fatal)
goto out;
if (ret_block >= 0)
@@ -569,8 +1058,8 @@ ext3_new_block(handle_t *handle, struct
bitmap_bh = read_block_bitmap(sb, group_no);
if (!bitmap_bh)
goto io_error;
- ret_block = ext3_try_to_allocate(sb, handle, group_no,
- bitmap_bh, -1, &fatal);
+ ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
+ bitmap_bh, -1, my_rsv, &fatal);
if (fatal)
goto out;
if (ret_block >= 0)
diff -urNp linux-2.6.4/fs/ext3/file.c 264-rsv/fs/ext3/file.c
--- linux-2.6.4/fs/ext3/file.c 2004-03-10 18:55:49.000000000 -0800
+++ 264-rsv/fs/ext3/file.c 2004-03-30 18:21:30.463306312 -0800
@@ -34,7 +34,8 @@
static int ext3_release_file (struct inode * inode, struct file * filp)
{
if (filp->f_mode & FMODE_WRITE)
- ext3_discard_prealloc (inode);
+ ext3_discard_reservation(inode);
+ /*ext3_discard_prealloc (inode);*/
if (is_dx(inode) && filp->private_data)
ext3_htree_free_dir_info(filp->private_data);
diff -urNp linux-2.6.4/fs/ext3/ialloc.c 264-rsv/fs/ext3/ialloc.c
--- linux-2.6.4/fs/ext3/ialloc.c 2004-03-10 18:55:27.000000000 -0800
+++ 264-rsv/fs/ext3/ialloc.c 2004-04-02 22:26:48.615417624 -0800
@@ -29,6 +29,7 @@
#include "xattr.h"
#include "acl.h"
+
/*
* ialloc.c contains the inodes allocation and deallocation routines
*/
@@ -585,6 +586,11 @@ got:
ei->i_prealloc_block = 0;
ei->i_prealloc_count = 0;
#endif
+ ei->i_rsv_window.rsv_start = 0;
+ ei->i_rsv_window.rsv_end= 0;
+ ei->i_rsv_window.rsv_goal_size = EXT3_DEFAULT_RESERVE_BLOCKS;
+ INIT_LIST_HEAD(&ei->i_rsv_window.rsv_list);
+
ei->i_block_group = group;
ext3_set_inode_flags(inode);
diff -urNp linux-2.6.4/fs/ext3/inode.c 264-rsv/fs/ext3/inode.c
--- linux-2.6.4/fs/ext3/inode.c 2004-03-10 18:55:35.000000000 -0800
+++ 264-rsv/fs/ext3/inode.c 2004-04-02 22:26:58.250952800 -0800
@@ -186,7 +186,9 @@ static int ext3_journal_test_restart(han
void ext3_put_inode(struct inode *inode)
{
if (!is_bad_inode(inode))
- ext3_discard_prealloc(inode);
+
+ ext3_discard_reservation(inode);
+ /* ext3_discard_prealloc(inode); */
}
/*
@@ -2137,8 +2139,9 @@ void ext3_truncate(struct inode * inode)
return;
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
return;
-
- ext3_discard_prealloc(inode);
+
+ ext3_discard_reservation(inode);
+ /* ext3_discard_prealloc(inode); */
/*
* We have to lock the EOF page here, because lock_page() nests
@@ -2535,7 +2538,10 @@ void ext3_read_inode(struct inode * inod
ei->i_prealloc_count = 0;
#endif
ei->i_block_group = iloc.block_group;
-
+ ei->i_rsv_window.rsv_start = 0;
+ ei->i_rsv_window.rsv_end= 0;
+ ei->i_rsv_window.rsv_goal_size = EXT3_DEFAULT_RESERVE_BLOCKS;
+ INIT_LIST_HEAD(&ei->i_rsv_window.rsv_list);
/*
* NOTE! The in-memory inode i_data array is in little-endian order
* even on big-endian machines: we do NOT byteswap the block numbers!
diff -urNp linux-2.6.4/fs/ext3/super.c 264-rsv/fs/ext3/super.c
--- linux-2.6.4/fs/ext3/super.c 2004-03-10 18:55:44.000000000 -0800
+++ 264-rsv/fs/ext3/super.c 2004-03-29 17:22:18.539077360 -0800
@@ -1291,6 +1291,13 @@ static int ext3_fill_super (struct super
sbi->s_gdb_count = db_count;
get_random_bytes(&sbi->s_next_generation, sizeof(u32));
spin_lock_init(&sbi->s_next_gen_lock);
+ /* per-filesystem reservation list head & lock */
+ spin_lock_init(&sbi->s_rsv_window_lock);
+ INIT_LIST_HEAD(&sbi->s_rsv_window_head.rsv_list);
+ sbi->s_rsv_window_head.rsv_start = 0;
+ sbi->s_rsv_window_head.rsv_end = 0;
+ sbi->s_rsv_window_head.rsv_goal_size = 0;
+
/*
* set up enough so that it can read an inode
*/
diff -urNp linux-2.6.4/include/linux/ext3_fs.h 264-rsv/include/linux/ext3_fs.h
--- linux-2.6.4/include/linux/ext3_fs.h 2004-03-10 18:55:33.000000000 -0800
+++ 264-rsv/include/linux/ext3_fs.h 2004-04-02 22:26:23.335260792 -0800
@@ -37,6 +37,8 @@ struct statfs;
*/
#undef EXT3_PREALLOCATE /* @@@ Fix this! */
#define EXT3_DEFAULT_PREALLOC_BLOCKS 8
+#define EXT3_RESERVATION
+#define EXT3_DEFAULT_RESERVE_BLOCKS 8
/*
* Always enable hashed directories
@@ -728,6 +730,7 @@ extern void ext3_put_inode (struct inode
extern void ext3_delete_inode (struct inode *);
extern int ext3_sync_inode (handle_t *, struct inode *);
extern void ext3_discard_prealloc (struct inode *);
+extern void ext3_discard_reservation (struct inode *);
extern void ext3_dirty_inode(struct inode *);
extern int ext3_change_inode_journal_flag(struct inode *, int);
extern void ext3_truncate (struct inode *);
diff -urNp linux-2.6.4/include/linux/ext3_fs_i.h 264-rsv/include/linux/ext3_fs_i.h
--- linux-2.6.4/include/linux/ext3_fs_i.h 2004-03-10 18:55:21.000000000 -0800
+++ 264-rsv/include/linux/ext3_fs_i.h 2004-03-30 18:39:30.556107248 -0800
@@ -18,8 +18,15 @@
#include <linux/rwsem.h>
+struct reserve_window{
+ struct list_head rsv_list;
+ __u32 rsv_start;
+ __u32 rsv_end;
+ unsigned short rsv_goal_size;
+};
+
/*
- * second extended file system inode data in memory
+ * third extended file system inode data in memory
*/
struct ext3_inode_info {
__u32 i_data[15];
@@ -61,6 +68,9 @@ struct ext3_inode_info {
__u32 i_prealloc_block;
__u32 i_prealloc_count;
#endif
+ /* block reservation window */
+ struct reserve_window i_rsv_window;
+
__u32 i_dir_start_lookup;
#ifdef CONFIG_EXT3_FS_XATTR
/*
diff -urNp linux-2.6.4/include/linux/ext3_fs_sb.h 264-rsv/include/linux/ext3_fs_sb.h
--- linux-2.6.4/include/linux/ext3_fs_sb.h 2004-03-10 18:55:44.000000000 -0800
+++ 264-rsv/include/linux/ext3_fs_sb.h 2004-03-29 17:22:18.544076600 -0800
@@ -59,6 +59,10 @@ struct ext3_sb_info {
struct percpu_counter s_dirs_counter;
struct blockgroup_lock s_blockgroup_lock;
+ /* lock and head of the per-fs reservation window list */
+ spinlock_t s_rsv_window_lock;
+ struct reserve_window s_rsv_window_head;
+
/* Journaling */
struct inode * s_journal_inode;
struct journal_s * s_journal;
[-- Attachment #3: Throughput.png --]
[-- Type: image/png, Size: 29395 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation
2004-04-03 1:45 ` [Ext2-devel] " Mingming Cao
@ 2004-04-03 1:50 ` Andrew Morton
2004-04-03 2:37 ` Mingming Cao
0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2004-04-03 1:50 UTC (permalink / raw)
To: Mingming Cao; +Cc: tytso, pbadari, linux-kernel, cmm, ext2-devel
Mingming Cao <cmm@us.ibm.com> wrote:
>
> Hi Andrew,
> Here is the second version of the ext3 reservation patch, mostly bug fixes
> plus the changes you suggested last time.
Great, thanks.
> Besides enabling/disabling the
> reservation feature, I am thinking of adding a way to set
> the default reservation window size (in blocks) when the fs is
> mounted, with a single mount option: "prealloc_window=n". When n=0,
> reservation is turned off; when n>0, it is on, and the ext3 default
> reservation window size for each file is n blocks (or 8 blocks, if 0 < n <
> 8).
hm, maybe. We should probably also provide a per-file ext3-specific ioctl
to allow specialised apps to manipulate the reservation size.
And we should grow the reservation size dynamically. I've suggested that
we double its size each time it is exhausted, up to some limit. There may
be better algorithms though.
This work doesn't help us with the slowly-growing logfile or mailbox file
problem. I guess that would require on-disk reservations, or a new
`chattr' hint or such.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation
2004-04-03 1:50 ` Andrew Morton
@ 2004-04-03 2:37 ` Mingming Cao
2004-04-03 2:50 ` Andrew Morton
0 siblings, 1 reply; 28+ messages in thread
From: Mingming Cao @ 2004-04-03 2:37 UTC (permalink / raw)
To: Andrew Morton; +Cc: tytso, pbadari, linux-kernel, ext2-devel
On Fri, 2004-04-02 at 17:50, Andrew Morton wrote:
> hm, maybe. We should probably also provide a per-file ext3-specific ioctl
> to allow specialised apps to manipulate the reservation size.
>
> And we should grow the reservation size dynamically. I've suggested that
> we double its size each time it is exhausted, up to some limit. There may
> be better algorithms though.
You mean when the reservation window is exhausted, right? I think
this is probably the easiest way, much like the readahead window does.
The catch is that sometimes the reserved window does not contain many free
blocks to allocate, so we could easily reach the upper limit.
Currently, when we try to reserve a window in a block group and there is no
free range big enough for it, we skip this group and move on to the next
group. I was thinking maybe we should keep track of the largest
available reservable window while we are looking for a new window, so that
if we can't find one of the expected size, we can at least get one
within the group.
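A minimal sketch of that fallback, under several assumptions: the windows of one group are given as an array sorted by start block rather than as the patch's linked list, only the gaps between reservations are considered (the block bitmap is ignored), and `struct rsv_win` and `find_reservable_gap()` are hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified reservation window: [start, end] inclusive. */
struct rsv_win {
	unsigned long start;
	unsigned long end;
};

/*
 * Walk the gaps between windows (sorted by start, non-overlapping, all
 * inside [group_first, group_last]) looking for `want` unreserved blocks.
 * Returns the length of the range chosen and stores its first block in
 * *out_start: the first gap that fits `want`, otherwise the largest gap
 * seen.  Returns 0 (and leaves *out_start alone) if the whole group is
 * covered by reservations.
 */
unsigned long
find_reservable_gap(const struct rsv_win *wins, size_t nwins,
		    unsigned long group_first, unsigned long group_last,
		    unsigned long want, unsigned long *out_start)
{
	unsigned long cur = group_first;	/* first block past the last window seen */
	unsigned long best_len = 0;
	size_t i;

	assert(want > 0);

	for (i = 0; i <= nwins; i++) {
		/* the gap runs from cur up to the next window, or to the group end */
		unsigned long gap_end = (i < nwins) ? wins[i].start : group_last + 1;
		unsigned long len = (gap_end > cur) ? gap_end - cur : 0;

		if (len >= want) {
			*out_start = cur;
			return want;
		}
		if (len > best_len) {
			best_len = len;
			*out_start = cur;
		}
		if (i < nwins)
			cur = wins[i].end + 1;
	}
	return best_len;
}
```

With windows at 10-17, 20-27 and 40-47 in a 64-block group, a request for 32 blocks would get the 16-block gap starting at block 48 instead of skipping the group.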
This will try to keep the file inside its target group, and also reduce
the possibility of a bogus early ENOSPC. It is a trade-off, though: there
may be plenty of space in the next group... who knows. What do you
think?
Also, on the bogus early ENOSPC: the filesystem is probably
relatively full of reservations, so a latecomer who needs a new block
but doesn't have a reservation window will fail to allocate a block. In
this case, the easiest way, as you said before, is to just steal a free
block from another file's reservation window. I agree it's the extreme
case and the solution is easy. My only small concern is that this will
favor those inodes that came first and made reservations, while those
that come in later and need a new block have to suffer the same
pain every time: search the whole filesystem first, then end up stealing
blocks.
Anyway, maybe by that point there are too many files open for
writing (so the fs is full of reservations) and it is already in trouble.
>
> This work doesn't help us with the slowly-growing logfile or mailbox file
> problem. I guess that would require on-disk reservations, or a new
> `chattr' hint or such.
Ted has suggested preserving the reservation/preallocation for those
slowly growing logfiles or mailbox files: probably do not discard the
reservation window for those files (the logfiles) when they are closed.
When such a file is opened the next time, it will allocate blocks directly
from the old reservation window. Is that what you think?
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation
2004-04-03 2:37 ` Mingming Cao
@ 2004-04-03 2:50 ` Andrew Morton
2004-04-05 16:49 ` Mingming Cao
2004-04-14 0:52 ` [PATCH 0/4] ext3 block reservation patch set Mingming Cao
0 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2004-04-03 2:50 UTC (permalink / raw)
To: Mingming Cao; +Cc: tytso, pbadari, linux-kernel, ext2-devel
Mingming Cao <cmm@us.ibm.com> wrote:
>
> On Fri, 2004-04-02 at 17:50, Andrew Morton wrote:
> > hm, maybe. We should probably also provide a per-file ext3-specific ioctl
> > to allow specialised apps to manipulate the reservation size.
> >
> > And we should grow the reservation size dynamically. I've suggested that
> > we double its size each time it is exhausted, up to some limit. There may
> > be better algorithms though.
> You mean when the reservation window is exhausted, right? I think
> this is probably the easiest way, much like the readahead window does.
> The catch is that sometimes the reserved window does not contain many free
> blocks to allocate, so we could easily reach the upper limit.
Good point. So the reservation should be grown by "the number of blocks we
allocated in the previous window", not by "the size of the previous
window", yes?
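A small sketch of the difference between the two growth rules, assuming a cap of 1024 blocks (the thread only says "up to some limit") and a hypothetical helper name `next_goal_size()`. Growing by the blocks actually allocated equals doubling only when the previous window was fully used.

```c
#include <assert.h>

/* Assumed cap; the thread does not name a value. */
#define RSV_MAX_GOAL_SIZE 1024U

/*
 * Next window size when growing by the number of blocks that were
 * actually allocated from the previous window (not by its full size).
 */
unsigned int next_goal_size(unsigned int cur_size, unsigned int allocated)
{
	assert(cur_size > 0);

	if (cur_size >= RSV_MAX_GOAL_SIZE)
		return RSV_MAX_GOAL_SIZE;
	/* compare before adding so the sum can never overflow */
	if (allocated > RSV_MAX_GOAL_SIZE - cur_size)
		return RSV_MAX_GOAL_SIZE;
	return cur_size + allocated;
}
```

An 8-block window that yielded all 8 blocks grows to 16, while one that yielded only 2 free blocks grows to 10, so a window laid over mostly used space does not race to the limit.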
> Currently, when we try to reserve a window in a block group and there is no
> free range big enough for it, we skip this group and move on to the next
> group. I was thinking maybe we should keep track of the largest
> available reservable window while we are looking for a new window, so that
> if we can't find one of the expected size, we can at least get one
> within the group.
I suspect that if you cannot get a window in the blockgroup then simply
skipping to the next blockgroup should be OK.
But I don't understand why the reservation code needs to know about
blockgroups at all, at least from a conceptual point of view.
> This will try to keep the file inside its target group, and also reduce
> the possibility of a bogus early ENOSPC. It is a trade-off, though: there
> may be plenty of space in the next group... who knows. What do you
> think?
Probably it's sufficient to use the inode's blockgroup's starting block as
the initial target for allocations and then just forget about blockgroups.
Simply let allocation wander further up the disk from there, with no
further consideration of blockgroups.
> Also, on the bogus early ENOSPC: the filesystem is probably
> relatively full of reservations, so a latecomer who needs a new block
> but doesn't have a reservation window will fail to allocate a block. In
> this case, the easiest way, as you said before, is to just steal a free
> block from another file's reservation window. I agree it's the extreme
> case and the solution is easy. My only small concern is that this will
> favor those inodes that came first and made reservations, while those
> that come in later and need a new block have to suffer the same
> pain every time: search the whole filesystem first, then end up stealing
> blocks.
It would be fairly weird for the entire disk to be covered by reservations,
so falling back to the current algorithm would be OK.
> Anyway, maybe by that point there are too many files open for
> writing (so the fs is full of reservations) and it is already in trouble.
>
> >
> > This work doesn't help us with the slowly-growing logfile or mailbox file
> > problem. I guess that would require on-disk reservations, or a new
> > `chattr' hint or such.
>
> Ted has suggested preserving the reservation/preallocation for those
> slowly growing logfiles or mailbox files: probably do not discard the
> reservation window for those files (the logfiles) when they are closed.
> When such a file is opened the next time, it will allocate blocks directly
> from the old reservation window. Is that what you think?
yup, except we now have potentially millions of inodes which have active
reservations. ENOSPC and CPU consumption problems are certain.
Some combination of
- A chattr hint
- Using O_APPEND as a hint and
- Retaining an upper limit on the number of unopened inodes which have a
reservation
should fix that up. You'd need to hook into ->destroy_inode to release
reservations when inodes are reclaimed by the VM.
But this is surely phase two material.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation
2004-04-03 2:50 ` Andrew Morton
@ 2004-04-05 16:49 ` Mingming Cao
2004-04-14 0:52 ` [PATCH 0/4] ext3 block reservation patch set Mingming Cao
1 sibling, 0 replies; 28+ messages in thread
From: Mingming Cao @ 2004-04-05 16:49 UTC (permalink / raw)
To: Andrew Morton; +Cc: tytso, pbadari, linux-kernel, ext2-devel
On Fri, 2004-04-02 at 18:50, Andrew Morton wrote:
> Mingming Cao <cmm@us.ibm.com> wrote:
> >
> > On Fri, 2004-04-02 at 17:50, Andrew Morton wrote:
> > > hm, maybe. We should probably also provide a per-file ext3-specific ioctl
> > > to allow specialised apps to manipulate the reservation size.
> > >
> > > And we should grow the reservation size dynamically. I've suggested that
> > > we double its size each time it is exhausted, up to some limit. There may
> > > be better algorithms though.
> > You mean when the reservation window is exhausted, right? I think
> > this is probably the easiest way, much like the readahead window does.
> > The catch is that sometimes the reserved window does not contain many free
> > blocks to allocate, so we could easily reach the upper limit.
>
> Good point. So the reservation should be grown by "the number of blocks we
> allocated in the previous window", not by "the size of the previous
> window", yes?
>
Yes. Maybe we add a counter to the reservation structure to keep track
of preallocation hits. Then, when a new window needs to be created, we
look at the old window's preallocation hit ratio to determine how big the
window should be next time.
> > Currently, when we try to reserve a window in a block group and there is no
> > free range big enough for it, we skip this group and move on to the next
> > group. I was thinking maybe we should keep track of the largest
> > available reservable window while we are looking for a new window, so that
> > if we can't find one of the expected size, we can at least get one
> > within the group.
>
> I suspect that if you cannot get a window in the blockgroup then simply
> skipping to the next blockgroup should be OK.
>
Okay.
> But I don't understand why the reservation code needs to know about
> blockgroups at all, at least from a conceptual point of view.
>
Agreed, reservation itself is a filesystem-wide concept. A
reservation window could cross a block group boundary.
> Probably it's sufficient to use the inode's blockgroup's starting block as
> the initial target for allocations and then just forget about blockgroups.
> Simply let allocation wander further up the disk from there, with no
> further consideration of blockgroups.
I think the current code's logic is the same as what you describe. The logic
of the current code is: given a goal block, try to allocate a block starting
from there within the inode's block group. If that fails, simply
move on to the next group without a goal; the search for a free block then
starts from the first block of the next group. I was trying to keep
the same logic as before. So for the reservation code, given a goal
block, we try to allocate a new reservation window (and then
allocate a block within it) starting from the given goal block. If that
fails, we simply do the reservation window allocation in the rest of the disk,
without considering the inode's blockgroup.
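The search order described above can be sketched as follows. This is a simplified model, not the patch's code: `used[]` holds one byte per block as a stand-in for the on-disk bitmaps, `find_free_block()` is a hypothetical name, and the wrap-around that revisits the goal group last is an assumption about how the remaining groups are walked.

```c
#include <assert.h>

/*
 * used[] has one byte per block (nonzero = in use).
 * Returns the block found, or -1 if every block is in use.
 */
long find_free_block(const unsigned char *used, unsigned long ngroups,
		     unsigned long per_group, unsigned long goal)
{
	unsigned long goal_group = goal / per_group;
	unsigned long g, b, i;

	assert(goal < ngroups * per_group);

	/* 1) goal group: search from the goal to the end of that group */
	for (b = goal; b < (goal_group + 1) * per_group; b++)
		if (!used[b])
			return (long)b;

	/*
	 * 2) no luck: walk the following groups without a goal, each from
	 *    its first block, wrapping around; the last iteration revisits
	 *    the goal group to cover the blocks before the goal.
	 */
	for (i = 1; i <= ngroups; i++) {
		g = (goal_group + i) % ngroups;
		for (b = g * per_group; b < (g + 1) * per_group; b++)
			if (!used[b])
				return (long)b;
	}
	return -1;
}
```

The reservation variant would follow the same order, with "find a free block" replaced by "find room for a reservation window, then allocate inside it".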
>
> It would be fairly weird for the entire disk to be covered by reservations,
> so falling back to the current algorithm would be OK.
Okay.
> > > This work doesn't help us with the slowly-growing logfile or mailbox file
> > > problem. I guess that would require on-disk reservations, or a new
> > > `chattr' hint or such.
> >
> > Ted has suggested preserving the reservation/preallocation for those
> > slowly growing logfiles or mailbox files: probably do not discard the
> > reservation window for those files (the logfiles) when they are closed.
> > When such a file is opened the next time, it will allocate blocks directly
> > from the old reservation window. Is that what you think?
>
> yup, except we now have potentially millions of inodes which have active
> reservations. ENOSPC and CPU consumption problems are certain.
>
> Some combination of
>
> - A chattr hint
>
> - Using O_APPEND as a hint and
>
> - Retaining an upper limit on the number of unopened inodes which have a
> reservation
>
> should fix that up. You'd need to hook into ->destroy_inode to release
> reservations when inodes are reclaimed by the VM.
>
> But this is surely phase two material.
Okay. Will think about this more later...
Thanks for your help!
Mingming
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 0/4] ext3 block reservation patch set
2004-04-03 2:50 ` Andrew Morton
2004-04-05 16:49 ` Mingming Cao
@ 2004-04-14 0:52 ` Mingming Cao
2004-04-14 0:54 ` [PATCH 1/4] ext3 block reservation patch set -- ext3 preallocation cleanup Mingming Cao
` (5 more replies)
1 sibling, 6 replies; 28+ messages in thread
From: Mingming Cao @ 2004-04-14 0:52 UTC (permalink / raw)
To: Andrew Morton; +Cc: tytso, pbadari, linux-kernel, ext2-devel
Hello,
Here is a set of patches which implement the in-memory ext3 block
reservation (previously called reservation based ext3 preallocation).
[patch 1]ext3_rsv_cleanup.patch: Cleans up the old ext3 preallocation
code carried from ext2 but turned off.
[patch 2]ext3_rsv_base.patch: Implements the base of in-memory block
reservation and block allocation from reservation window.
[patch 3]ext3_rsv_mount.patch: Adds features on top of
ext3_rsv_base.patch:
- deals with the earlier bogus -ENOSPC error
- does block reservation only for regular files
- makes the ext3 reservation feature a mount option
(new mount option added: reservation)
- adds a pair of file ioctl commands for applications to control
the block reservation window size
[patch 4]ext3_rsv_dw.patch: Adjusts the reservation window size
dynamically:
Starting from the default reservation window size, if the hit ratio of
the reservation window is more than 50%, we double the reservation
window size next time, up to a certain upper limit.
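A minimal sketch of that rule, with assumptions stated: "hit ratio" is taken to mean blocks allocated from the window divided by the window size, the upper limit of 1024 blocks is made up for illustration, and `adjust_goal_size()` is a hypothetical name rather than a function from ext3_rsv_dw.patch.

```c
#include <assert.h>

/* Assumed upper limit; the cover letter only says "a certain upper limit". */
#define RSV_MAX_GOAL_SIZE 1024U

/*
 * Return the window size to use next time, given the size of the window
 * just used and how many blocks were allocated from it.
 */
unsigned int adjust_goal_size(unsigned int goal_size, unsigned int hits)
{
	assert(goal_size > 0);

	/* hits / goal_size > 50%  <=>  2 * hits > goal_size (no division) */
	if (2UL * hits > goal_size) {
		if (goal_size > RSV_MAX_GOAL_SIZE / 2)
			return RSV_MAX_GOAL_SIZE;
		return goal_size * 2;
	}
	return goal_size;
}
```

Under these assumptions an 8-block window with 5 hits grows to 16 blocks, while one with exactly 4 hits (50%, not more) stays at 8.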
Here are some numbers collected on dbench on 8 way PIII 700Mhz:
dbench average throughputs on 4 runs
==================================================
Threads   ext3   ext3+rsv(8)   ext3+rsv+dw
1         103    104(0%)       105(1%)
4         144    286(98%)      256(77%)
8         118    197(66%)      210(77%)
16        113    160(41%)      177(56%)
32        61     123(101%)     150(145%)
64        41     82(100%)      85(107%)
And some numbers on tiobench sequential write:
tiobench sequential write throughputs (improvements)
=====================================================================
Threads   ext2   ext3   ext3+rsv(8)(%)   ext3+rsv(128)(%)   ext3+rsv+dw(%)
1         26     23     25(8%)           26(13%)            26(13%)
4         17     4      14(250%)         24(500%)           25(525%)
8         15     7      13(85%)          23(228%)           24(242%)
16        16     13     12(-7%)          22(69%)            24(84%)
32        15     3      12(300%)         23(666%)           23(666%)
64        14     1      11(1000%)        22(2100%)          23(2200%)
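The percentages in both tables are consistent with a change relative to the plain ext3 column, truncated to a whole number. That is inferred from the figures, not stated in the mail. A sketch of the arithmetic, with a hypothetical helper name:

```c
/*
 * Percentage change of `with_patch` relative to `baseline`.
 * C integer division truncates toward zero, which matches the tables
 * (for example 98.6% is shown as 98% and -7.7% as -7%).
 */
int improvement_pct(int baseline, int with_patch)
{
	return (with_patch - baseline) * 100 / baseline;
}
```

For the 4-thread dbench row, (286 - 144) * 100 / 144 gives 98, and for the 16-thread tiobench row, (12 - 13) * 100 / 13 gives -7.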
Note that each test run was on a freshly created ext3 filesystem.
We have also run fsx tests on an 8-way with a 2.6.4 kernel plus the patch set
for a whole weekend on a freshly created ext3 filesystem, as well as on a
4-way with ext3 as the root filesystem plus all the changes. Other tests
include 8-thread dd tests and untarring a kernel source tree.
Besides looking at the performance numbers and verifying the functionality, we
also checked the block allocation layout of each file generated during
the test: the blocks of a file are more contiguous with the reservation
mount option on, especially when we dynamically increase the reservation
window size in the sequential write cases.
Andrew, is this something that you would consider for the -mm tree?
Thanks again to Andrew, Ted and Badari for their ideas and help on this
project. I would really appreciate any comments and feedback.
Mingming
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 1/4] ext3 block reservation patch set -- ext3 preallocation cleanup
2004-04-14 0:52 ` [PATCH 0/4] ext3 block reservation patch set Mingming Cao
@ 2004-04-14 0:54 ` Mingming Cao
2004-04-14 0:57 ` [PATCH 2/4] ext3 block reservation patch set --ext3 block reservation Mingming Cao
` (4 subsequent siblings)
5 siblings, 0 replies; 28+ messages in thread
From: Mingming Cao @ 2004-04-14 0:54 UTC (permalink / raw)
To: Mingming Cao; +Cc: Andrew Morton, tytso, pbadari, linux-kernel, ext2-devel
[-- Attachment #1: Type: text/plain, Size: 502 bytes --]
> [patch 1]ext3_rsv_cleanup.patch: Cleans up the old ext3 preallocation
> code carried from ext2 but turned off.
diffstat ext3_rsv_cleanup.patch
 fs/ext3/balloc.c          |  3 -
 fs/ext3/file.c            |  2 -
 fs/ext3/ialloc.c          |  4 --
 fs/ext3/inode.c           | 91 ----------------------------------------------
 fs/ext3/xattr.c           |  2 -
 include/linux/ext3_fs.h   |  9 ----
 include/linux/ext3_fs_i.h |  4 --
 7 files changed, 4 insertions(+), 111 deletions(-)
[-- Attachment #2: ext3_rsv_cleanup.patch --]
[-- Type: text/x-patch, Size: 7752 bytes --]
diff -urNp linux-2.6.4/fs/ext3/balloc.c 264-rsv-cleanup/fs/ext3/balloc.c
--- linux-2.6.4/fs/ext3/balloc.c 2004-03-10 18:55:21.000000000 -0800
+++ 264-rsv-cleanup/fs/ext3/balloc.c 2004-04-06 01:13:25.560544680 -0700
@@ -474,8 +474,7 @@ fail:
* This function also updates quota and i_blocks field.
*/
int
-ext3_new_block(handle_t *handle, struct inode *inode, unsigned long goal,
- u32 *prealloc_count, u32 *prealloc_block, int *errp)
+ext3_new_block(handle_t *handle, struct inode *inode, unsigned long goal, int *errp)
{
struct buffer_head *bitmap_bh = NULL; /* bh */
struct buffer_head *gdp_bh; /* bh2 */
diff -urNp linux-2.6.4/fs/ext3/file.c 264-rsv-cleanup/fs/ext3/file.c
--- linux-2.6.4/fs/ext3/file.c 2004-03-10 18:55:49.000000000 -0800
+++ 264-rsv-cleanup/fs/ext3/file.c 2004-04-06 01:12:49.136082040 -0700
@@ -33,8 +33,6 @@
*/
static int ext3_release_file (struct inode * inode, struct file * filp)
{
- if (filp->f_mode & FMODE_WRITE)
- ext3_discard_prealloc (inode);
if (is_dx(inode) && filp->private_data)
ext3_htree_free_dir_info(filp->private_data);
diff -urNp linux-2.6.4/fs/ext3/ialloc.c 264-rsv-cleanup/fs/ext3/ialloc.c
--- linux-2.6.4/fs/ext3/ialloc.c 2004-03-10 18:55:27.000000000 -0800
+++ 264-rsv-cleanup/fs/ext3/ialloc.c 2004-04-06 01:09:26.834836504 -0700
@@ -581,10 +581,6 @@ got:
ei->i_file_acl = 0;
ei->i_dir_acl = 0;
ei->i_dtime = 0;
-#ifdef EXT3_PREALLOCATE
- ei->i_prealloc_block = 0;
- ei->i_prealloc_count = 0;
-#endif
ei->i_block_group = group;
ext3_set_inode_flags(inode);
diff -urNp linux-2.6.4/fs/ext3/inode.c 264-rsv-cleanup/fs/ext3/inode.c
--- linux-2.6.4/fs/ext3/inode.c 2004-03-10 18:55:35.000000000 -0800
+++ 264-rsv-cleanup/fs/ext3/inode.c 2004-04-06 01:13:44.307694680 -0700
@@ -185,8 +185,6 @@ static int ext3_journal_test_restart(han
*/
void ext3_put_inode(struct inode *inode)
{
- if (!is_bad_inode(inode))
- ext3_discard_prealloc(inode);
}
/*
@@ -244,62 +242,12 @@ no_delete:
clear_inode(inode); /* We must guarantee clearing of inode... */
}
-void ext3_discard_prealloc (struct inode * inode)
-{
-#ifdef EXT3_PREALLOCATE
- struct ext3_inode_info *ei = EXT3_I(inode);
- /* Writer: ->i_prealloc* */
- if (ei->i_prealloc_count) {
- unsigned short total = ei->i_prealloc_count;
- unsigned long block = ei->i_prealloc_block;
- ei->i_prealloc_count = 0;
- ei->i_prealloc_block = 0;
- /* Writer: end */
- ext3_free_blocks (inode, block, total);
- }
-#endif
-}
-
static int ext3_alloc_block (handle_t *handle,
struct inode * inode, unsigned long goal, int *err)
{
unsigned long result;
-#ifdef EXT3_PREALLOCATE
-#ifdef EXT3FS_DEBUG
- static unsigned long alloc_hits, alloc_attempts;
-#endif
- struct ext3_inode_info *ei = EXT3_I(inode);
- /* Writer: ->i_prealloc* */
- if (ei->i_prealloc_count &&
- (goal == ei->i_prealloc_block ||
- goal + 1 == ei->i_prealloc_block))
- {
- result = ei->i_prealloc_block++;
- ei->i_prealloc_count--;
- /* Writer: end */
- ext3_debug ("preallocation hit (%lu/%lu).\n",
- ++alloc_hits, ++alloc_attempts);
- } else {
- ext3_discard_prealloc (inode);
- ext3_debug ("preallocation miss (%lu/%lu).\n",
- alloc_hits, ++alloc_attempts);
- if (S_ISREG(inode->i_mode))
- result = ext3_new_block (inode, goal,
- &ei->i_prealloc_count,
- &ei->i_prealloc_block, err);
- else
- result = ext3_new_block (inode, goal, 0, 0, err);
- /*
- * AKPM: this is somewhat sticky. I'm not surprised it was
- * disabled in 2.2's ext3. Need to integrate b_committed_data
- * guarding with preallocation, if indeed preallocation is
- * effective.
- */
- }
-#else
- result = ext3_new_block (handle, inode, goal, 0, 0, err);
-#endif
+ result = ext3_new_block (handle, inode, goal, err);
return result;
}
@@ -966,38 +914,6 @@ struct buffer_head *ext3_bread(handle_t
bh = ext3_getblk (handle, inode, block, create, err);
if (!bh)
return bh;
-#ifdef EXT3_PREALLOCATE
- /*
- * If the inode has grown, and this is a directory, then use a few
- * more of the preallocated blocks to keep directory fragmentation
- * down. The preallocated blocks are guaranteed to be contiguous.
- */
- if (create &&
- S_ISDIR(inode->i_mode) &&
- inode->i_blocks > prev_blocks &&
- EXT3_HAS_COMPAT_FEATURE(inode->i_sb,
- EXT3_FEATURE_COMPAT_DIR_PREALLOC)) {
- int i;
- struct buffer_head *tmp_bh;
-
- for (i = 1;
- EXT3_I(inode)->i_prealloc_count &&
- i < EXT3_SB(inode->i_sb)->s_es->s_prealloc_dir_blocks;
- i++) {
- /*
- * ext3_getblk will zero out the contents of the
- * directory for us
- */
- tmp_bh = ext3_getblk(handle, inode,
- block+i, create, err);
- if (!tmp_bh) {
- brelse (bh);
- return 0;
- }
- brelse (tmp_bh);
- }
- }
-#endif
if (buffer_uptodate(bh))
return bh;
ll_rw_block (READ, 1, &bh);
@@ -2138,8 +2054,6 @@ void ext3_truncate(struct inode * inode)
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
return;
- ext3_discard_prealloc(inode);
-
/*
* We have to lock the EOF page here, because lock_page() nests
* outside journal_start().
@@ -2531,9 +2445,6 @@ void ext3_read_inode(struct inode * inod
}
ei->i_disksize = inode->i_size;
inode->i_generation = le32_to_cpu(raw_inode->i_generation);
-#ifdef EXT3_PREALLOCATE
- ei->i_prealloc_count = 0;
-#endif
ei->i_block_group = iloc.block_group;
/*
diff -urNp linux-2.6.4/fs/ext3/xattr.c 264-rsv-cleanup/fs/ext3/xattr.c
--- linux-2.6.4/fs/ext3/xattr.c 2004-03-10 18:55:28.000000000 -0800
+++ 264-rsv-cleanup/fs/ext3/xattr.c 2004-04-06 01:14:03.397792544 -0700
@@ -787,7 +787,7 @@ ext3_xattr_set_handle2(handle_t *handle,
EXT3_I(inode)->i_block_group *
EXT3_BLOCKS_PER_GROUP(sb);
int block = ext3_new_block(handle,
- inode, goal, 0, 0, &error);
+ inode, goal, &error);
if (error)
goto cleanup;
ea_idebug(inode, "creating block %d", block);
diff -urNp linux-2.6.4/include/linux/ext3_fs.h 264-rsv-cleanup/include/linux/ext3_fs.h
--- linux-2.6.4/include/linux/ext3_fs.h 2004-03-10 18:55:33.000000000 -0800
+++ 264-rsv-cleanup/include/linux/ext3_fs.h 2004-04-06 01:15:11.343463232 -0700
@@ -33,12 +33,6 @@ struct statfs;
#undef EXT3FS_DEBUG
/*
- * Define EXT3_PREALLOCATE to preallocate data blocks for expanding files
- */
-#undef EXT3_PREALLOCATE /* @@@ Fix this! */
-#define EXT3_DEFAULT_PREALLOC_BLOCKS 8
-
-/*
* Always enable hashed directories
*/
#define CONFIG_EXT3_INDEX
@@ -680,8 +674,7 @@ struct dir_private_info {
/* balloc.c */
extern int ext3_bg_has_super(struct super_block *sb, int group);
extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group);
-extern int ext3_new_block (handle_t *, struct inode *, unsigned long,
- __u32 *, __u32 *, int *);
+extern int ext3_new_block (handle_t *, struct inode *, unsigned long, int *);
extern void ext3_free_blocks (handle_t *, struct inode *, unsigned long,
unsigned long);
extern unsigned long ext3_count_free_blocks (struct super_block *);
diff -urNp linux-2.6.4/include/linux/ext3_fs_i.h 264-rsv-cleanup/include/linux/ext3_fs_i.h
--- linux-2.6.4/include/linux/ext3_fs_i.h 2004-03-10 18:55:21.000000000 -0800
+++ 264-rsv-cleanup/include/linux/ext3_fs_i.h 2004-04-06 00:38:28.684318320 -0700
@@ -57,10 +57,6 @@ struct ext3_inode_info {
* allocation when we detect linearly ascending requests.
*/
__u32 i_next_alloc_goal;
-#ifdef EXT3_PREALLOCATE
- __u32 i_prealloc_block;
- __u32 i_prealloc_count;
-#endif
__u32 i_dir_start_lookup;
#ifdef CONFIG_EXT3_FS_XATTR
/*
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 2/4] ext3 block reservation patch set --ext3 block reservation
2004-04-14 0:52 ` [PATCH 0/4] ext3 block reservation patch set Mingming Cao
2004-04-14 0:54 ` [PATCH 1/4] ext3 block reservation patch set -- ext3 preallocation cleanup Mingming Cao
@ 2004-04-14 0:57 ` Mingming Cao
2004-04-14 0:58 ` [PATCH 3/4] ext3 block reservation patch set --mount and ioctl feature Mingming Cao
` (3 subsequent siblings)
5 siblings, 0 replies; 28+ messages in thread
From: Mingming Cao @ 2004-04-14 0:57 UTC (permalink / raw)
To: Mingming Cao; +Cc: Andrew Morton, tytso, pbadari, linux-kernel, ext2-devel
[-- Attachment #1: Type: text/plain, Size: 629 bytes --]
> [patch 2] ext3_rsv_base.patch: Implements the base of in-memory block
> reservation and block allocation from the reservation window.
- basic reservation structure and operations
- reservation-based ext3 block allocation
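The central list operation in this patch, finding a gap between existing reservation windows that is large enough for a new one, can be sketched in userspace C. This is only an illustration under stated assumptions: it uses a sorted array in place of the kernel's list_head, and the names rsv_window and find_reservable_start are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the patch's struct reserve_window. */
struct rsv_window {
	unsigned long start;	/* first reserved block, inclusive */
	unsigned long end;	/* last reserved block, inclusive */
};

/*
 * Find the lowest block >= from at which a window of `size` blocks can
 * start without overlapping any entry of `wins`. `wins` must be sorted
 * by start and non-overlapping (an array here; the patch uses a list).
 * The new window must start at or before last_block, but may extend
 * past it, as in the patch.
 * Returns 0 and stores the start in *out, or -1 if no space is found.
 */
static int find_reservable_start(const struct rsv_window *wins, size_t n,
				 unsigned long from, unsigned long size,
				 unsigned long last_block, unsigned long *out)
{
	unsigned long cur = from;
	size_t i;

	assert(size > 0);
	for (i = 0; i < n; i++) {
		if (cur + size <= wins[i].start)
			break;			/* the gap before wins[i] fits */
		if (cur <= wins[i].end)
			cur = wins[i].end + 1;	/* skip past this window */
		if (cur > last_block)
			return -1;
	}
	if (cur > last_block)
		return -1;
	*out = cur;
	return 0;
}
```

For example, with windows 10-17, 18-25 and 40-47 already reserved, a request for 8 blocks starting the search at block 8 lands at block 26. The patch then checks the bitmap for a free block inside the candidate space before inserting the window, which this sketch leaves out.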
diffstat ext3_rsv_base.patch
fs/ext3/balloc.c | 548 ++++++++++++++++++++++++++++++++++++++++-----
fs/ext3/file.c | 2
fs/ext3/ialloc.c | 5
fs/ext3/inode.c | 9
fs/ext3/super.c | 7
include/linux/ext3_fs.h | 6
include/linux/ext3_fs_i.h | 12
include/linux/ext3_fs_sb.h | 4
8 files changed, 542 insertions(+), 51 deletions(-)
[-- Attachment #2: ext3_rsv_base.patch --]
[-- Type: text/x-patch, Size: 27830 bytes --]
diff -urNp -X dontdiff 264-rsv-cleanup/fs/ext3/balloc.c 264-rsv-cleanup-base/fs/ext3/balloc.c
--- 264-rsv-cleanup/fs/ext3/balloc.c 2004-04-06 01:49:11.000000000 -0700
+++ 264-rsv-cleanup-base/fs/ext3/balloc.c 2004-04-13 01:42:40.352200376 -0700
@@ -96,6 +96,79 @@ read_block_bitmap(struct super_block *sb
error_out:
return bh;
}
+/*
+ * The reservation window structure operations
+ * --------------------------------------------
+ * Operations include:
+ * dump, find, add, remove, is_empty, find_next_reservable_window, etc.
+ *
+ * We use a sorted doubly linked list for the per-filesystem reservation
+ * window list (like in vm_region).
+ *
+ * Initially, we keep those small operations in abstract functions,
+ * so that if we later need a better search structure than a doubly
+ * linked list, we can switch to it without changing too much
+ * code.
+ */
+static inline void rsv_window_dump(struct reserve_window *head, char *fn)
+{
+ struct reserve_window *rsv;
+ printk("Block Allocation Reservation Windows Map (%s):\n", fn);
+ list_for_each_entry(rsv, &head->rsv_list, rsv_list) {
+ printk("reservation window 0x%p start: %d, end: %d\n",
+ rsv, rsv->rsv_start, rsv->rsv_end);
+ }
+}
+
+static inline int goal_in_my_reservation(struct reserve_window *rsv, int goal,
+ unsigned int group, struct super_block * sb)
+{
+ unsigned long group_first_block, group_last_block;
+
+ group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
+ group * EXT3_BLOCKS_PER_GROUP(sb);
+ group_last_block = group_first_block + EXT3_BLOCKS_PER_GROUP(sb) -1 ;
+
+ if ((rsv->rsv_start > group_last_block) || (rsv->rsv_end < group_first_block))
+ return 0;
+ if ((goal >= 0) && ((goal + group_first_block < rsv->rsv_start)
+ || (goal + group_first_block > rsv->rsv_end)))
+ return 0;
+ return 1;
+}
+
+static inline void rsv_window_add(struct reserve_window *rsv,
+ struct reserve_window *prev)
+{
+ /* insert the new reservation window after prev */
+ list_add(&rsv->rsv_list, &prev->rsv_list);
+}
+
+static inline void rsv_window_remove(struct reserve_window *rsv)
+{
+ rsv->rsv_start = 0;
+ rsv->rsv_end = 0;
+ list_del(&rsv->rsv_list);
+ INIT_LIST_HEAD(&rsv->rsv_list);
+}
+static inline int rsv_is_empty(struct reserve_window *rsv)
+{
+ /* a valid reservation end block cannot be 0 */
+ return (rsv->rsv_end == 0);
+}
+
+void ext3_discard_reservation(struct inode *inode)
+{
+ struct ext3_inode_info *ei = EXT3_I(inode);
+ struct reserve_window *rsv = &ei->i_rsv_window;
+ spinlock_t *rsv_lock = &EXT3_SB(inode->i_sb)->s_rsv_window_lock;
+
+ if (!rsv_is_empty(rsv)) {
+ spin_lock(rsv_lock);
+ rsv_window_remove(rsv);
+ spin_unlock(rsv_lock);
+ }
+}
/* Free given blocks, update quota and i_blocks field */
void ext3_free_blocks (handle_t *handle, struct inode * inode,
@@ -313,6 +386,33 @@ static inline int ext3_test_allocatable(
return ret;
}
+static inline int
+bitmap_search_next_usable_block(int start, struct buffer_head *bh,
+ int maxblocks)
+{
+ int next;
+ struct journal_head *jh = bh2jh(bh);
+
+ /*
+ * The bitmap search --- search forward alternately through the actual
+ * bitmap and the last-committed copy until we find a bit free in
+ * both
+ */
+ while (start < maxblocks) {
+ next = ext3_find_next_zero_bit(bh->b_data, maxblocks, start);
+ if (next >= maxblocks)
+ return -1;
+ if (ext3_test_allocatable(next, bh))
+ return next;
+ jbd_lock_bh_state(bh);
+ if (jh->b_committed_data)
+ start = ext3_find_next_zero_bit(jh->b_committed_data,
+ maxblocks, next);
+ jbd_unlock_bh_state(bh);
+ }
+ return -1;
+}
+
/*
* Find an allocatable block in a bitmap. We honour both the bitmap and
* its last-committed copy (if that exists), and perform the "most
@@ -325,7 +425,6 @@ find_next_usable_block(int start, struct
{
int here, next;
char *p, *r;
- struct journal_head *jh = bh2jh(bh);
if (start > 0) {
/*
@@ -337,6 +436,8 @@ find_next_usable_block(int start, struct
* next 64-bit boundary is simple..
*/
int end_goal = (start + 63) & ~63;
+ if (end_goal > maxblocks)
+ end_goal = maxblocks;
here = ext3_find_next_zero_bit(bh->b_data, end_goal, start);
if (here < end_goal && ext3_test_allocatable(here, bh))
return here;
@@ -359,19 +460,8 @@ find_next_usable_block(int start, struct
* bitmap and the last-committed copy until we find a bit free in
* both
*/
- while (here < maxblocks) {
- next = ext3_find_next_zero_bit(bh->b_data, maxblocks, here);
- if (next >= maxblocks)
- return -1;
- if (ext3_test_allocatable(next, bh))
- return next;
- jbd_lock_bh_state(bh);
- if (jh->b_committed_data)
- here = ext3_find_next_zero_bit(jh->b_committed_data,
- maxblocks, next);
- jbd_unlock_bh_state(bh);
- }
- return -1;
+ here = bitmap_search_next_usable_block(here, bh, maxblocks);
+ return here;
}
/*
@@ -407,62 +497,421 @@ claim_block(spinlock_t *lock, int block,
*/
static int
ext3_try_to_allocate(struct super_block *sb, handle_t *handle, int group,
- struct buffer_head *bitmap_bh, int goal, int *errp)
+ struct buffer_head *bitmap_bh, int goal, struct reserve_window * my_rsv)
{
- int i;
- int fatal;
- int credits = 0;
+ int group_first_block, start, end;
- *errp = 0;
-
- /*
- * Make sure we use undo access for the bitmap, because it is critical
- * that we do the frozen_data COW on bitmap buffers in all cases even
- * if the buffer is in BJ_Forget state in the committing transaction.
- */
- BUFFER_TRACE(bitmap_bh, "get undo access for new block");
- fatal = ext3_journal_get_undo_access(handle, bitmap_bh, &credits);
- if (fatal) {
- *errp = fatal;
- goto fail;
+ /* we do allocation within the reservation window if we have a window */
+ if (my_rsv) {
+ group_first_block =
+ le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
+ group * EXT3_BLOCKS_PER_GROUP(sb);
+ if (my_rsv->rsv_start >= group_first_block)
+ start = my_rsv->rsv_start - group_first_block;
+ else
+ /* reservation window crosses the group boundary */
+ start = 0;
+ end = my_rsv->rsv_end - group_first_block + 1;
+ if (end > EXT3_BLOCKS_PER_GROUP(sb))
+ /* reservation window crosses the group boundary */
+ end = EXT3_BLOCKS_PER_GROUP(sb);
+ if ((start <= goal) && (goal < end))
+ start = goal;
+ else
+ goal = -1;
+ }
+ else {
+ if (goal > 0)
+ start = goal;
+ else
+ start = 0;
+ end = EXT3_BLOCKS_PER_GROUP(sb);
}
+ if (start > EXT3_BLOCKS_PER_GROUP(sb)) BUG();
+
repeat:
if (goal < 0 || !ext3_test_allocatable(goal, bitmap_bh)) {
- goal = find_next_usable_block(goal, bitmap_bh,
- EXT3_BLOCKS_PER_GROUP(sb));
+ goal = find_next_usable_block(start, bitmap_bh, end);
if (goal < 0)
goto fail_access;
-
- for (i = 0; i < 7 && goal > 0 &&
- ext3_test_allocatable(goal - 1, bitmap_bh);
- i++, goal--);
}
+ start = goal;
if (!claim_block(sb_bgl_lock(EXT3_SB(sb), group), goal, bitmap_bh)) {
/*
* The block was allocated by another thread, or it was
* allocated and then freed by another thread
*/
- goal++;
- if (goal >= EXT3_BLOCKS_PER_GROUP(sb))
+ start++; goal++;
+ if (start >= end)
goto fail_access;
goto repeat;
}
- BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for bitmap block");
- fatal = ext3_journal_dirty_metadata(handle, bitmap_bh);
+ return goal;
+fail_access:
+ return -1;
+}
+/**
+ * find_next_reservable_window():
+ * find a reservable space within the given range.
+ * It does not allocate the reservation window itself;
+ * alloc_new_reservation() will do that work later.
+ *
+ * @search_head: the head of the list to search;
+ * this is not necessarily the list head of the whole filesystem.
+ *
+ * We have both head and start_block to assist the search
+ * for the reservable space. The list starts from head,
+ * but we skip ahead to the place where start_block is,
+ * and from there we look for a reservable space.
+ *
+ * @fs_rsv_head: per-filesystem reservation list head.
+ *
+ * @size: the target new reservation window size
+ * @start_block: the first block we consider to start
+ * the real search from; updated to the start of the space found
+ *
+ * @last_block:
+ * the maximum block number that our goal reservable space
+ * could start from. This is normally the last block in this
+ * group. The search ends when the start of the next
+ * possible reservable space is beyond this boundary.
+ * This lets us handle a window request that crosses a group boundary.
+ *
+ * Basically we search the given range (start_block, last_block),
+ * rather than the whole doubly linked reservation list,
+ * to find a free region that is of the requested size and has not
+ * been reserved.
+ *
+ * On success, it returns the reservation window to append after.
+ * On failure, it returns NULL.
+ */
+static inline
+struct reserve_window* find_next_reservable_window(
+ struct reserve_window *search_head,
+ struct reserve_window *fs_rsv_head,
+ unsigned long size, int *start_block,
+ int last_block)
+{
+ struct reserve_window *rsv;
+ int cur;
+
+ /* TODO: make the start of the reservation window byte aligned */
+ /* cur = *start_block & ~7; */
+ cur = *start_block;
+ rsv = list_entry(search_head->rsv_list.next, struct reserve_window, rsv_list);
+ while (rsv != fs_rsv_head) {
+ if (cur + size <= rsv->rsv_start) {
+ /*
+ * found a reservable space big enough;
+ * the reservation could cross
+ * the group boundary here
+ */
+ break;
+ }
+ if (cur <= rsv->rsv_end)
+ cur = rsv->rsv_end + 1;
+
+ /* TODO?
+ * In case we cannot find a reservable space of the
+ * expected size, we could remember during the search
+ * the largest reservable space we have seen
+ * and return that instead.
+ *
+ * For now we fail if we cannot find a reservable
+ * space of the expected size (or more)...
+ */
+ rsv = list_entry(rsv->rsv_list.next, struct reserve_window, rsv_list);
+ if (cur > last_block)
+ return NULL; /* fail */
+ }
+ /*
+ * We get here either:
+ * when we reach the end of the whole list,
+ * and there is reservable space after the last entry in the list;
+ * we append to the end of the list.
+ *
+ * or when we found a reservable space in the middle of the list;
+ * we return the reservation window that we can append after.
+ * Either way, we succeed.
+ */
+ *start_block = cur;
+ return list_entry(rsv->rsv_list.prev, struct reserve_window, rsv_list);
+}
+
+/**
+ * alloc_new_reservation()--allocate a new reservation window
+ * If there is an existing reservation, discard it first,
+ * then allocate the new one starting after it.
+ * Otherwise allocate the new reservation from the given
+ * start block, or from the beginning of the group if a goal
+ * is not given.
+ *
+ * To make a new reservation, we search part of the filesystem
+ * reservation list (the part that lies inside the group).
+ *
+ * If we have an old reservation, the search starts after the end of
+ * that reservation. If we do not have an old reservation, then we
+ * start from the given goal, or from the first block of the group if
+ * the goal is not given.
+ *
+ * We first find a reservable space after the goal, then from
+ * there we check the bitmap for the first free block after
+ * it. If there is no free block before the end of the group, then the
+ * rest of the group is full and we fail. Otherwise, we check whether the
+ * free block is inside the expected reservable space; if so, we succeed.
+ * If the first free block is outside the reservable space, then
+ * starting from the first free block we search for the next available
+ * space, and so on.
+ *
+ * On success, a new reservation is found and inserted into the list.
+ * It contains at least one free block, and it does not overlap with any
+ * other reservation window.
+ *
+ * On failure, no reservation window could be found in this group.
+ *
+ * @my_rsv: the reservation
+ *
+ * @goal: The goal. It is where the search for a
+ * free reservable space should start from.
+ * If we have an old reservation, start_block is at least just past
+ * the end of the old reservation. Otherwise,
+ * if we have a goal (goal >= 0), we start from there;
+ * with no goal (goal = -1), we start from the first block
+ * of the group.
+ *
+ * @sb: the super block
+ * @group: the group we are trying to allocate in
+ * @bitmap_bh: the block group block bitmap
+ */
+static int alloc_new_reservation(struct reserve_window *my_rsv,
+ int goal, struct super_block *sb,
+ unsigned int group, struct buffer_head *bitmap_bh)
+{
+ struct reserve_window *search_head;
+ int group_first_block, group_end_block, start_block;
+ int first_free_block;
+ int reservable_space_start;
+ struct reserve_window *prev_rsv;
+ struct reserve_window *fs_rsv_head = &EXT3_SB(sb)->s_rsv_window_head;
+ unsigned long size;
+
+ group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
+ group * EXT3_BLOCKS_PER_GROUP(sb);
+ group_end_block = group_first_block + EXT3_BLOCKS_PER_GROUP(sb) - 1;
+
+ if (goal < 0)
+ start_block = group_first_block;
+ else
+ start_block = goal + group_first_block;
+
+ size = my_rsv->rsv_goal_size;
+ /* if we have a old reservation, discard it first */
+ if (!rsv_is_empty(my_rsv)) {
+ /*
+ * If the old reservation crosses the group boundary,
+ * we get here when we have just failed to allocate from
+ * the first part of the window. We still have another part
+ * that belongs to the next group. In this case, there is no
+ * point in discarding our window and trying to allocate a new
+ * one in this group (which will fail). We should
+ * keep the reservation window and simply move on.
+ *
+ * Maybe we could shift the start block of the reservation window
+ * to the first block of the next group...
+ */
+
+ if ((my_rsv->rsv_start <= group_end_block) && (my_rsv->rsv_end > group_end_block))
+ return -1;
+
+ /* remember where we are before we discard the old one */
+ if (my_rsv->rsv_end + 1 > start_block)
+ start_block = my_rsv->rsv_end + 1;
+ search_head = list_entry(my_rsv->rsv_list.prev,
+ struct reserve_window, rsv_list);
+ rsv_window_remove(my_rsv);
+ }
+ else {
+ /*
+ * we don't have a reservation,
+ * we set our goal(start_block) and
+ * the list head for the search
+ */
+ search_head = fs_rsv_head;
+ }
+
+ /*
+ * find_next_reservable_window() simply finds a reservable window
+ * inside the given range (start_block, group_end_block).
+ *
+ * To make sure the reservation window has a free bit inside it, we need
+ * to check the bitmap after we find a reservable window.
+ */
+retry:
+ prev_rsv = find_next_reservable_window(search_head, fs_rsv_head, size,
+ &start_block, group_end_block);
+ if (prev_rsv == NULL)
+ goto failed;
+ reservable_space_start = start_block;
+ /*
+ * On success, find_next_reservable_window() returns the
+ * reservation window that has a reservable space after it.
+ * Before we reserve this reservable space, we need
+ * to make sure there is at least one free block inside this region.
+ *
+ * We search for the first free bit in the block bitmap and in the
+ * copy of the last committed bitmap alternately, until we find an
+ * allocatable block. The search starts from the start block of the
+ * reservable space we just found.
+ */
+ first_free_block = bitmap_search_next_usable_block(
+ reservable_space_start - group_first_block,
+ bitmap_bh, group_end_block - group_first_block + 1);
+
+ if (first_free_block < 0) {
+ /*
+ * no free block left in the bitmap, so there is no point
+ * in reserving the space; return failure.
+ */
+ goto failed;
+ }
+ start_block = first_free_block + group_first_block;
+ /*
+ * check if the first free block is within the
+ * free space we just found
+ */
+ if ((start_block >= reservable_space_start) &&
+ (start_block < reservable_space_start + size))
+ goto found_rsv_window;
+ /*
+ * If the first free bit we found is outside the reservable space,
+ * there is no free block in the reservable space.
+ * We should continue searching for the next reservable space,
+ * starting from where the free block is;
+ * we also move the list head to where we stopped last time.
+ */
+ search_head = prev_rsv;
+ goto retry;
+
+found_rsv_window:
+ /*
+ * great! the reservable space contains some free blocks.
+ * Insert it to the list.
+ */
+ rsv_window_add(my_rsv, prev_rsv);
+ my_rsv->rsv_start = reservable_space_start;
+ my_rsv->rsv_end = my_rsv->rsv_start + size - 1;
+ return 0; /* succeed */
+failed:
+ return -1; /* failed */
+}
+/*
+ * This is the main function used to allocate a new block and
+ * its reservation window.
+ * Each time a new block allocation is needed, we first try to allocate
+ * from the inode's own reservation.
+ * If it does not have a reservation window, instead of looking
+ * for a free bit in the bitmap first and then searching the reservation
+ * list to see whether it is inside somebody else's reservation window,
+ * we first try to allocate a reservation window for it starting from the goal.
+ * Then we do the block allocation within the reservation window.
+ *
+ * This avoids searching the reservation list again and again
+ * when somebody is looking for a free block (without a reservation)
+ * and there are lots of free blocks, but they are all reserved.
+ *
+ * We use a sorted doubly linked list for the per-filesystem reservation list.
+ * The insert, remove and find-free-space (non-reserved) operations on the
+ * sorted doubly linked list should be fast.
+ *
+ */
+static int
+ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
+ unsigned int group, struct buffer_head *bitmap_bh,
+ int goal, struct reserve_window * my_rsv,
+ int *errp)
+{
+ spinlock_t *rsv_lock;
+ unsigned long group_first_block;
+ int ret = 0;
+ int fatal;
+ int credits = 0;
+
+ *errp = 0;
+
+ /*
+ * Make sure we use undo access for the bitmap, because it is critical
+ * that we do the frozen_data COW on bitmap buffers in all cases even
+ * if the buffer is in BJ_Forget state in the committing transaction.
+ */
+ BUFFER_TRACE(bitmap_bh, "get undo access for new block");
+ fatal = ext3_journal_get_undo_access(handle, bitmap_bh, &credits);
if (fatal) {
*errp = fatal;
- goto fail;
+ return -1;
+ }
+
+#ifdef EXT3_RESERVATION
+ rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
+ /*
+ * goal is a group relative block number (if there is a goal)
+ * 0 < goal < EXT3_BLOCKS_PER_GROUP(sb)
+ * first block is a filesystem wide block number
+ * first block is the block number of the first block in this group
+ */
+ group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
+ group * EXT3_BLOCKS_PER_GROUP(sb);
+
+ /*
+ * Basically we will allocate a new block from the inode's reservation window.
+ *
+ * We need to allocate a new reservation window if:
+ * a) the inode does not have a reservation window; or
+ * b) the last attempt to allocate a block from the existing reservation failed; or
+ * c) we come here with a goal that lies outside the existing reservation window
+ *
+ * We do not need to allocate a new reservation window if
+ * we come here at the beginning with a goal and the goal is inside the window,
+ * or we don't have a goal but already have a reservation window.
+ * Then we can allocate from the reservation window directly.
+ */
+ while (1) {
+ if (rsv_is_empty(my_rsv) || (ret < 0) ||
+ !goal_in_my_reservation(my_rsv, goal, group, sb)) {
+ spin_lock(rsv_lock);
+ ret = alloc_new_reservation(my_rsv, goal, sb, group, bitmap_bh);
+ spin_unlock(rsv_lock);
+ if (ret < 0)
+ break; /* failed */
+
+ if (!goal_in_my_reservation(my_rsv, goal, group, sb)) {
+ goal = -1;
+ }
+ }
+ ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal,
+ my_rsv);
+ if (ret >= 0)
+ break; /* succeed */
+ }
+#else
+ ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL);
+#endif
+
+ if (ret >= 0) {
+ BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for bitmap block");
+ fatal = ext3_journal_dirty_metadata(handle, bitmap_bh);
+ if (fatal) {
+ *errp = fatal;
+ return -1;
+ }
+ return ret;
}
- return goal;
-fail_access:
BUFFER_TRACE(bitmap_bh, "journal_release_buffer");
ext3_journal_release_buffer(handle, bitmap_bh, credits);
-fail:
- return -1;
+ return ret;
}
/*
@@ -489,6 +938,7 @@ ext3_new_block(handle_t *handle, struct
struct ext3_group_desc *gdp;
struct ext3_super_block *es;
struct ext3_sb_info *sbi;
+ struct reserve_window *my_rsv = &EXT3_I(inode)->i_rsv_window;
#ifdef EXT3FS_DEBUG
static int goal_hits, goal_attempts;
#endif
@@ -539,8 +989,8 @@ ext3_new_block(handle_t *handle, struct
bitmap_bh = read_block_bitmap(sb, group_no);
if (!bitmap_bh)
goto io_error;
- ret_block = ext3_try_to_allocate(sb, handle, group_no,
- bitmap_bh, ret_block, &fatal);
+ ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
+ bitmap_bh, ret_block, my_rsv, &fatal);
if (fatal)
goto out;
if (ret_block >= 0)
@@ -568,8 +1018,8 @@ ext3_new_block(handle_t *handle, struct
bitmap_bh = read_block_bitmap(sb, group_no);
if (!bitmap_bh)
goto io_error;
- ret_block = ext3_try_to_allocate(sb, handle, group_no,
- bitmap_bh, -1, &fatal);
+ ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
+ bitmap_bh, -1, my_rsv, &fatal);
if (fatal)
goto out;
if (ret_block >= 0)
diff -urNp -X dontdiff 264-rsv-cleanup/fs/ext3/file.c 264-rsv-cleanup-base/fs/ext3/file.c
--- 264-rsv-cleanup/fs/ext3/file.c 2004-04-06 01:49:11.000000000 -0700
+++ 264-rsv-cleanup-base/fs/ext3/file.c 2004-04-06 02:25:44.000000000 -0700
@@ -33,6 +33,8 @@
*/
static int ext3_release_file (struct inode * inode, struct file * filp)
{
+ if (filp->f_mode & FMODE_WRITE)
+ ext3_discard_reservation(inode);
if (is_dx(inode) && filp->private_data)
ext3_htree_free_dir_info(filp->private_data);
diff -urNp -X dontdiff 264-rsv-cleanup/fs/ext3/ialloc.c 264-rsv-cleanup-base/fs/ext3/ialloc.c
--- 264-rsv-cleanup/fs/ext3/ialloc.c 2004-04-06 01:49:11.000000000 -0700
+++ 264-rsv-cleanup-base/fs/ext3/ialloc.c 2004-04-06 01:54:16.000000000 -0700
@@ -29,6 +29,7 @@
#include "xattr.h"
#include "acl.h"
+
/*
* ialloc.c contains the inodes allocation and deallocation routines
*/
@@ -581,6 +582,10 @@ got:
ei->i_file_acl = 0;
ei->i_dir_acl = 0;
ei->i_dtime = 0;
+ ei->i_rsv_window.rsv_start = 0;
+ ei->i_rsv_window.rsv_end= 0;
+ ei->i_rsv_window.rsv_goal_size = EXT3_DEFAULT_RESERVE_BLOCKS;
+ INIT_LIST_HEAD(&ei->i_rsv_window.rsv_list);
ei->i_block_group = group;
ext3_set_inode_flags(inode);
diff -urNp -X dontdiff 264-rsv-cleanup/fs/ext3/inode.c 264-rsv-cleanup-base/fs/ext3/inode.c
--- 264-rsv-cleanup/fs/ext3/inode.c 2004-04-06 01:49:11.000000000 -0700
+++ 264-rsv-cleanup-base/fs/ext3/inode.c 2004-04-06 01:56:16.000000000 -0700
@@ -185,6 +185,8 @@ static int ext3_journal_test_restart(han
*/
void ext3_put_inode(struct inode *inode)
{
+ if (!is_bad_inode(inode))
+ ext3_discard_reservation(inode);
}
/*
@@ -2053,6 +2055,8 @@ void ext3_truncate(struct inode * inode)
return;
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
return;
+
+ ext3_discard_reservation(inode);
/*
* We have to lock the EOF page here, because lock_page() nests
@@ -2446,7 +2450,10 @@ void ext3_read_inode(struct inode * inod
ei->i_disksize = inode->i_size;
inode->i_generation = le32_to_cpu(raw_inode->i_generation);
ei->i_block_group = iloc.block_group;
-
+ ei->i_rsv_window.rsv_start = 0;
+ ei->i_rsv_window.rsv_end= 0;
+ ei->i_rsv_window.rsv_goal_size = EXT3_DEFAULT_RESERVE_BLOCKS;
+ INIT_LIST_HEAD(&ei->i_rsv_window.rsv_list);
/*
* NOTE! The in-memory inode i_data array is in little-endian order
* even on big-endian machines: we do NOT byteswap the block numbers!
diff -urNp -X dontdiff 264-rsv-cleanup/fs/ext3/super.c 264-rsv-cleanup-base/fs/ext3/super.c
--- 264-rsv-cleanup/fs/ext3/super.c 2004-04-06 01:49:11.000000000 -0700
+++ 264-rsv-cleanup-base/fs/ext3/super.c 2004-04-06 01:50:03.000000000 -0700
@@ -1291,6 +1291,13 @@ static int ext3_fill_super (struct super
sbi->s_gdb_count = db_count;
get_random_bytes(&sbi->s_next_generation, sizeof(u32));
spin_lock_init(&sbi->s_next_gen_lock);
+ /* per filesystem reservation list head & lock */
+ spin_lock_init(&sbi->s_rsv_window_lock);
+ INIT_LIST_HEAD(&sbi->s_rsv_window_head.rsv_list);
+ sbi->s_rsv_window_head.rsv_start = 0;
+ sbi->s_rsv_window_head.rsv_end = 0;
+ sbi->s_rsv_window_head.rsv_goal_size = 0;
+
/*
* set up enough so that it can read an inode
*/
diff -urNp -X dontdiff 264-rsv-cleanup/include/linux/ext3_fs.h 264-rsv-cleanup-base/include/linux/ext3_fs.h
--- 264-rsv-cleanup/include/linux/ext3_fs.h 2004-04-06 01:49:13.000000000 -0700
+++ 264-rsv-cleanup-base/include/linux/ext3_fs.h 2004-04-06 01:59:01.000000000 -0700
@@ -33,6 +33,11 @@ struct statfs;
#undef EXT3FS_DEBUG
/*
+ * Define EXT3_RESERVATION to reserve data blocks for expanding files
+ */
+#define EXT3_RESERVATION
+#define EXT3_DEFAULT_RESERVE_BLOCKS 8
+/*
* Always enable hashed directories
*/
#define CONFIG_EXT3_INDEX
@@ -721,6 +726,7 @@ extern void ext3_put_inode (struct inode
extern void ext3_delete_inode (struct inode *);
extern int ext3_sync_inode (handle_t *, struct inode *);
extern void ext3_discard_prealloc (struct inode *);
+extern void ext3_discard_reservation (struct inode *);
extern void ext3_dirty_inode(struct inode *);
extern int ext3_change_inode_journal_flag(struct inode *, int);
extern void ext3_truncate (struct inode *);
diff -urNp -X dontdiff 264-rsv-cleanup/include/linux/ext3_fs_i.h 264-rsv-cleanup-base/include/linux/ext3_fs_i.h
--- 264-rsv-cleanup/include/linux/ext3_fs_i.h 2004-04-06 01:49:13.000000000 -0700
+++ 264-rsv-cleanup-base/include/linux/ext3_fs_i.h 2004-04-06 02:00:35.000000000 -0700
@@ -18,8 +18,15 @@
#include <linux/rwsem.h>
+struct reserve_window{
+ struct list_head rsv_list;
+ __u32 rsv_start;
+ __u32 rsv_end;
+ unsigned short rsv_goal_size;
+};
+
/*
- * second extended file system inode data in memory
+ * third extended file system inode data in memory
*/
struct ext3_inode_info {
__u32 i_data[15];
@@ -57,6 +64,9 @@ struct ext3_inode_info {
* allocation when we detect linearly ascending requests.
*/
__u32 i_next_alloc_goal;
+ /* block reservation window */
+ struct reserve_window i_rsv_window;
+
__u32 i_dir_start_lookup;
#ifdef CONFIG_EXT3_FS_XATTR
/*
diff -urNp -X dontdiff 264-rsv-cleanup/include/linux/ext3_fs_sb.h 264-rsv-cleanup-base/include/linux/ext3_fs_sb.h
--- 264-rsv-cleanup/include/linux/ext3_fs_sb.h 2004-04-06 01:49:13.000000000 -0700
+++ 264-rsv-cleanup-base/include/linux/ext3_fs_sb.h 2004-04-06 01:50:04.000000000 -0700
@@ -59,6 +59,10 @@ struct ext3_sb_info {
struct percpu_counter s_dirs_counter;
struct blockgroup_lock s_blockgroup_lock;
+ /* head of the per fs reservation window list */
+ spinlock_t s_rsv_window_lock;
+ struct reserve_window s_rsv_window_head;
+
/* Journaling */
struct inode * s_journal_inode;
struct journal_s * s_journal;
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 3/4] ext3 block reservation patch set --mount and ioctl feature
2004-04-14 0:52 ` [PATCH 0/4] ext3 block reservation patch set Mingming Cao
2004-04-14 0:54 ` [PATCH 1/4] ext3 block reservation patch set -- ext3 preallocation cleanup Mingming Cao
2004-04-14 0:57 ` [PATCH 2/4] ext3 block reservation patch set --ext3 block reservation Mingming Cao
@ 2004-04-14 0:58 ` Mingming Cao
2004-04-14 1:00 ` [PATCH 4/4] ext3 block reservation patch set -- dynamically increase reservation window Mingming Cao
` (2 subsequent siblings)
5 siblings, 0 replies; 28+ messages in thread
From: Mingming Cao @ 2004-04-14 0:58 UTC (permalink / raw)
To: Mingming Cao; +Cc: Andrew Morton, tytso, pbadari, linux-kernel, ext2-devel
[-- Attachment #1: Type: text/plain, Size: 844 bytes --]
> [patch 3] ext3_rsv_mount.patch: Adds features on top of
> ext3_rsv_base.patch:
> - handle the bogus early -ENOSPC error
> - do block reservation only for regular files
> - make the ext3 reservation feature a mount option:
> new mount option added: reservation
> - add a pair of file ioctl commands that let applications control
> the block reservation window size.
>
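The "regular files only" and mount-option items amount to one decision made per allocation: whether to hand the allocator a reservation window at all. Here is a minimal userspace sketch of that decision, assuming hypothetical names (toy_inode, toy_window, pick_window) that stand in for the kernel structures and are not part of the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the inode state the patch consults. */
struct toy_window {
	unsigned long start, end;
};

struct toy_inode {
	bool is_regular;		/* models S_ISREG(inode->i_mode) */
	struct toy_window window;	/* models the per-inode window */
};

/*
 * Return the window to allocate from, or NULL to fall back to plain
 * bitmap allocation. `opt_reservation` models the "reservation" mount
 * option being set.
 */
static struct toy_window *pick_window(struct toy_inode *inode,
				      bool opt_reservation)
{
	assert(inode != NULL);
	if (opt_reservation && inode->is_regular)
		return &inode->window;
	return NULL;
}
```

A NULL result corresponds to the my_rsv == NULL path in the patch, where allocation proceeds straight from the bitmap without consulting the reservation list.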
diffstat ext3_rsv_mount.patch
fs/ext3/balloc.c | 49 ++++++++++++++++++++++++++++++++++++----------
fs/ext3/ialloc.c | 2 -
fs/ext3/inode.c | 2 -
fs/ext3/ioctl.c | 23 +++++++++++++++++++++
fs/ext3/super.c | 20 ++++++++++++++++--
include/linux/ext3_fs.h | 42 ++++++++++++++++++++++-----------------
include/linux/ext3_fs_i.h | 2 -
7 files changed, 107 insertions(+), 33 deletions(-)
[-- Attachment #2: ext3_rsv_mount.patch --]
[-- Type: text/x-patch, Size: 12330 bytes --]
diff -urNp 264-rsv-cleanup-base/fs/ext3/balloc.c 264-rsv-cleanup-base-mount/fs/ext3/balloc.c
--- 264-rsv-cleanup-base/fs/ext3/balloc.c 2004-04-13 01:42:40.352200376 -0700
+++ 264-rsv-cleanup-base-mount/fs/ext3/balloc.c 2004-04-13 01:44:50.091477008 -0700
@@ -707,7 +707,7 @@ static int alloc_new_reservation(struct
else
start_block = goal + group_first_block;
- size = my_rsv->rsv_goal_size;
+ size = atomic_read(&my_rsv->rsv_goal_size);
/* if we have a old reservation, discard it first */
if (!rsv_is_empty(my_rsv)) {
/*
@@ -853,7 +853,16 @@ ext3_try_to_allocate_with_rsv(struct sup
return -1;
}
-#ifdef EXT3_RESERVATION
+ /*
+ * We skip reservations when the
+ * filesystem is mounted without reservation,
+ * or the file is not a regular file,
+ * or the last attempt to allocate a block with reservation turned on failed.
+ */
+ if (my_rsv == NULL ) {
+ ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL);
+ goto out;
+ }
rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
/*
* goal is a group relative block number (if there is a goal)
@@ -895,10 +904,7 @@ ext3_try_to_allocate_with_rsv(struct sup
if (ret >= 0)
break; /* succeed */
}
-#else
- ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL);
-#endif
-
+out:
if (ret >= 0) {
BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for bitmap block");
fatal = ext3_journal_dirty_metadata(handle, bitmap_bh);
@@ -927,7 +933,7 @@ ext3_new_block(handle_t *handle, struct
{
struct buffer_head *bitmap_bh = NULL; /* bh */
struct buffer_head *gdp_bh; /* bh2 */
- int group_no; /* i */
+ int group_no, goal_group; /* i */
int ret_block; /* j */
int bgi; /* blockgroup iteration index */
int target_block; /* tmp */
@@ -938,7 +944,7 @@ ext3_new_block(handle_t *handle, struct
struct ext3_group_desc *gdp;
struct ext3_super_block *es;
struct ext3_sb_info *sbi;
- struct reserve_window *my_rsv = &EXT3_I(inode)->i_rsv_window;
+ struct reserve_window *my_rsv = NULL;
#ifdef EXT3FS_DEBUG
static int goal_hits, goal_attempts;
#endif
@@ -960,7 +966,10 @@ ext3_new_block(handle_t *handle, struct
sbi = EXT3_SB(sb);
es = EXT3_SB(sb)->s_es;
ext3_debug("goal=%lu.\n", goal);
-
+#ifdef EXT3_RESERVATION
+ if (test_opt(sb, RESERVATION) && S_ISREG(inode->i_mode))
+ my_rsv = &EXT3_I(inode)->i_rsv_window;
+#endif
free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
root_blocks = le32_to_cpu(es->s_r_blocks_count);
if (free_blocks < root_blocks + 1 && !capable(CAP_SYS_RESOURCE) &&
@@ -982,6 +991,8 @@ ext3_new_block(handle_t *handle, struct
if (!gdp)
goto io_error;
+ goal_group = group_no;
+retry:
free_blocks = le16_to_cpu(gdp->bg_free_blocks_count);
if (free_blocks > 0) {
ret_block = ((goal - le32_to_cpu(es->s_first_data_block)) %
@@ -1025,7 +1036,25 @@ ext3_new_block(handle_t *handle, struct
if (ret_block >= 0)
goto allocated;
}
-
+#ifdef EXT3_RESERVATION
+ /*
+ * We may end up with a bogus early ENOSPC error because the
+ * filesystem is "full" of reservations, while
+ * there may indeed be free blocks available on disk.
+ * In this case, we just forget about the reservations and
+ * do block allocation as we would without reservations.
+ */
+ if (my_rsv) {
+#ifdef EXT3_RESERVATION_DEBUG
+ printk("filesystem is full reserved. Actual free blocks is %d. "
+ "Try to do allocation without reservation, goal_group is %d\n",
+ free_blocks, goal_group);
+#endif
+ my_rsv = NULL;
+ group_no = goal_group;
+ goto retry;
+ }
+#endif
/* No space left on the device */
*errp = -ENOSPC;
goto out;
diff -urNp 264-rsv-cleanup-base/fs/ext3/ialloc.c 264-rsv-cleanup-base-mount/fs/ext3/ialloc.c
--- 264-rsv-cleanup-base/fs/ext3/ialloc.c 2004-04-06 01:54:16.000000000 -0700
+++ 264-rsv-cleanup-base-mount/fs/ext3/ialloc.c 2004-04-10 01:19:52.705503744 -0700
@@ -584,7 +584,7 @@ got:
ei->i_dtime = 0;
ei->i_rsv_window.rsv_start = 0;
ei->i_rsv_window.rsv_end= 0;
- ei->i_rsv_window.rsv_goal_size = EXT3_DEFAULT_RESERVE_BLOCKS;
+ atomic_set(&ei->i_rsv_window.rsv_goal_size, EXT3_DEFAULT_RESERVE_BLOCKS);
INIT_LIST_HEAD(&ei->i_rsv_window.rsv_list);
ei->i_block_group = group;
diff -urNp 264-rsv-cleanup-base/fs/ext3/inode.c 264-rsv-cleanup-base-mount/fs/ext3/inode.c
--- 264-rsv-cleanup-base/fs/ext3/inode.c 2004-04-06 01:56:16.000000000 -0700
+++ 264-rsv-cleanup-base-mount/fs/ext3/inode.c 2004-04-10 01:19:52.712502680 -0700
@@ -2452,7 +2452,7 @@ void ext3_read_inode(struct inode * inod
ei->i_block_group = iloc.block_group;
ei->i_rsv_window.rsv_start = 0;
ei->i_rsv_window.rsv_end= 0;
- ei->i_rsv_window.rsv_goal_size = EXT3_DEFAULT_RESERVE_BLOCKS;
+ atomic_set(&ei->i_rsv_window.rsv_goal_size, EXT3_DEFAULT_RESERVE_BLOCKS);
INIT_LIST_HEAD(&ei->i_rsv_window.rsv_list);
/*
* NOTE! The in-memory inode i_data array is in little-endian order
diff -urNp 264-rsv-cleanup-base/fs/ext3/ioctl.c 264-rsv-cleanup-base-mount/fs/ext3/ioctl.c
--- 264-rsv-cleanup-base/fs/ext3/ioctl.c 2004-04-06 00:30:07.000000000 -0700
+++ 264-rsv-cleanup-base-mount/fs/ext3/ioctl.c 2004-04-10 01:19:52.713502528 -0700
@@ -20,6 +20,7 @@ int ext3_ioctl (struct inode * inode, st
{
struct ext3_inode_info *ei = EXT3_I(inode);
unsigned int flags;
+ unsigned short rsv_window_size;
ext3_debug ("cmd = %u, arg = %lu\n", cmd, arg);
@@ -151,6 +152,28 @@ flags_err:
return ret;
}
#endif
+#ifdef EXT3_RESERVATION
+ case EXT3_IOC_GETRSVSZ:
+ rsv_window_size = atomic_read(&ei->i_rsv_window.rsv_goal_size);
+ return put_user(rsv_window_size, (int *) arg);
+ return 0;
+ case EXT3_IOC_SETRSVSZ:
+ {
+ if (IS_RDONLY(inode))
+ return -EROFS;
+
+ if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER))
+ return -EACCES;
+
+ if (get_user(rsv_window_size, (int *) arg))
+ return -EFAULT;
+
+ if (rsv_window_size > EXT3_MAX_RESERVE_BLOCKS)
+ rsv_window_size = EXT3_MAX_RESERVE_BLOCKS;
+ atomic_set(&ei->i_rsv_window.rsv_goal_size, rsv_window_size);
+ return 0;
+ }
+#endif
default:
return -ENOTTY;
}
diff -urNp 264-rsv-cleanup-base/fs/ext3/super.c 264-rsv-cleanup-base-mount/fs/ext3/super.c
--- 264-rsv-cleanup-base/fs/ext3/super.c 2004-04-06 01:50:03.000000000 -0700
+++ 264-rsv-cleanup-base-mount/fs/ext3/super.c 2004-04-10 01:19:52.719501616 -0700
@@ -533,7 +533,8 @@ enum {
Opt_bsd_df, Opt_minix_df, Opt_grpid, Opt_nogrpid,
Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, Opt_err_ro,
Opt_nouid32, Opt_check, Opt_nocheck, Opt_debug, Opt_oldalloc, Opt_orlov,
- Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl, Opt_noload,
+ Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
+ Opt_reservation, Opt_noreservation, Opt_noload,
Opt_commit, Opt_journal_update, Opt_journal_inum,
Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
Opt_ignore, Opt_err,
@@ -563,6 +564,8 @@ static match_table_t tokens = {
{Opt_nouser_xattr, "nouser_xattr"},
{Opt_acl, "acl"},
{Opt_noacl, "noacl"},
+ {Opt_reservation, "reservation"},
+ {Opt_noreservation, "noreservation"},
{Opt_noload, "noload"},
{Opt_commit, "commit=%u"},
{Opt_journal_update, "journal=update"},
@@ -706,6 +709,19 @@ static int parse_options (char * options
printk("EXT3 (no)acl options not supported\n");
break;
#endif
+#ifdef EXT3_RESERVATION
+ case Opt_reservation:
+ set_opt(sbi->s_mount_opt, RESERVATION);
+ break;
+ case Opt_noreservation:
+ clear_opt(sbi->s_mount_opt, RESERVATION);
+ break;
+#else
+ case Opt_reservation:
+ case Opt_noreservation:
+ printk("EXT3 block reservation options not supported\n");
+ break;
+#endif
case Opt_journal_update:
/* @@@ FIXME */
/* Eventually we will want to be able to create
@@ -1296,7 +1312,7 @@ static int ext3_fill_super (struct super
INIT_LIST_HEAD(&sbi->s_rsv_window_head.rsv_list);
sbi->s_rsv_window_head.rsv_start = 0;
sbi->s_rsv_window_head.rsv_end = 0;
- sbi->s_rsv_window_head.rsv_goal_size = 0;
+ atomic_set(&sbi->s_rsv_window_head.rsv_goal_size, 0);
/*
* set up enough so that it can read an inode
diff -urNp 264-rsv-cleanup-base/include/linux/ext3_fs.h 264-rsv-cleanup-base-mount/include/linux/ext3_fs.h
--- 264-rsv-cleanup-base/include/linux/ext3_fs.h 2004-04-06 01:59:01.000000000 -0700
+++ 264-rsv-cleanup-base-mount/include/linux/ext3_fs.h 2004-04-13 01:43:53.159132040 -0700
@@ -37,6 +37,7 @@ struct statfs;
*/
#define EXT3_RESERVATION
#define EXT3_DEFAULT_RESERVE_BLOCKS 8
+#define EXT3_MAX_RESERVE_BLOCKS 1024
/*
* Always enable hashed directories
*/
@@ -207,6 +208,10 @@ struct ext3_group_desc
#ifdef CONFIG_JBD_DEBUG
#define EXT3_IOC_WAIT_FOR_READONLY _IOR('f', 99, long)
#endif
+#ifdef EXT3_RESERVATION
+#define EXT3_IOC_GETRSVSZ _IOR('r', 1, long)
+#define EXT3_IOC_SETRSVSZ _IOW('r', 2, long)
+#endif
/*
* Structure of an inode on the disk
@@ -305,24 +310,25 @@ struct ext3_inode {
/*
* Mount flags
*/
-#define EXT3_MOUNT_CHECK 0x0001 /* Do mount-time checks */
-#define EXT3_MOUNT_OLDALLOC 0x0002 /* Don't use the new Orlov allocator */
-#define EXT3_MOUNT_GRPID 0x0004 /* Create files with directory's group */
-#define EXT3_MOUNT_DEBUG 0x0008 /* Some debugging messages */
-#define EXT3_MOUNT_ERRORS_CONT 0x0010 /* Continue on errors */
-#define EXT3_MOUNT_ERRORS_RO 0x0020 /* Remount fs ro on errors */
-#define EXT3_MOUNT_ERRORS_PANIC 0x0040 /* Panic on errors */
-#define EXT3_MOUNT_MINIX_DF 0x0080 /* Mimics the Minix statfs */
-#define EXT3_MOUNT_NOLOAD 0x0100 /* Don't use existing journal*/
-#define EXT3_MOUNT_ABORT 0x0200 /* Fatal error detected */
-#define EXT3_MOUNT_DATA_FLAGS 0x0C00 /* Mode for data writes: */
- #define EXT3_MOUNT_JOURNAL_DATA 0x0400 /* Write data to journal */
- #define EXT3_MOUNT_ORDERED_DATA 0x0800 /* Flush data before commit */
- #define EXT3_MOUNT_WRITEBACK_DATA 0x0C00 /* No data ordering */
-#define EXT3_MOUNT_UPDATE_JOURNAL 0x1000 /* Update the journal format */
-#define EXT3_MOUNT_NO_UID32 0x2000 /* Disable 32-bit UIDs */
-#define EXT3_MOUNT_XATTR_USER 0x4000 /* Extended user attributes */
-#define EXT3_MOUNT_POSIX_ACL 0x8000 /* POSIX Access Control Lists */
+#define EXT3_MOUNT_CHECK 0x00001 /* Do mount-time checks */
+#define EXT3_MOUNT_OLDALLOC 0x00002 /* Don't use the new Orlov allocator */
+#define EXT3_MOUNT_GRPID 0x00004 /* Create files with directory's group */
+#define EXT3_MOUNT_DEBUG 0x00008 /* Some debugging messages */
+#define EXT3_MOUNT_ERRORS_CONT 0x00010 /* Continue on errors */
+#define EXT3_MOUNT_ERRORS_RO 0x00020 /* Remount fs ro on errors */
+#define EXT3_MOUNT_ERRORS_PANIC 0x00040 /* Panic on errors */
+#define EXT3_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */
+#define EXT3_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/
+#define EXT3_MOUNT_ABORT 0x00200 /* Fatal error detected */
+#define EXT3_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */
+#define EXT3_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */
+#define EXT3_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */
+#define EXT3_MOUNT_WRITEBACK_DATA 0x00C00 /* No data ordering */
+#define EXT3_MOUNT_UPDATE_JOURNAL 0x01000 /* Update the journal format */
+#define EXT3_MOUNT_NO_UID32 0x02000 /* Disable 32-bit UIDs */
+#define EXT3_MOUNT_XATTR_USER 0x04000 /* Extended user attributes */
+#define EXT3_MOUNT_POSIX_ACL 0x08000 /* POSIX Access Control Lists */
+#define EXT3_MOUNT_RESERVATION 0x10000 /* Preallocation */
/* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
diff -urNp 264-rsv-cleanup-base/include/linux/ext3_fs_i.h 264-rsv-cleanup-base-mount/include/linux/ext3_fs_i.h
--- 264-rsv-cleanup-base/include/linux/ext3_fs_i.h 2004-04-06 02:00:35.000000000 -0700
+++ 264-rsv-cleanup-base-mount/include/linux/ext3_fs_i.h 2004-04-10 01:19:52.723501008 -0700
@@ -22,7 +22,7 @@ struct reserve_window{
struct list_head rsv_list;
__u32 rsv_start;
__u32 rsv_end;
- unsigned short rsv_goal_size;
+ atomic_t rsv_goal_size;
};
/*
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 4/4] ext3 block reservation patch set -- dynamically increase reservation window
2004-04-14 0:52 ` [PATCH 0/4] ext3 block reservation patch set Mingming Cao
` (2 preceding siblings ...)
2004-04-14 0:58 ` [PATCH 3/4] ext3 block reservation patch set --mount and ioctl feature Mingming Cao
@ 2004-04-14 1:00 ` Mingming Cao
2004-04-14 2:47 ` [PATCH 0/4] ext3 block reservation patch set Andrew Morton
2004-04-27 15:19 ` [PATCH 0/4] ext3 block reservation patch set Mary Edie Meredith
5 siblings, 0 replies; 28+ messages in thread
From: Mingming Cao @ 2004-04-14 1:00 UTC (permalink / raw)
To: Mingming Cao; +Cc: Andrew Morton, tytso, pbadari, linux-kernel, ext2-devel
[-- Attachment #1: Type: text/plain, Size: 518 bytes --]
> [patch 4]ext3_rsv_dw.patch: adjust the reservation window size
> dynamically:
> Start from the default reservation window size; if the hit ratio of
> the reservation window is more than 50%, we will double the reservation
> window size next time, up to a certain upper limit.
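The growth rule described above can be sketched as a small pure function. This is only an illustration: the function name next_goal_size() is hypothetical, the hit test mirrors the condition used in the balloc.c hunk of this patch, and clamping to EXT3_MAX_RESERVE_BLOCKS (the constant from patch 3) is an assumption about what "a certain upper limit" means.

```c
#define EXT3_DEFAULT_RESERVE_BLOCKS 8
#define EXT3_MAX_RESERVE_BLOCKS 1024

/*
 * Hypothetical helper: given the current goal size, the length of the
 * window being discarded and the number of blocks allocated from it,
 * return the goal size to use for the next window.
 */
static unsigned int next_goal_size(unsigned int goal,
				   unsigned int window_len,
				   unsigned int hits)
{
	/* More than half of the old window was used: double the goal. */
	if (hits > window_len / 2) {
		goal *= 2;
		/* Assumed upper limit; see the lead-in above. */
		if (goal > EXT3_MAX_RESERVE_BLOCKS)
			goal = EXT3_MAX_RESERVE_BLOCKS;
	}
	return goal;
}
```

Starting from the default of 8 blocks, a file that keeps using more than half of each window would see goals of 16, 32, 64 and so on, until the assumed cap of 1024 is reached.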
diffstat ext3_rsv_dw.patch
fs/ext3/balloc.c | 13 ++++++++++++-
fs/ext3/ialloc.c | 3 ++-
fs/ext3/super.c | 1 +
include/linux/ext3_fs_i.h | 1 +
4 files changed, 16 insertions(+), 2 deletions(-)
[-- Attachment #2: ext3_rsv_dw.patch --]
[-- Type: text/x-patch, Size: 2912 bytes --]
diff -urNp 264-rsv-cleanup-base-mount/fs/ext3/balloc.c 264-rsv-cleanup-base-mount-dw/fs/ext3/balloc.c
--- 264-rsv-cleanup-base-mount/fs/ext3/balloc.c 2004-04-13 01:44:50.091477008 -0700
+++ 264-rsv-cleanup-base-mount-dw/fs/ext3/balloc.c 2004-04-12 22:15:12.128618008 -0700
@@ -148,6 +148,7 @@ static inline void rsv_window_remove(str
{
rsv->rsv_start = 0;
rsv->rsv_end = 0;
+ rsv->rsv_alloc_hit = 0;
list_del(&rsv->rsv_list);
INIT_LIST_HEAD(&rsv->rsv_list);
}
@@ -548,7 +549,8 @@ repeat:
goto fail_access;
goto repeat;
}
-
+ if (my_rsv)
+ my_rsv->rsv_alloc_hit++;
return goal;
fail_access:
return -1;
@@ -731,6 +733,15 @@ static int alloc_new_reservation(struct
start_block = my_rsv->rsv_end + 1;
search_head = list_entry(my_rsv->rsv_list.prev,
struct reserve_window, rsv_list);
+ if ((my_rsv->rsv_alloc_hit > (my_rsv->rsv_end - my_rsv->rsv_start + 1) / 2)) {
+ /*
+ * if our previous allocation hit ratio is greater than half,
+ * we double the size of the reservation window next time;
+ * otherwise we keep the same size
+ */
+ size = size * 2;
+ atomic_set(&my_rsv->rsv_goal_size, size);
+ }
rsv_window_remove(my_rsv);
}
else {
diff -urNp 264-rsv-cleanup-base-mount/fs/ext3/ialloc.c 264-rsv-cleanup-base-mount-dw/fs/ext3/ialloc.c
--- 264-rsv-cleanup-base-mount/fs/ext3/ialloc.c 2004-04-10 01:19:52.705503744 -0700
+++ 264-rsv-cleanup-base-mount-dw/fs/ext3/ialloc.c 2004-04-12 21:46:19.443026256 -0700
@@ -583,8 +583,9 @@ got:
ei->i_dir_acl = 0;
ei->i_dtime = 0;
ei->i_rsv_window.rsv_start = 0;
- ei->i_rsv_window.rsv_end= 0;
+ ei->i_rsv_window.rsv_end = 0;
atomic_set(&ei->i_rsv_window.rsv_goal_size, EXT3_DEFAULT_RESERVE_BLOCKS);
+ ei->i_rsv_window.rsv_alloc_hit = 0;
INIT_LIST_HEAD(&ei->i_rsv_window.rsv_list);
ei->i_block_group = group;
diff -urNp 264-rsv-cleanup-base-mount/fs/ext3/super.c 264-rsv-cleanup-base-mount-dw/fs/ext3/super.c
--- 264-rsv-cleanup-base-mount/fs/ext3/super.c 2004-04-10 01:19:52.719501616 -0700
+++ 264-rsv-cleanup-base-mount-dw/fs/ext3/super.c 2004-04-12 21:47:00.612767504 -0700
@@ -1312,6 +1312,7 @@ static int ext3_fill_super (struct super
INIT_LIST_HEAD(&sbi->s_rsv_window_head.rsv_list);
sbi->s_rsv_window_head.rsv_start = 0;
sbi->s_rsv_window_head.rsv_end = 0;
+ sbi->s_rsv_window_head.rsv_alloc_hit = 0;
atomic_set(&sbi->s_rsv_window_head.rsv_goal_size, 0);
/*
diff -urNp 264-rsv-cleanup-base-mount/include/linux/ext3_fs_i.h 264-rsv-cleanup-base-mount-dw/include/linux/ext3_fs_i.h
--- 264-rsv-cleanup-base-mount/include/linux/ext3_fs_i.h 2004-04-10 01:19:52.723501008 -0700
+++ 264-rsv-cleanup-base-mount-dw/include/linux/ext3_fs_i.h 2004-04-12 21:43:25.019542656 -0700
@@ -23,6 +23,7 @@ struct reserve_window{
__u32 rsv_start;
__u32 rsv_end;
atomic_t rsv_goal_size;
+ __u32 rsv_alloc_hit;
};
/*
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] ext3 block reservation patch set
2004-04-14 0:52 ` [PATCH 0/4] ext3 block reservation patch set Mingming Cao
` (3 preceding siblings ...)
2004-04-14 1:00 ` [PATCH 4/4] ext3 block reservation patch set -- dynamically increase reservation window Mingming Cao
@ 2004-04-14 2:47 ` Andrew Morton
2004-04-14 16:11 ` Badari Pulavarty
` (2 more replies)
2004-04-27 15:19 ` [PATCH 0/4] ext3 block reservation patch set Mary Edie Meredith
5 siblings, 3 replies; 28+ messages in thread
From: Andrew Morton @ 2004-04-14 2:47 UTC (permalink / raw)
To: Mingming Cao; +Cc: tytso, pbadari, linux-kernel, ext2-devel
Mingming Cao <cmm@us.ibm.com> wrote:
>
> Here is a set of patches which implement the in-memory ext3 block
> reservation (previously called reservation based ext3 preallocation).
Great, thanks. Let's get these in the pipeline.
A few thoughts, from a five-minute read:
- The majority of in-core inodes are not open for writing, and we've
added 24 bytes to the inode just for inodes which are open for writing.
At some stage we should stop aggregating struct reserve_window into the
inode and dynamically allocate it. We can move i_next_alloc_block,
i_next_alloc_goal and possibly other fields in there too.
At which point it has the wrong name ;) Should be `write_state' or
something.
It's not clear when we should free up the write_state. I guess we
could leave it around for the remaining lifetime of the inode - that'd
still be a net win.
Is this something you can look at as a low-priority activity?
- You're performing ext3_discard_reservation() in ext3_release_file().
Note that the file may still have pending allocations at this stage: say,
open a file, map it MAP_SHARED, dirty some pages which lie over file
holes then close the file again.
Later, the VM will come along and write those dirty pages into the
file, at which point allocations need to be performed. But we have no
reservation data and, later, we may have no inode->write_state at all.
What will happen?
- Have you tested and profiled this with a huge number of open files? At
what stage do we get into search complexity problems?
- Why do we discard the file's reservation on every iput()? iput's are
relatively common operations. (see fs/fs-writeback.c)
- What locking protects rsv_alloc_hit? i_sem is not held during
VM-initiated writeout. Maybe an atomic_t there, or just say that if we
race and the number is a bit inaccurate, we don't care?
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] ext3 block reservation patch set
2004-04-14 2:47 ` [PATCH 0/4] ext3 block reservation patch set Andrew Morton
@ 2004-04-14 16:11 ` Badari Pulavarty
2004-04-14 17:44 ` Mingming Cao
2004-04-14 23:02 ` Andrew Morton
2004-04-14 16:42 ` Badari Pulavarty
2004-04-14 17:30 ` Mingming Cao
2 siblings, 2 replies; 28+ messages in thread
From: Badari Pulavarty @ 2004-04-14 16:11 UTC (permalink / raw)
To: Andrew Morton, Mingming Cao; +Cc: tytso, linux-kernel, ext2-devel
On Tuesday 13 April 2004 07:47 pm, Andrew Morton wrote:
> Mingming Cao <cmm@us.ibm.com> wrote:
> > Here is a set of patches which implement the in-memory ext3 block
> > reservation (previously called reservation based ext3 preallocation).
>
> Great, thanks. Let's get these in the pipeline.
>
> A few thoughts, from a five-minute read:
>
>
> - The majority of in-core inodes are not open for writing, and we've
> added 24 bytes to the inode just for inodes which are open for writing.
>
> At some stage we should stop aggregating struct reserve_window into the
> inode and dynamically allocate it. We can move i_next_alloc_block,
> i_next_alloc_goal and possibly other fields in there too.
>
> At which point it has the wrong name ;) Should be `write_state' or
> something.
>
> It's not clear when we should free up the write_state. I guess we
> could leave it around for the remaining lifetime of the inode - that'd
> still be a net win.
>
> Is this something you can look at as a low-priority activity?
Good point! We will certainly look at it.
>
> - You're performing ext3_discard_reservation() in ext3_release_file().
> Note that the file may still have pending allocations at this stage: say,
> open a file, map it MAP_SHARED, dirty some pages which lie over file
> holes then close the file again.
..
> - Why do we discard the file's reservation on every iput()? iput's are
> relatively common operations. (see fs/fs-writeback.c)
We just followed the old prealloc code: wherever preallocation is dropped,
we drop the reservation. Maybe that's overkill. We will look at it.
What's the best place to drop the reservation?
> - Have you tested and profiled this with a huge number of open files? At
> what stage do we get into search complexity problems?
It is on our TODO list. But our original thought was that we only have to search the
current block group's reservations to get a window. So if we have lots and lots
of reservations in a single block group, the search gets expensive. We were
thinking of adding (dummy) anchors in the list to represent the beginning of each
block group, so that we can get to the start of a block group quickly. But
so far, we haven't done anything.
We are also looking at an RB tree to see how we can make use of it. Our problem
is that we are interested in finding a big enough hole in the tree to hold our
reservation. We need to look closely.
> - What locking protects rsv_alloc_hit? i_sem is not held during
> VM-initiated writeout. Maybe an atomic_t there, or just say that if we
> race and the number is a bit inaccurate, we don't care?
We need to at least change it to atomic_t.
Mingming, I don't see any check to enforce the maximum. Am I missing something?
We really appreciate your comments.
Thanks,
Badari
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] ext3 block reservation patch set
2004-04-14 2:47 ` [PATCH 0/4] ext3 block reservation patch set Andrew Morton
2004-04-14 16:11 ` Badari Pulavarty
@ 2004-04-14 16:42 ` Badari Pulavarty
2004-04-14 17:30 ` Mingming Cao
2 siblings, 0 replies; 28+ messages in thread
From: Badari Pulavarty @ 2004-04-14 16:42 UTC (permalink / raw)
To: Andrew Morton, Mingming Cao; +Cc: tytso, linux-kernel, ext2-devel
On Tuesday 13 April 2004 07:47 pm, Andrew Morton wrote:
> - You're performing ext3_discard_reservation() in ext3_release_file().
> Note that the file may still have pending allocations at this stage: say,
> open a file, map it MAP_SHARED, dirty some pages which lie over file
> holes then close the file again.
>
> Later, the VM will come along and write those dirty pages into the
> file, at which point allocations need to be performed. But we have no
> reservation data and, later, we may have no inode->write_state at all.
>
> What will happen?
Do block allocations happen after ext3_release_file()? In that case,
we would have dropped all our reservations at the time of the last file close.
But if allocations happen later, the current code will create a new reservation
window and start allocating from there.
> - Have you tested and profiled this with a huge number of open files? At
> what stage do we get into search complexity problems?
Come to think of it, the current code has a pretty bad search algorithm. We need
to fix that. We hold the spinlock for the entire search; that's why our CPU
utilization is pretty high.
Thanks,
Badari
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] ext3 block reservation patch set
2004-04-14 2:47 ` [PATCH 0/4] ext3 block reservation patch set Andrew Morton
2004-04-14 16:11 ` Badari Pulavarty
2004-04-14 16:42 ` Badari Pulavarty
@ 2004-04-14 17:30 ` Mingming Cao
2004-04-14 23:07 ` Andrew Morton
2 siblings, 1 reply; 28+ messages in thread
From: Mingming Cao @ 2004-04-14 17:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: tytso, pbadari, linux-kernel, ext2-devel
On Tue, 2004-04-13 at 19:47, Andrew Morton wrote:
> Mingming Cao <cmm@us.ibm.com> wrote:
> >
> > Here is a set of patches which implement the in-memory ext3 block
> > reservation (previously called reservation based ext3 preallocation).
>
> Great, thanks. Let's get these in the pipeline.
>
> A few thoughts, from a five-minute read:
>
>
> - The majority of in-core inodes are not open for writing, and we've
> added 24 bytes to the inode just for inodes which are open for writing.
Yes, the structure gets bigger as we add more stuff to it. It
may not be worth putting it inside the ext3_inode_info structure just for
files open for writing... I agree!
>
> At some stage we should stop aggregating struct reserve_window into the
> inode and dynamically allocate it. We can move i_next_alloc_block,
> i_next_alloc_goal and possibly other fields in there too.
>
> At which point it has the wrong name ;) Should be `write_state' or
> something.
>
> It's not clear when we should free up the write_state. I guess we
> could leave it around for the remaining lifetime of the inode - that'd
> still be a net win.
We could free up the write_state at the time of ext3_discard_reservation()
(not at the time when we allocate a new reservation window),
or later: if we preserve the reservation for slow-growing files, we release
the write_state at the time the inode is released.
> Is this something you can look at as a low-priority activity?
>
Sure!
> - You're performing ext3_discard_reservation() in ext3_release_file().
> Note that the file may still have pending allocations at this stage: say,
> open a file, map it MAP_SHARED, dirty some pages which lie over file
> holes then close the file again.
>
> Later, the VM will come along and write those dirty pages into the
> file, at which point allocations need to be performed. But we have no
> reservation data and, later, we may have no inode->write_state at all.
>
> What will happen?
>
In this case, we will allocate a new reservation window for it.
Nothing bad will happen. We probably just waste a previously allocated
reservation window... but I am not sure.
My question is: if the file is opened for the first time, mapped, and we dirty
pages in the file hole, will there really be any disk block allocation
involved? If not, we do not have a reservation window at all,
and ext3_discard_reservation will detect that and do nothing.
> - Have you tested and profiled this with a huge number of open files? At
> what stage do we get into search complexity problems?
>
Not yet. The current search complexity is O(n). If you don't have a
reservation, you need O(n) to move the search head to the place where
you want to search for a new reservation. Finding the hole size between
two reservation windows is just O(1) for a sorted doubly linked list, and we
need O(n) to look for a reservable window after that, so the complexity is:
O(n) + O(1) * O(n) = O(n);
if you already have an old reservation, we remember where to start
the search, so the complexity is O(1) + O(n).
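To make the O(n) argument concrete, here is a minimal sketch of a linear scan for a reservable hole. It is not the code from the patch: it uses a sorted array instead of the per-filesystem linked list, ignores the block bitmap entirely, and the names rsv_window, find_hole() and RSV_NO_HOLE are made up for the example.

```c
#include <stddef.h>

#define RSV_NO_HOLE (~0UL)

struct rsv_window {
	unsigned long start, end;	/* inclusive block range */
};

/*
 * Scan windows (sorted by start, non-overlapping) for the first position
 * at or after goal where size blocks fit without touching any window.
 * Each window is visited at most once and each gap test is O(1), so the
 * whole search is O(n). Returns RSV_NO_HOLE if nothing fits up to
 * last_block.
 */
static unsigned long find_hole(const struct rsv_window *w, size_t n,
			       unsigned long goal, unsigned long size,
			       unsigned long last_block)
{
	unsigned long cur = goal;
	size_t i;

	for (i = 0; i < n; i++) {
		if (w[i].end < cur)
			continue;	/* window lies entirely before cur */
		if (w[i].start > cur && w[i].start - cur >= size)
			return cur;	/* gap in front of this window fits */
		cur = w[i].end + 1;	/* skip past this window */
	}
	/* Check the space between the last window and the end of the range. */
	if (cur <= last_block && last_block - cur + 1 >= size)
		return cur;
	return RSV_NO_HOLE;
}
```

The leading `continue` branch corresponds to the O(n) cost of moving the search head to the goal; a caller that already owns a window could skip it by starting at its own position in the list.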
The current implementation is worse than O(n): every time it does not
have a reservation window, it searches from the head of the per-filesystem
reservation window list. If it fails within the group, it
moves to the next group and starts the search from the head of the list
again.
This could be fixed by forgetting about the block group boundary
altogether (removing the for loop in ext3_new_block) and making it search
for a block filesystem-wide :)
I have a concern about the red-black tree: it takes O(log(n)) to get to where you
want to start, but it also takes O(log(n)) comparisons to find the hole
size between two windows next to each other. And to find a reservable
window, we need to browse the whole red-black tree in the worst case, so
the complexity is
O(log(n)) + O(log(n)) * O(n) = O(n) * O(log(n))
Am I right?
> - Why do we discard the file's reservation on every iput()? iput's are
> relatively common operations. (see fs/fs-writeback.c)
>
Yes, you are right! I intended to call ext3_discard_reservation only
when the usage count of the inode is 0. I looked at the ext2 preallocation
code; it calls ext2_discard_preallocation in ext2_put_inode(), so I
thought that was the place. But it seems ext3_put_inode() is called
every time iput() is called. We should call ext3_discard_reservation in
iput_final(). This should be fixed in ext2 too.
> - What locking protects rsv_alloc_hit? i_sem is not held during
> VM-initiated writeout. Maybe an atomic_t there, or just say that if we
> race and the number is a bit inaccurate, we don't care?
>
Currently no lock protects rsv_alloc_hit. The reason is that it is just a
heuristic indicator of whether we should enlarge the reservation window
size next time. Even the hit ratio (50%) is just a rough guess, so being a
little inaccurate would not hurt much; adding another lock is probably
not worth it.
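For comparison, the atomic_t alternative Andrew mentioned needs no lock either. The sketch below shows the idea in user-space C11, with <stdatomic.h> standing in for the kernel's atomic_t; the struct and function names are hypothetical, and relaxed ordering is chosen on the assumption that the counter is only a statistic.

```c
#include <stdatomic.h>

/* User-space stand-in for a reserve_window with an atomic hit counter. */
struct rsv_stats {
	atomic_uint alloc_hit;
};

/* Called on every successful allocation from the window. */
static void rsv_record_hit(struct rsv_stats *s)
{
	atomic_fetch_add_explicit(&s->alloc_hit, 1, memory_order_relaxed);
}

/*
 * Called when the window is discarded: fetch the hit count and reset it
 * in one step, so no concurrent increment is lost in between.
 */
static unsigned int rsv_read_and_reset_hits(struct rsv_stats *s)
{
	return atomic_exchange_explicit(&s->alloc_hit, 0, memory_order_relaxed);
}
```

Whether the extra atomic operation on the allocation path is worth it is exactly the trade-off under discussion; for a heuristic, a plain racy counter may be acceptable.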
Thanks,
Mingming
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] ext3 block reservation patch set
2004-04-14 16:11 ` Badari Pulavarty
@ 2004-04-14 17:44 ` Mingming Cao
2004-04-14 23:02 ` Andrew Morton
1 sibling, 0 replies; 28+ messages in thread
From: Mingming Cao @ 2004-04-14 17:44 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: Andrew Morton, tytso, linux-kernel, ext2-devel
On Wed, 2004-04-14 at 09:11, Badari Pulavarty wrote:
> > - What locking protects rsv_alloc_hit? i_sem is not held during
> > VM-initiated writeout. Maybe an atomic_t there, or just say that if we
> > race and the number is a bit inaccurate, we don't care?
>
> Mingming, I don't see any check to force maximum. Am I missing something ?
Nice catch! I forgot to check the maximum when growing the window
size... fixed it. :)
Thanks.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] ext3 block reservation patch set
2004-04-14 16:11 ` Badari Pulavarty
2004-04-14 17:44 ` Mingming Cao
@ 2004-04-14 23:02 ` Andrew Morton
2004-04-14 23:12 ` Badari Pulavarty
1 sibling, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2004-04-14 23:02 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: cmm, tytso, linux-kernel, ext2-devel
Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> > - Why do we discard the file's reservation on every iput()? iput's are
> > relatively common operations. (see fs/fs-writeback.c)
>
> We just followed the old prealloc code: wherever preallocation is dropped,
> we drop the reservation. Maybe that's overkill. We will look at it.
>
> What's the best place to drop the reservation?
You know, I wish I had an easy answer to that, but I don't. It's a matter
of sticking a printk in there, running careful tests, making sure that
we're doing the right thing at the right time.
As we discussed earlier, it could be that in some situations we should
hold onto the reservation window after the file has been closed - the
slowly-growing mbox or logfile problem. But without causing bandwidth
regressions in the many-small-files scenario.
> > - Have you tested and profiled this with a huge number of open files? At
> > what stage do we get into search complexity problems?
>
> It is on our TODO list. But our original thought was that we only have to search the
> current block group's reservations to get a window. So if we have lots and lots
> of reservations in a single block group, the search gets expensive. We were
> thinking of adding (dummy) anchors in the list to represent the beginning of each
> block group, so that we can get to the start of a block group quickly. But
> so far, we haven't done anything.
hm, I need to look at the new code more closely. I was hoping that we
could divorce the reservation windows from any knowledge of blockgroups.
Is that not the case?
> We are also looking at an RB tree to see how we can make use of it. Our problem
> is that we are interested in finding a big enough hole in the tree to hold our
> reservation. We need to look closely.
This sounds awfully like get_unmapped_area().
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] ext3 block reservation patch set
2004-04-14 17:30 ` Mingming Cao
@ 2004-04-14 23:07 ` Andrew Morton
2004-04-14 23:42 ` Mingming Cao
2004-04-21 23:34 ` [PATCH] Lazy discard ext3 reservation window patch Mingming Cao
0 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2004-04-14 23:07 UTC (permalink / raw)
To: Mingming Cao; +Cc: tytso, pbadari, linux-kernel, ext2-devel
Mingming Cao <cmm@us.ibm.com> wrote:
>
> > It's not clear when we should free up the write_state. I guess we
> > could leave it around for the remaining lifetime of the inode - that'd
> > still be a net win.
> We could free up the write_state at the time of ext3_discard_reservation()
> (not at the time when we allocate a new reservation window),
>
> or later: if we preserve the reservation for slow-growing files, we release
> the write_state at the time the inode is released.
That sounds appropriate.
> > - You're performing ext3_discard_reservation() in ext3_release_file().
> > Note that the file may still have pending allocations at this stage: say,
> > open a file, map it MAP_SHARED, dirty some pages which lie over file
> > holes then close the file again.
> >
> > Later, the VM will come along and write those dirty pages into the
> > file, at which point allocations need to be performed. But we have no
> > reservation data and, later, we may have no inode->write_state at all.
> >
> > What will happen?
> >
> In this case, we will allocate a new reservation window for it.
> Nothing bad will happen. We probably just waste a previously allocated
> reservation window... but I am not sure.
>
> My question is: if the file is opened for the first time, mapped, and we dirty
> pages in the file hole, will there really be any disk block allocation
> involved?
There might be, and there might not be. It depends on timing, memory
pressure, application activity, etc.
> The current implementation is more than O(n): every time an inode does
> not have a reservation window, it searches from the head of the
> per-filesystem reservation window list. If it fails within the group, it
> moves to the next group and starts the search from the head of the list
> again.
Same problem exists in arch_get_unmapped_area(). We have a funny little
heuristic (free_area_cache) in there to speed up the common case.
> This could be fixed by forgetting about the block group boundary
> altogether (remove the for loop in ext3_new_block) and making it search
> for a block filesystem-wide :)
I do think we should do this. Does it have any disadvantages?
> I have a concern about the red-black tree: it takes O(log(n)) to get to
> where you want to start, but it also takes O(log(n)) comparisons to find
> the hole size between two neighbouring windows. And to find a reservable
> window, we need to walk the whole red-black tree in the worst case, so
> the complexity is
> O(log(n)) + O(log(n)) * O(n) = O(n) * O(log(n))
>
> Am I right?
Think so. rbtrees are optimised for lookup, not for
get-me-a-suitably-sized-hole.
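To make the cost of the hole search concrete, here is a minimal userspace sketch, not taken from the patch, of a first-fit search over reservation windows kept sorted by start block. The names struct rsv_window and find_hole() are hypothetical, and a sorted array stands in for the per-filesystem list. One left-to-right pass finds the first gap of the requested size, so the walk is linear in the number of windows regardless of how the starting point was located.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified reservation window: inclusive block range. */
struct rsv_window {
    unsigned long start;
    unsigned long end;
};

/*
 * First-fit search for a hole of at least `size` blocks inside
 * [goal, last_block]. `w` must be sorted by start and non-overlapping.
 * Returns the first block of the hole, or -1 if no hole is big enough.
 */
long find_hole(const struct rsv_window *w, size_t n,
               unsigned long goal, unsigned long size,
               unsigned long last_block)
{
    unsigned long cur = goal;   /* candidate start of the hole */
    size_t i;

    assert(size > 0);

    for (i = 0; i < n; i++) {
        if (w[i].end < cur)
            continue;           /* window lies entirely before the candidate */
        if (w[i].start > cur && w[i].start - cur >= size)
            return (long)cur;   /* the gap before this window is big enough */
        cur = w[i].end + 1;     /* skip past this window and keep looking */
    }

    /* Tail: space between the last window and the end of the range. */
    if (cur <= last_block && last_block - cur + 1 >= size)
        return (long)cur;
    return -1;
}
```

With windows at blocks 10-19, 30-39 and 45-100, a request for 10 blocks starting at goal 5 skips the 5-block gap before the first window and lands at block 20.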
> > - Why do we discard the file's reservation on every iput()? iput's are
> > relatively common operations. (see fs/fs-writeback.c)
> >
> Yes..you are right! I intended to call ext3_discard_allocation only
> when the usage count of the inode is 0. I looked at the ext2 preallocation
> code; it calls ext2_discard_preallocation in ext2_put_inode(), so I
> thought that was the place. But it seems ext3_put_inode() is called
> every time iput() is called. We should call ext3_discard_reservation in
> iput_final(). We should fix this in ext2 as well.
Could be so.
> > - What locking protects rsv_alloc_hit? i_sem is not held during
> > VM-initiated writeout. Maybe an atomic_t there, or just say that if we
> > race and the number is a bit inaccurate, we don't care?
> >
> Currently no lock protects rsv_alloc_hit. The reason is that it is just a
> heuristic indicator of whether we should enlarge the reservation window
> size next time. Even the hit ratio (50%) is just a rough guess, so a
> little inaccuracy would not hurt much; adding another lock is probably
> not worth it.
I'd agree with that.
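For comparison, the "atomic_t" alternative mentioned above could look roughly like the following userspace sketch. It uses C11 <stdatomic.h> rather than the kernel's atomic_t API, and struct rsv_stats, rsv_record_hit() and rsv_read_hits() are hypothetical names, not part of the patch. Relaxed ordering is enough here because the counter only feeds a heuristic and orders no other memory accesses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-inode hit counter (rsv_alloc_hit). */
struct rsv_stats {
    atomic_uint alloc_hit;
};

/* Count one allocation that landed inside the reservation window. */
void rsv_record_hit(struct rsv_stats *s)
{
    assert(s != NULL);
    /* Relaxed: no lost updates, but no ordering with other accesses. */
    atomic_fetch_add_explicit(&s->alloc_hit, 1u, memory_order_relaxed);
}

/* Read the current hit count for the window-growth heuristic. */
unsigned int rsv_read_hits(struct rsv_stats *s)
{
    assert(s != NULL);
    return atomic_load_explicit(&s->alloc_hit, memory_order_relaxed);
}
```

The thread settles on the simpler choice of leaving the counter unprotected; the sketch only shows what the stricter option would cost in code.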
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] ext3 block reservation patch set
2004-04-14 23:02 ` Andrew Morton
@ 2004-04-14 23:12 ` Badari Pulavarty
0 siblings, 0 replies; 28+ messages in thread
From: Badari Pulavarty @ 2004-04-14 23:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: cmm, tytso, linux-kernel, ext2-devel
On Wednesday 14 April 2004 04:02 pm, Andrew Morton wrote:
> > In our TODO list. But our original thought was, we have to search only
> > the current block group reservations to get a window. So, if we have lots
> > & lots of reservations in a single block group - search gets complicated.
> > We were thinking of adding (dummy) anchors in the list to represent
> > beginning of each block group, so that we can get to the start of a block
> > group quickly. But so far, we haven't done anything.
>
> hm, I need to look at the new code more closely. I was hoping that we
> could divorce the reservation windows from any knowledge of blockgroups.
> Is that not the case?
The reservation window code kind of knows the group boundaries. The
reason why we did this was, we want to fit it into existing
ext3_get_newblock() code easily. ext3_get_newblock() operates on each
group and passes a bitmap for each group to work on. The current code
looks for a reservation window in the given group (since we need bitmap to
verify that there is something allocatable in that group).
To make the reservation window ignore groups, we may need to do some major
surgery to ext3_get_newblock().
> > We are also looking at RB tree and see how we can make use of it. Our
> > problem is, we are interested in finding out a big enough hole in the
> > tree to put our reservation. We need to look closely.
> This sounds awfully like get_unmapped_area().
That was the first place I looked. I need to look at it one more time to see
if we can reuse the logic.
Thanks,
Badari
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] ext3 block reservation patch set
2004-04-14 23:07 ` Andrew Morton
@ 2004-04-14 23:42 ` Mingming Cao
2004-04-21 23:34 ` [PATCH] Lazy discard ext3 reservation window patch Mingming Cao
1 sibling, 0 replies; 28+ messages in thread
From: Mingming Cao @ 2004-04-14 23:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: tytso, pbadari, linux-kernel, ext2-devel
On Wed, 2004-04-14 at 16:07, Andrew Morton wrote:
> > The current implementation is more than O(n): every time an inode does
> > not have a reservation window, it searches from the head of the
> > per-filesystem reservation window list. If it fails within the group, it
> > moves to the next group and starts the search from the head of the list
> > again.
>
> Same problem exists in arch_get_unmapped_area(). We have a funny little
> heuristic (free_area_cache) in there to speed up the common case.
Actually, we only hit this more than O(n) case when the file has just been
opened for write (without a reservation window). In the normal case, if we
have an old reservation window, we start the search for the next
reservable hole from the old reservation instead of from the head of the
whole filesystem list.
>
> > This could be fixed by forgetting about the block group boundary
> > altogether (remove the for loop in ext3_new_block) and making it search
> > for a block filesystem-wide :)
>
> I do think we should do this. Does it have any disadvantages?
>
I re-looked at the code today; my concern is that we may end up with big
changes to the existing code. Need to think about it more....
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH] Lazy discard ext3 reservation window patch
2004-04-14 23:07 ` Andrew Morton
2004-04-14 23:42 ` Mingming Cao
@ 2004-04-21 23:34 ` Mingming Cao
1 sibling, 0 replies; 28+ messages in thread
From: Mingming Cao @ 2004-04-21 23:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: tytso, pbadari, linux-kernel, ext2-devel
[-- Attachment #1: Type: text/plain, Size: 1787 bytes --]
Andrew,
This patch contains several changes against the ext3 reservation code in
the 2.6.5-mm6 tree:
Lazy Discard Reservation Window:
This patch tries to do lazy discard: keep the old reservation window
temporarily until we find the new reservation window, and only do the
remove/add if the new reservation window is located somewhere different
from the old one. (The reservation code in the mm6 tree discards the old
one first, then searches for the new one.) Two reasons:
- If ext3_find_goal() does a good job, the reservation windows on the
list should not be very close to each other. So an inode's new
reservation window is likely to be located right after its old one, and
its position in the whole list is unchanged; there is no need to remove
it and then add the new one to the list at the same location. Just
update the start block and end block.
- If we fail to find a new reservation in the goal group and move the
search on to the next group, keeping the old reservation around
temporarily allows us to search the list directly after the old window.
Otherwise we lose track of where we were and have to start from the
beginning of the list. Eventually the old window is discarded when we
find a new one.
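The lazy-discard idea can be sketched in userspace C as follows. This is a simplified illustration under stated assumptions, not the patch's code: struct rsv_node and the rsv_*() helpers are hypothetical stand-ins for struct reserve_window, rsv_window_add() and rsv_window_remove(), and a hand-rolled circular list replaces the kernel's list_head. The point is the final step: if the search says the new window belongs right where the old one already sits (prev is the window itself), only the block range is updated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified window on a circular doubly linked list. */
struct rsv_node {
    struct rsv_node *prev, *next;
    unsigned long start, end;   /* inclusive block range */
};

/* Initialise the list head as an empty circular list. */
void rsv_list_init(struct rsv_node *head)
{
    head->prev = head->next = head;
    head->start = head->end = 0;
}

/* A window with NULL links is not on the list (no reservation yet). */
int rsv_linked(const struct rsv_node *n)
{
    return n->next != NULL;
}

void rsv_insert_after(struct rsv_node *n, struct rsv_node *prev)
{
    n->prev = prev;
    n->next = prev->next;
    prev->next->prev = n;
    prev->next = n;
}

void rsv_unlink(struct rsv_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = NULL;
}

/*
 * Give `my` the range [start, start + size - 1], which the search decided
 * belongs right after `prev`. Returns 1 if the list was relinked, 0 if
 * only the range was updated in place.
 */
int rsv_place(struct rsv_node *my, struct rsv_node *prev,
              unsigned long start, unsigned long size)
{
    int relinked = 0;

    assert(size > 0);
    if (my != prev) {
        if (rsv_linked(my))
            rsv_unlink(my);     /* the old window is dropped only now */
        rsv_insert_after(my, prev);
        relinked = 1;
    }
    my->start = start;
    my->end = start + size - 1;
    return relinked;
}
```

Because the old window stays on the list until rsv_place() runs, a search that has to move on to the next group can keep walking from the old window instead of restarting at the list head.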
Other changes:
- Add a check to enforce the maximum when dynamically increasing the
window size.
- ext3_discard_reservation() should not be called on every iput(). It is
now moved to ext3_delete_inode(), so it is only called on the last
iput() if i_nlink is 0.
- Remove #ifdef EXT3_RESERVATION since we made reservation a mount
option.
- Only allow an application to modify the file's reservation window size
when the fs is mounted with reservation and the operation is performed
on a regular file.
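The window-size rule with the new maximum check can be restated as a small standalone function. This is a userspace sketch: next_goal_size() is a hypothetical name, and the two constants mirror EXT3_DEFAULT_RESERVE_BLOCKS (8) and EXT3_MAX_RESERVE_BLOCKS (1024) from the patch's ext3_fs.h hunk. The goal size doubles only when more than half of the previous window was actually allocated, and is then clamped to the maximum.

```c
#include <assert.h>

#define DEFAULT_RESERVE_BLOCKS 8     /* mirrors EXT3_DEFAULT_RESERVE_BLOCKS */
#define MAX_RESERVE_BLOCKS     1024  /* mirrors EXT3_MAX_RESERVE_BLOCKS */

/*
 * Compute the goal size for the next reservation window.
 * win_start/win_end describe the previous window (inclusive), and
 * alloc_hit is the number of blocks allocated from it.
 */
unsigned long next_goal_size(unsigned long goal_size,
                             unsigned long win_start, unsigned long win_end,
                             unsigned long alloc_hit)
{
    unsigned long win_len;

    assert(win_end >= win_start);
    win_len = win_end - win_start + 1;

    if (alloc_hit > win_len / 2) {
        /* Hit ratio above 50%: double the window, up to the maximum. */
        goal_size *= 2;
        if (goal_size > MAX_RESERVE_BLOCKS)
            goal_size = MAX_RESERVE_BLOCKS;
    }
    return goal_size;
}
```

Starting from the default of 8 blocks, a file that keeps filling its windows reaches the 1024-block cap after seven doublings.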
This patch should apply to 2.6.5-mm6. I have tested it with many dd
tests, untar tests, dbench and tiobench.
Thanks!
Mingming
[-- Attachment #2: ext3_reservation_lazydiscard.patch --]
[-- Type: text/x-patch, Size: 7322 bytes --]
diff -urNP -X dontdiff 265-mm6-regression-fix/fs/ext3/balloc.c 265-mm6-fix2/fs/ext3/balloc.c
--- 265-mm6-regression-fix/fs/ext3/balloc.c 2004-04-21 18:35:21.916666616 -0700
+++ 265-mm6-fix2/fs/ext3/balloc.c 2004-04-21 18:28:47.621608576 -0700
@@ -723,7 +723,7 @@
start_block = goal + group_first_block;
size = atomic_read(&my_rsv->rsv_goal_size);
- /* if we have a old reservation, discard it first */
+ /* if we have a old reservation, start the search from the old rsv */
if (!rsv_is_empty(my_rsv)) {
/*
* if the old reservation is cross group boundary
@@ -745,8 +745,7 @@
/* remember where we are before we discard the old one */
if (my_rsv->rsv_end + 1 > start_block)
start_block = my_rsv->rsv_end + 1;
- search_head = list_entry(my_rsv->rsv_list.prev,
- struct reserve_window, rsv_list);
+ search_head = my_rsv;
if ((my_rsv->rsv_alloc_hit > (my_rsv->rsv_end - my_rsv->rsv_start + 1) / 2)) {
/*
* if we previously allocation hit ration is greater than half
@@ -754,9 +753,10 @@
* otherwise keep the same
*/
size = size * 2;
+ if (size > EXT3_MAX_RESERVE_BLOCKS)
+ size = EXT3_MAX_RESERVE_BLOCKS;
atomic_set(&my_rsv->rsv_goal_size, size);
}
- rsv_window_remove(my_rsv);
}
else {
/*
@@ -823,9 +823,17 @@
found_rsv_window:
/*
* great! the reservable space contains some free blocks.
- * Insert it to the list.
- */
- rsv_window_add(my_rsv, prev_rsv);
+ *
+ * if the search returns that we should add the new
+ * window just next to where the old window, we don't
+ * need to remove the old window first then add it to the
+ * same place, just update the new start and new end.
+ */
+ if (my_rsv != prev_rsv) {
+ if (!rsv_is_empty(my_rsv))
+ rsv_window_remove(my_rsv);
+ rsv_window_add(my_rsv, prev_rsv);
+ }
my_rsv->rsv_start = reservable_space_start;
my_rsv->rsv_end = my_rsv->rsv_start + size - 1;
return 0; /* succeed */
@@ -927,6 +935,10 @@
if (!goal_in_my_reservation(my_rsv, goal, group, sb))
goal = -1;
}
+ if ((my_rsv->rsv_start >= group_first_block + EXT3_BLOCKS_PER_GROUP(sb))
+ || (my_rsv->rsv_end < group_first_block))
+ BUG();
+
ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal,
my_rsv);
if (ret >= 0)
@@ -996,10 +1008,10 @@
sbi = EXT3_SB(sb);
es = EXT3_SB(sb)->s_es;
ext3_debug("goal=%lu.\n", goal);
-#ifdef EXT3_RESERVATION
+
if (test_opt(sb, RESERVATION) && S_ISREG(inode->i_mode))
my_rsv = &EXT3_I(inode)->i_rsv_window;
-#endif
+
free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
root_blocks = le32_to_cpu(es->s_r_blocks_count);
if (free_blocks < root_blocks + 1 && !capable(CAP_SYS_RESOURCE) &&
@@ -1066,7 +1078,6 @@
if (ret_block >= 0)
goto allocated;
}
-#ifdef EXT3_RESERVATION
/*
* We may end up a bogus ealier ENOSPC error due to
* filesystem is "full" of reservations, but
@@ -1075,17 +1086,11 @@
* just do block allocation as without reservations.
*/
if (my_rsv) {
-#ifdef EXT3_RESERVATION_DEBUG
- printk("filesystem is fully reserved. Actual free blocks: %d. "
- "Try to do allocation without reservation, goal_group "
- "is %d\n",
- free_blocks, goal_group);
-#endif
my_rsv = NULL;
group_no = goal_group;
goto retry;
}
-#endif
+
/* No space left on the device */
*errp = -ENOSPC;
goto out;
diff -urNP -X dontdiff 265-mm6-regression-fix/fs/ext3/inode.c 265-mm6-fix2/fs/ext3/inode.c
--- 265-mm6-regression-fix/fs/ext3/inode.c 2004-04-21 18:34:45.393219024 -0700
+++ 265-mm6-fix2/fs/ext3/inode.c 2004-04-21 18:21:35.747263456 -0700
@@ -177,19 +177,6 @@
}
/*
- * Called at each iput()
- *
- * The inode may be "bad" if ext3_read_inode() saw an error from
- * ext3_get_inode(), so we need to check that to avoid freeing random disk
- * blocks.
- */
-void ext3_put_inode(struct inode *inode)
-{
- if (!is_bad_inode(inode))
- ext3_discard_reservation(inode);
-}
-
-/*
* Called at the last iput() if i_nlink is zero.
*/
void ext3_delete_inode (struct inode * inode)
@@ -199,6 +186,9 @@
if (is_bad_inode(inode))
goto no_delete;
+ /* discard the block reservation */
+ ext3_discard_reservation(inode);
+
handle = start_transaction(inode);
if (IS_ERR(handle)) {
/* If we're going to skip the normal cleanup, we still
diff -urNP -X dontdiff 265-mm6-regression-fix/fs/ext3/ioctl.c 265-mm6-fix2/fs/ext3/ioctl.c
--- 265-mm6-regression-fix/fs/ext3/ioctl.c 2004-04-21 18:34:45.389219632 -0700
+++ 265-mm6-fix2/fs/ext3/ioctl.c 2004-04-21 18:22:28.196289992 -0700
@@ -152,11 +152,16 @@
return ret;
}
#endif
-#ifdef EXT3_RESERVATION
case EXT3_IOC_GETRSVSZ:
- rsv_window_size = atomic_read(&ei->i_rsv_window.rsv_goal_size);
- return put_user(rsv_window_size, (int *)arg);
+ if (test_opt(inode->i_sb, RESERVATION) && S_ISREG(inode->i_mode)) {
+ rsv_window_size = atomic_read(&ei->i_rsv_window.rsv_goal_size);
+ return put_user(rsv_window_size, (int *)arg);
+ }
+ return -ENOTTY;
case EXT3_IOC_SETRSVSZ:
+ if (!test_opt(inode->i_sb, RESERVATION) ||!S_ISREG(inode->i_mode))
+ return -ENOTTY;
+
if (IS_RDONLY(inode))
return -EROFS;
@@ -170,7 +175,6 @@
rsv_window_size = EXT3_MAX_RESERVE_BLOCKS;
atomic_set(&ei->i_rsv_window.rsv_goal_size, rsv_window_size);
return 0;
-#endif
default:
return -ENOTTY;
}
diff -urNP -X dontdiff 265-mm6-regression-fix/fs/ext3/super.c 265-mm6-fix2/fs/ext3/super.c
--- 265-mm6-regression-fix/fs/ext3/super.c 2004-04-21 18:34:45.394218872 -0700
+++ 265-mm6-fix2/fs/ext3/super.c 2004-04-21 18:21:35.755262240 -0700
@@ -551,7 +551,6 @@
.read_inode = ext3_read_inode,
.write_inode = ext3_write_inode,
.dirty_inode = ext3_dirty_inode,
- .put_inode = ext3_put_inode,
.delete_inode = ext3_delete_inode,
.put_super = ext3_put_super,
.write_super = ext3_write_super,
@@ -760,19 +759,12 @@
printk("EXT3 (no)acl options not supported\n");
break;
#endif
-#ifdef EXT3_RESERVATION
case Opt_reservation:
set_opt(sbi->s_mount_opt, RESERVATION);
break;
case Opt_noreservation:
clear_opt(sbi->s_mount_opt, RESERVATION);
break;
-#else
- case Opt_reservation:
- case Opt_noreservation:
- printk("EXT3 block reservation options not supported\n");
- break;
-#endif
case Opt_journal_update:
/* @@@ FIXME */
/* Eventually we will want to be able to create
diff -urNP -X dontdiff 265-mm6-regression-fix/include/linux/ext3_fs.h 265-mm6-fix2/include/linux/ext3_fs.h
--- 265-mm6-regression-fix/include/linux/ext3_fs.h 2004-04-21 18:34:43.546499768 -0700
+++ 265-mm6-fix2/include/linux/ext3_fs.h 2004-04-21 18:21:35.769260112 -0700
@@ -35,7 +35,6 @@
/*
* Define EXT3_RESERVATION to reserve data blocks for expanding files
*/
-#define EXT3_RESERVATION
#define EXT3_DEFAULT_RESERVE_BLOCKS 8
#define EXT3_MAX_RESERVE_BLOCKS 1024
/*
@@ -208,10 +207,8 @@
#ifdef CONFIG_JBD_DEBUG
#define EXT3_IOC_WAIT_FOR_READONLY _IOR('f', 99, long)
#endif
-#ifdef EXT3_RESERVATION
#define EXT3_IOC_GETRSVSZ _IOR('r', 1, long)
#define EXT3_IOC_SETRSVSZ _IOW('r', 2, long)
-#endif
/*
* Structure of an inode on the disk
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/4] ext3 block reservation patch set
2004-04-14 0:52 ` [PATCH 0/4] ext3 block reservation patch set Mingming Cao
` (4 preceding siblings ...)
2004-04-14 2:47 ` [PATCH 0/4] ext3 block reservation patch set Andrew Morton
@ 2004-04-27 15:19 ` Mary Edie Meredith
5 siblings, 0 replies; 28+ messages in thread
From: Mary Edie Meredith @ 2004-04-27 15:19 UTC (permalink / raw)
To: linux-kernel
To test the benefit of Mingming's ext3 reservation
patchset, we ran tiobench on 2-way systems on STP
using 2.6.6-rc2-mm1 versus 2.6.6-rc2-mm1 patched to
force the ext3 file system to be built without
reservation.
The results show increased throughput for >1
threads not only for sequential write, but also
for random write, sequential read, and random read.
Latency is also decreased for all cases.
Raw data can be found:
-2 way 2.6.6-rc2-mm1
http://khack.osdl.org/stp/292223/results/tiobench-ext3.txt
-2 way 2.6.6-rc2-mm1 noreservation default
http://khack.osdl.org/stp/292225/results/tiobench-ext3.txt
Judith compared the two runs by plotting the
results at: http://developer.osdl.org/judith/tiobench/ext3-reserve/
Here are some interesting ones:
Thruput results:
-Random write thruput 128k
http://developer.osdl.org/judith/tiobench/ext3-reserve/through.ext3.2CPU.RW.128.png
-Random write thruput 4k
http://developer.osdl.org/judith/tiobench/ext3-reserve/through.ext3.2CPU.RW.4.png
-Sequential write thruput 4k
http://developer.osdl.org/judith/tiobench/ext3-reserve/through.ext3.2CPU.SW.4.png
-Sequential write thruput 128k
http://developer.osdl.org/judith/tiobench/ext3-reserve/through.ext3.2CPU.SW.128.png
Latency is reduced almost across the board.
-Example: Latency figures for Random write 4k:
http://developer.osdl.org/judith/tiobench/ext3-reserve/lat.ext3.2CPU.RW.4.png
Mary Edie Meredith
Open Source Development Labs
503-626-2455 x42
maryedie@hotmail.com
Mingming Cao wrote:
> Hello,
>
> Here is a set of patches which implement the in-memory ext3 block
> reservation (previously called reservation based ext3 preallocation).
>
> [patch 1]ext3_rsv_cleanup.patch: Cleans up the old ext3 preallocation
> code carried from ext2 but turned off.
>
> [patch 2]ext3_rsv_base.patch: Implements the base of in-memory block
> reservation and block allocation from reservation window.
>
> [patch 3]ext3_rsv_mount.patch: Adds features on top of the
> ext3_rsv_base.patch:
> - deal with earlier bogus -ENOSPC error
> - do block reservation only for regular file
> - make the ext3 reservation feature as a mount option:
> new mount option added: reservation
> - A pair of file ioctl commands are added for application to control
> the block reservation window size.
>
> [patch 4]ext3_rsv_dw.patch: adjust the reservation window size
> dynamically:
> Start from the default reservation window size; if the hit ratio of
> the reservation window is more than 50%, we double the reservation
> window size next time, up to a certain upper limit.
>
> Here are some numbers collected on dbench on 8 way PIII 700Mhz:
>
> dbench average throughputs on 4 runs
> ==================================================
> Threads    ext3    ext3+rsv(8)    ext3+rsv+dw
> 1          103     104(0%)        105(1%)
> 4          144     286(98%)       256(77%)
> 8          118     197(66%)       210(77%)
> 16         113     160(41%)       177(56%)
> 32         61      123(101%)      150(145%)
> 64         41      82(100%)       85(107%)
>
> And some numbers on tiobench sequential write:
>
> tiobench Sequential Writes throughputs (improvements)
> =====================================================================
> Threads    ext2    ext3    ext3+rsv(8)(%)    ext3+rsv(128)(%)    ext3+rsv+dw(%)
> 1          26      23      25(8%)            26(13%)             26(13%)
> 4          17      4       14(250%)          24(500%)            25(525%)
> 8          15      7       13(85%)           23(228%)            24(242%)
> 16         16      13      12(-7%)           22(69%)             24(84%)
> 32         15      3       12(300%)          23(666%)            23(666%)
> 64         14      1       11(1000%)         22(2100%)           23(2200%)
>
> Note that each time we ran the test on a freshly created ext3 filesystem.
>
> We have also run fsx tests on an 8-way on a 2.6.4 kernel with the patch set
> for a whole weekend on a freshly created ext3 filesystem, as well as on a
> 4-way with the root filesystem as ext3 plus all the changes. Other tests
> include 8-thread dd tests and untarring a kernel source tree.
>
> Besides looking at the performance numbers and verifying the functionality,
> we also checked the block allocation layout for each file generated during
> the test: the blocks for a file are more contiguous with the reservation
> mount option on, especially when we dynamically increase the reservation
> window size in the sequential write cases.
>
> Andrew, is this something that you would consider for the -mm tree?
>
> Thanks again to Andrew, Ted and Badari for their ideas and help on this
> project. I would really appreciate any comments and feedback.
>
>
> Mingming
>
>
^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2004-04-27 15:27 UTC | newest]
Thread overview: 28+ messages
[not found] <200403190846.56955.pbadari@us.ibm.com>
[not found] ` <20040321015746.14b3c0dc.akpm@osdl.org>
2004-03-30 8:55 ` [RFC, PATCH] Reservation based ext3 preallocation Mingming Cao
2004-03-30 9:45 ` Andrew Morton
2004-03-30 17:07 ` Badari Pulavarty
2004-03-30 17:12 ` [Ext2-devel] " Alex Tomas
2004-03-30 18:07 ` Badari Pulavarty
2004-03-30 18:23 ` Mingming Cao
2004-03-30 18:36 ` Andrew Morton
2004-04-03 1:45 ` [Ext2-devel] " Mingming Cao
2004-04-03 1:50 ` Andrew Morton
2004-04-03 2:37 ` Mingming Cao
2004-04-03 2:50 ` Andrew Morton
2004-04-05 16:49 ` Mingming Cao
2004-04-14 0:52 ` [PATCH 0/4] ext3 block reservation patch set Mingming Cao
2004-04-14 0:54 ` [PATCH 1/4] ext3 block reservation patch set -- ext3 preallocation cleanup Mingming Cao
2004-04-14 0:57 ` [PATCH 2/4] ext3 block reservation patch set --ext3 block reservation Mingming Cao
2004-04-14 0:58 ` [PATCH 3/4] ext3 block reservation patch set --mount and ioctl feature Mingming Cao
2004-04-14 1:00 ` [PATCH 4/4] ext3 block reservation patch set -- dynamically increase reservation window Mingming Cao
2004-04-14 2:47 ` [PATCH 0/4] ext3 block reservation patch set Andrew Morton
2004-04-14 16:11 ` Badari Pulavarty
2004-04-14 17:44 ` Mingming Cao
2004-04-14 23:02 ` Andrew Morton
2004-04-14 23:12 ` Badari Pulavarty
2004-04-14 16:42 ` Badari Pulavarty
2004-04-14 17:30 ` Mingming Cao
2004-04-14 23:07 ` Andrew Morton
2004-04-14 23:42 ` Mingming Cao
2004-04-21 23:34 ` [PATCH] Lazy discard ext3 reservation window patch Mingming Cao
2004-04-27 15:19 ` [PATCH 0/4] ext3 block reservation patch set Mary Edie Meredith