* [PATCHSET] Linux 2.4.20-jam0
@ 2002-11-29 23:38 J.A. Magallon
2002-11-30 0:47 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: J.A. Magallon @ 2002-11-29 23:38 UTC (permalink / raw)
To: Lista Linux-Kernel; +Cc: Con Kolivas
Hi all...
New announcement of the -jam patches. While we all wait for a real
-aa1 from Andrea, here goes a -jam0 (i.e., not -jam1), with the -aa
patch ported. It runs fine on my box.
Additions since last release (see README for credits...):
- reverted the fast-pte part of -aa. Still have to try again
to see if it is more stable now.
- force-inline patch.
- 4M queue size for block writes
- P4 prefetching
- Orlov inode allocator for 2.4
- BProc 3.2.3
(btw, I have been looking for Orlov for ext2 - does it exist? -
and htree for ext2/3. Any pointers?)
As always, get it at:
http://giga.cps.unizar.es/~magallon/linux/kernel/2.4.20-jam0.tar.gz
http://giga.cps.unizar.es/~magallon/linux/kernel/2.4.20-jam0/
Enjoy !!
--
J.A. Magallon <jamagallon@able.es> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.1 (Cooker) for i586
Linux 2.4.20-jam0 (gcc 3.2 (Mandrake Linux 9.1 3.2-4mdk))
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET] Linux 2.4.20-jam0
2002-11-29 23:38 [PATCHSET] Linux 2.4.20-jam0 J.A. Magallon
@ 2002-11-30 0:47 ` Andrew Morton
2002-11-30 14:45 ` J.A. Magallon
2002-11-30 6:36 ` hugang
2002-11-30 17:50 ` Andrea Arcangeli
2 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2002-11-30 0:47 UTC (permalink / raw)
To: J.A. Magallon; +Cc: Lista Linux-Kernel, Con Kolivas
"J.A. Magallon" wrote:
>
> - Orlov inode allocator for 2.4
The Orlov allocator in 2.5 has caused a tremendous performance regression
in dbench-on-ext3/ordered-on-scsi.
I don't know why yet - I doubt if it's due to the allocator itself - more
likely an IO scheduling bug in ext3, or a bug in the 2.5 elevator.
There is no such regression on IDE - presumably write caching is covering
up the problem.
So that's something to watch out for.
(Where did your Orlov patch come from? All the tabs are mangled.)
You'll need to port this missing bit, which provides the `oldalloc'
and `orlov' mount options.
fs/ext3/super.c | 4 ++++
1 files changed, 4 insertions(+)
--- 25/fs/ext3/super.c~ext3-oldalloc Fri Nov 29 02:21:20 2002
+++ 25-akpm/fs/ext3/super.c Fri Nov 29 02:22:03 2002
@@ -662,6 +662,10 @@ static int parse_options (char * options
return 0;
sbi->s_resuid = v;
}
+ else if (!strcmp (this_char, "oldalloc"))
+ set_opt (sbi->s_mount_opt, OLDALLOC);
+ else if (!strcmp (this_char, "orlov"))
+ clear_opt (sbi->s_mount_opt, OLDALLOC);
#ifdef CONFIG_JBD_DEBUG
else if (!strcmp (this_char, "ro-after")) {
unsigned long v;
_
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET] Linux 2.4.20-jam0
2002-11-29 23:38 [PATCHSET] Linux 2.4.20-jam0 J.A. Magallon
2002-11-30 0:47 ` Andrew Morton
@ 2002-11-30 6:36 ` hugang
2002-11-30 14:58 ` J.A. Magallon
2002-11-30 17:50 ` Andrea Arcangeli
2 siblings, 1 reply; 12+ messages in thread
From: hugang @ 2002-11-30 6:36 UTC (permalink / raw)
To: J.A. Magallon, Andrew Morton; +Cc: linux-kernel, conman
[-- Attachment #1: Type: text/plain, Size: 208 bytes --]
On Sat, 30 Nov 2002 00:38:07 +0100
"J.A. Magallon" <jamagallon@able.es> wrote:
> - Orlov inode allocator for 2.4
- add Andrew Morton's super.c patch
- change the indentation to the Linux standard.
--
- Hu Gang
[-- Attachment #2: 2.4.20_orlov-indent --]
[-- Type: application/octet-stream, Size: 17851 bytes --]
Index: fs/ext3/ialloc.c
===================================================================
RCS file: /home/hugang/local/cvs/2.4.X/fs/ext3/ialloc.c,v
retrieving revision 1.1.1.6
diff -u -r1.1.1.6 ialloc.c
--- fs/ext3/ialloc.c 29 Nov 2002 06:01:50 -0000 1.1.1.6
+++ fs/ext3/ialloc.c 30 Nov 2002 05:08:04 -0000
@@ -21,6 +21,7 @@
#include <linux/string.h>
#include <linux/locks.h>
#include <linux/quotaops.h>
+#include <linux/random.h>
#include <asm/bitops.h>
#include <asm/byteorder.h>
@@ -262,9 +263,11 @@
if (gdp) {
gdp->bg_free_inodes_count = cpu_to_le16(
le16_to_cpu(gdp->bg_free_inodes_count) + 1);
- if (is_directory)
+ if (is_directory) {
gdp->bg_used_dirs_count = cpu_to_le16(
le16_to_cpu(gdp->bg_used_dirs_count) - 1);
+ EXT3_SB(sb)->s_dir_count--;
+ }
}
BUFFER_TRACE(bh2, "call ext3_journal_dirty_metadata");
err = ext3_journal_dirty_metadata(handle, bh2);
@@ -293,20 +296,228 @@
* the groups with above-average free space, that group with the fewest
* directories already is chosen.
*
+ * For other inodes, search forward from the parent directory's block
+ * group to find a free inode.
+ */
+static int find_group_dir(struct super_block *sb, struct inode *parent)
+{
+ struct ext3_super_block * es = EXT3_SB(sb)->s_es;
+ int ngroups = EXT3_SB(sb)->s_groups_count;
+ int avefreei = le32_to_cpu(es->s_free_inodes_count) / ngroups;
+ struct ext3_group_desc *desc, *best_desc = NULL;
+ struct buffer_head *bh, *best_bh = NULL;
+ int group, best_group = -1;
+
+ for (group = 0; group < ngroups; group++) {
+ desc = ext3_get_group_desc (sb, group, &bh);
+ if (!desc || !desc->bg_free_inodes_count)
+ continue;
+ if (le16_to_cpu(desc->bg_free_inodes_count) < avefreei)
+ continue;
+ if (!best_desc ||
+ (le16_to_cpu(desc->bg_free_blocks_count) >
+ le16_to_cpu(best_desc->bg_free_blocks_count))) {
+ best_group = group;
+ best_desc = desc;
+ best_bh = bh;
+ }
+ }
+ if (!best_desc)
+ return -1;
+ return best_group;
+}
+
+/*
+ * Orlov's allocator for directories.
+ *
+ * We always try to spread first-level directories.
+ *
+ * If there are blockgroups with both free inodes and free blocks counts
+ * not worse than average we return one with smallest directory count.
+ * Otherwise we simply return a random group.
+ *
+ * For the rest rules look so:
+ *
+ * It's OK to put directory into a group unless
+ * it has too many directories already (max_dirs) or
+ * it has too few free inodes left (min_inodes) or
+ * it has too few free blocks left (min_blocks) or
+ * it's already running too large debt (max_debt).
+ * Parent's group is preferred, if it doesn't satisfy these
+ * conditions we search cyclically through the rest. If none
+ * of the groups look good we just look for a group with more
+ * free inodes than average (starting at parent's group).
+ *
+ * Debt is incremented each time we allocate a directory and decremented
+ * when we allocate an inode, within 0--255.
+ */
+
+#define INODE_COST 64
+#define BLOCK_COST 256
+
+static int find_group_orlov(struct super_block *sb, const struct inode *parent)
+{
+ int parent_group = EXT3_I(parent)->i_block_group;
+ struct ext3_sb_info *sbi = EXT3_SB(sb);
+ struct ext3_super_block *es = sbi->s_es;
+ int ngroups = sbi->s_groups_count;
+ int inodes_per_group = EXT3_INODES_PER_GROUP(sb);
+ int avefreei = le32_to_cpu(es->s_free_inodes_count) / ngroups;
+ int avefreeb = le32_to_cpu(es->s_free_blocks_count) / ngroups;
+ int blocks_per_dir;
+ int ndirs = sbi->s_dir_count;
+ int max_debt, max_dirs, min_blocks, min_inodes;
+ int group = -1, i;
+ struct ext3_group_desc *desc;
+ struct buffer_head *bh;
+
+ if ((parent == sb->s_root->d_inode) ||
+ (parent->i_flags & EXT3_TOPDIR_FL)) {
+ struct ext3_group_desc *best_desc = NULL;
+ struct buffer_head *best_bh = NULL;
+ int best_ndir = inodes_per_group;
+ int best_group = -1;
+
+ get_random_bytes(&group, sizeof(group));
+ parent_group = (unsigned)group % ngroups;
+ for (i = 0; i < ngroups; i++) {
+ group = (parent_group + i) % ngroups;
+ desc = ext3_get_group_desc (sb, group, &bh);
+ if (!desc || !desc->bg_free_inodes_count)
+ continue;
+ if (le16_to_cpu(desc->bg_used_dirs_count) >= best_ndir)
+ continue;
+ if (le16_to_cpu(desc->bg_free_inodes_count) < avefreei)
+ continue;
+ if (le16_to_cpu(desc->bg_free_blocks_count) < avefreeb)
+ continue;
+ best_group = group;
+ best_ndir = le16_to_cpu(desc->bg_used_dirs_count);
+ best_desc = desc;
+ best_bh = bh;
+ }
+ if (best_group >= 0) {
+ desc = best_desc;
+ bh = best_bh;
+ group = best_group;
+ goto found;
+ }
+ goto fallback;
+ }
+
+ blocks_per_dir = (le32_to_cpu(es->s_blocks_count) -
+ le32_to_cpu(es->s_free_blocks_count)) / ndirs;
+
+ max_dirs = ndirs / ngroups + inodes_per_group / 16;
+ min_inodes = avefreei - inodes_per_group / 4;
+ min_blocks = avefreeb - EXT3_BLOCKS_PER_GROUP(sb) / 4;
+
+ max_debt = EXT3_BLOCKS_PER_GROUP(sb) / max(blocks_per_dir, BLOCK_COST);
+ if (max_debt * INODE_COST > inodes_per_group)
+ max_debt = inodes_per_group / INODE_COST;
+ if (max_debt > 255)
+ max_debt = 255;
+ if (max_debt == 0)
+ max_debt = 1;
+
+ for (i = 0; i < ngroups; i++) {
+ group = (parent_group + i) % ngroups;
+ desc = ext3_get_group_desc (sb, group, &bh);
+ if (!desc || !desc->bg_free_inodes_count)
+ continue;
+ if (sbi->s_debts[group] >= max_debt)
+ continue;
+ if (le16_to_cpu(desc->bg_used_dirs_count) >= max_dirs)
+ continue;
+ if (le16_to_cpu(desc->bg_free_inodes_count) < min_inodes)
+ continue;
+ if (le16_to_cpu(desc->bg_free_blocks_count) < min_blocks)
+ continue;
+ goto found;
+ }
+
+ fallback:
+ for (i = 0; i < ngroups; i++) {
+ group = (parent_group + i) % ngroups;
+ desc = ext3_get_group_desc (sb, group, &bh);
+ if (!desc || !desc->bg_free_inodes_count)
+ continue;
+ if (le16_to_cpu(desc->bg_free_inodes_count) >= avefreei)
+ goto found;
+ }
+
+ return -1;
+ found:
+ return group;
+}
+
+static int find_group_other(struct super_block *sb, struct inode *parent)
+{
+ int parent_group = EXT3_I(parent)->i_block_group;
+ int ngroups = EXT3_SB(sb)->s_groups_count;
+ struct ext3_group_desc *desc;
+ struct buffer_head *bh;
+ int group, i;
+
+ /*
+ * Try to place the inode in its parent directory
+ */
+ group = parent_group;
+ desc = ext3_get_group_desc (sb, group, &bh);
+ if (desc && le16_to_cpu(desc->bg_free_inodes_count))
+ goto found;
+
+ /*
+ * Use a quadratic hash to find a group with a
+ * free inode
+ */
+ for (i = 1; i < ngroups; i <<= 1) {
+ group += i;
+ if (group >= ngroups)
+ group -= ngroups;
+ desc = ext3_get_group_desc (sb, group, &bh);
+ if (desc && le16_to_cpu(desc->bg_free_inodes_count))
+ goto found;
+ }
+
+ /*
+ * That failed: try linear search for a free inode
+ */
+ group = parent_group + 1;
+ for (i = 2; i < ngroups; i++) {
+ if (++group >= ngroups)
+ group = 0;
+ desc = ext3_get_group_desc (sb, group, &bh);
+ if (desc && le16_to_cpu(desc->bg_free_inodes_count))
+ goto found;
+ }
+
+ return -1;
+
+ found:
+ return group;
+}
+
+/*
+ * There are two policies for allocating an inode. If the new inode is
+ * a directory, then a forward search is made for a block group with both
+ * free space and a low directory-to-inode ratio; if that fails, then of
+ * the groups with above-average free space, that group with the fewest
+ * directories already is chosen.
+ *
* For other inodes, search forward from the parent directory's block
* group to find a free inode.
*/
-struct inode * ext3_new_inode (handle_t *handle,
- const struct inode * dir, int mode)
+struct inode * ext3_new_inode (handle_t *handle, struct inode * dir, int mode)
{
struct super_block * sb;
struct buffer_head * bh;
struct buffer_head * bh2;
- int i, j, avefreei;
- struct inode * inode;
+ int group;
+ ino_t ino;
int bitmap_nr;
+ struct inode * inode;
struct ext3_group_desc * gdp;
- struct ext3_group_desc * tmp;
struct ext3_super_block * es;
int err = 0;
@@ -323,94 +534,36 @@
lock_super (sb);
es = sb->u.ext3_sb.s_es;
repeat:
- gdp = NULL;
- i = 0;
-
if (S_ISDIR(mode)) {
- avefreei = le32_to_cpu(es->s_free_inodes_count) /
- sb->u.ext3_sb.s_groups_count;
- if (!gdp) {
- for (j = 0; j < sb->u.ext3_sb.s_groups_count; j++) {
- struct buffer_head *temp_buffer;
- tmp = ext3_get_group_desc (sb, j, &temp_buffer);
- if (tmp &&
- le16_to_cpu(tmp->bg_free_inodes_count) &&
- le16_to_cpu(tmp->bg_free_inodes_count) >=
- avefreei) {
- if (!gdp || (le16_to_cpu(tmp->bg_free_blocks_count) >
- le16_to_cpu(gdp->bg_free_blocks_count))) {
- i = j;
- gdp = tmp;
- bh2 = temp_buffer;
- }
- }
- }
- }
- } else {
- /*
- * Try to place the inode in its parent directory
- */
- i = dir->u.ext3_i.i_block_group;
- tmp = ext3_get_group_desc (sb, i, &bh2);
- if (tmp && le16_to_cpu(tmp->bg_free_inodes_count))
- gdp = tmp;
+ if (test_opt (sb, OLDALLOC))
+ group = find_group_dir(sb, dir);
else
- {
- /*
- * Use a quadratic hash to find a group with a
- * free inode
- */
- for (j = 1; j < sb->u.ext3_sb.s_groups_count; j <<= 1) {
- i += j;
- if (i >= sb->u.ext3_sb.s_groups_count)
- i -= sb->u.ext3_sb.s_groups_count;
- tmp = ext3_get_group_desc (sb, i, &bh2);
- if (tmp &&
- le16_to_cpu(tmp->bg_free_inodes_count)) {
- gdp = tmp;
- break;
- }
- }
- }
- if (!gdp) {
- /*
- * That failed: try linear search for a free inode
- */
- i = dir->u.ext3_i.i_block_group + 1;
- for (j = 2; j < sb->u.ext3_sb.s_groups_count; j++) {
- if (++i >= sb->u.ext3_sb.s_groups_count)
- i = 0;
- tmp = ext3_get_group_desc (sb, i, &bh2);
- if (tmp &&
- le16_to_cpu(tmp->bg_free_inodes_count)) {
- gdp = tmp;
- break;
- }
- }
- }
- }
+ group = find_group_orlov(sb, dir);
+ } else
+ group = find_group_other(sb, dir);
err = -ENOSPC;
- if (!gdp)
- goto out;
+ if (group == -1)
+ goto fail;
err = -EIO;
- bitmap_nr = load_inode_bitmap (sb, i);
+ bitmap_nr = load_inode_bitmap (sb, group);
if (bitmap_nr < 0)
goto fail;
+ gdp = ext3_get_group_desc (sb, group, &bh2);
bh = sb->u.ext3_sb.s_inode_bitmap[bitmap_nr];
- if ((j = ext3_find_first_zero_bit ((unsigned long *) bh->b_data,
+ if ((ino = ext3_find_first_zero_bit ((unsigned long *) bh->b_data,
EXT3_INODES_PER_GROUP(sb))) <
EXT3_INODES_PER_GROUP(sb)) {
BUFFER_TRACE(bh, "get_write_access");
err = ext3_journal_get_write_access(handle, bh);
if (err) goto fail;
- if (ext3_set_bit (j, bh->b_data)) {
+ if (ext3_set_bit (ino, bh->b_data)) {
ext3_error (sb, "ext3_new_inode",
- "bit already set for inode %d", j);
+ "bit already set for inode %lu", ino);
goto repeat;
}
BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
@@ -420,7 +573,7 @@
if (le16_to_cpu(gdp->bg_free_inodes_count) != 0) {
ext3_error (sb, "ext3_new_inode",
"Free inodes count corrupted in group %d",
- i);
+ group);
/* Is it really ENOSPC? */
err = -ENOSPC;
if (sb->s_flags & MS_RDONLY)
@@ -436,11 +589,11 @@
}
goto repeat;
}
- j += i * EXT3_INODES_PER_GROUP(sb) + 1;
- if (j < EXT3_FIRST_INO(sb) || j > le32_to_cpu(es->s_inodes_count)) {
+ ino += group * EXT3_INODES_PER_GROUP(sb) + 1;
+ if (ino < EXT3_FIRST_INO(sb) || ino > le32_to_cpu(es->s_inodes_count)) {
ext3_error (sb, "ext3_new_inode",
"reserved inode or inode > inodes count - "
- "block_group = %d,inode=%d", i, j);
+ "block_group = %d,inode=%lu", group, ino);
err = -EIO;
goto fail;
}
@@ -450,9 +603,11 @@
if (err) goto fail;
gdp->bg_free_inodes_count =
cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) - 1);
- if (S_ISDIR(mode))
+ if (S_ISDIR(mode)) {
gdp->bg_used_dirs_count =
cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) + 1);
+ sb->u.ext3_sb.s_dir_count++;
+ }
BUFFER_TRACE(bh2, "call ext3_journal_dirty_metadata");
err = ext3_journal_dirty_metadata(handle, bh2);
if (err) goto fail;
@@ -478,7 +633,7 @@
inode->i_gid = current->fsgid;
inode->i_mode = mode;
- inode->i_ino = j;
+ inode->i_ino = ino;
/* This is the optimal IO size (for stat), not the fs block size */
inode->i_blksize = PAGE_SIZE;
inode->i_blocks = 0;
@@ -498,7 +653,7 @@
#ifdef EXT3_PREALLOCATE
inode->u.ext3_i.i_prealloc_count = 0;
#endif
- inode->u.ext3_i.i_block_group = i;
+ inode->u.ext3_i.i_block_group = group;
if (inode->u.ext3_i.i_flags & EXT3_SYNC_FL)
inode->i_flags |= S_SYNC;
@@ -620,6 +775,21 @@
#else
return le32_to_cpu(sb->u.ext3_sb.s_es->s_free_inodes_count);
#endif
+}
+
+/* Called at mount-time, super-block is locked */
+unsigned long ext3_count_dirs (struct super_block * sb)
+{
+ unsigned long count = 0;
+ int i;
+
+ for (i = 0; i < EXT3_SB(sb)->s_groups_count; i++) {
+ struct ext3_group_desc *gdp = ext3_get_group_desc (sb, i, NULL);
+ if (!gdp)
+ continue;
+ count += le16_to_cpu(gdp->bg_used_dirs_count);
+ }
+ return count;
}
#ifdef CONFIG_EXT3_CHECK
Index: fs/ext3/super.c
===================================================================
RCS file: /home/hugang/local/cvs/2.4.X/fs/ext3/super.c,v
retrieving revision 1.1.1.6
diff -u -r1.1.1.6 super.c
--- fs/ext3/super.c 29 Nov 2002 06:01:51 -0000 1.1.1.6
+++ fs/ext3/super.c 30 Nov 2002 05:03:28 -0000
@@ -416,6 +416,7 @@
for (i = 0; i < sbi->s_gdb_count; i++)
brelse(sbi->s_group_desc[i]);
kfree(sbi->s_group_desc);
+ kfree(sbi->s_debts);
for (i = 0; i < EXT3_MAX_GROUP_LOADED; i++)
brelse(sbi->s_inode_bitmap[i]);
for (i = 0; i < EXT3_MAX_GROUP_LOADED; i++)
@@ -582,6 +583,10 @@
if (want_numeric(value, "sb", sb_block))
return 0;
}
+ else if (!strcmp (this_char, "oldalloc"))
+ set_opt (sbi->s_mount_opt, OLDALLOC);
+ else if (!strcmp (this_char, "orlov"))
+ clear_opt (sbi->s_mount_opt, OLDALLOC);
#ifdef CONFIG_JBD_DEBUG
else if (!strcmp (this_char, "ro-after")) {
unsigned long v;
@@ -1098,6 +1103,13 @@
printk (KERN_ERR "EXT3-fs: not enough memory\n");
goto failed_mount;
}
+ sbi->s_debts = kmalloc(sbi->s_groups_count * sizeof(*sbi->s_debts),
+ GFP_KERNEL);
+ if (!sbi->s_debts) {
+ printk (KERN_ERR "EXT3-fs: not enough memory\n");
+ goto failed_mount2;
+ }
+ memset(sbi->s_debts, 0, sbi->s_groups_count * sizeof(*sbi->s_debts));
for (i = 0; i < db_count; i++) {
sbi->s_group_desc[i] = sb_bread(sb, logic_sb_block + i + 1);
if (!sbi->s_group_desc[i]) {
@@ -1120,6 +1132,7 @@
sbi->s_loaded_inode_bitmaps = 0;
sbi->s_loaded_block_bitmaps = 0;
sbi->s_gdb_count = db_count;
+ sbi->s_dir_count = ext3_count_dirs(sb);
get_random_bytes(&sbi->s_next_generation, sizeof(u32));
/*
* set up enough so that it can read an inode
@@ -1223,6 +1236,8 @@
failed_mount3:
journal_destroy(sbi->s_journal);
failed_mount2:
+ if (sbi->s_debts)
+ kfree(sbi->s_debts);
for (i = 0; i < db_count; i++)
brelse(sbi->s_group_desc[i]);
kfree(sbi->s_group_desc);
Index: include/linux/ext3_fs.h
===================================================================
RCS file: /home/hugang/local/cvs/2.4.X/include/linux/ext3_fs.h,v
retrieving revision 1.1.1.6
diff -u -r1.1.1.6 ext3_fs.h
--- include/linux/ext3_fs.h 29 Nov 2002 06:08:22 -0000 1.1.1.6
+++ include/linux/ext3_fs.h 30 Nov 2002 05:03:28 -0000
@@ -203,11 +203,11 @@
#define EXT3_INDEX_FL 0x00001000 /* hash-indexed directory */
#define EXT3_IMAGIC_FL 0x00002000 /* AFS directory */
#define EXT3_JOURNAL_DATA_FL 0x00004000 /* file data should be journaled */
+#define EXT3_TOPDIR_FL 0x00020000 /* Top of directory hierarchies*/
#define EXT3_RESERVED_FL 0x80000000 /* reserved for ext3 lib */
-#define EXT3_FL_USER_VISIBLE 0x00005FFF /* User visible flags */
-#define EXT3_FL_USER_MODIFIABLE 0x000000FF /* User modifiable flags */
-
+#define EXT3_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */
+#define EXT3_FL_USER_MODIFIABLE 0x000380FF /* User modifiable flags */
/*
* Inode dynamic state flags
*/
@@ -325,6 +325,7 @@
* Mount flags
*/
#define EXT3_MOUNT_CHECK 0x0001 /* Do mount-time checks */
+#define EXT3_MOUNT_OLDALLOC 0x0002 /* Don't use the new Orlov allocator */
#define EXT3_MOUNT_GRPID 0x0004 /* Create files with directory's group */
#define EXT3_MOUNT_DEBUG 0x0008 /* Some debugging messages */
#define EXT3_MOUNT_ERRORS_CONT 0x0010 /* Continue on errors */
@@ -607,6 +608,7 @@
extern void ext3_free_blocks (handle_t *, struct inode *, unsigned long,
unsigned long);
extern unsigned long ext3_count_free_blocks (struct super_block *);
+extern unsigned long ext3_count_dirs (struct super_block *);
extern void ext3_check_blocks_bitmap (struct super_block *);
extern struct ext3_group_desc * ext3_get_group_desc(struct super_block * sb,
unsigned int block_group,
@@ -620,7 +622,7 @@
extern int ext3_sync_file (struct file *, struct dentry *, int);
/* ialloc.c */
-extern struct inode * ext3_new_inode (handle_t *, const struct inode *, int);
+extern struct inode * ext3_new_inode (handle_t *, struct inode *, int);
extern void ext3_free_inode (handle_t *, struct inode *);
extern struct inode * ext3_orphan_get (struct super_block *, ino_t);
extern unsigned long ext3_count_free_inodes (struct super_block *);
Index: include/linux/ext3_fs_sb.h
===================================================================
RCS file: /home/hugang/local/cvs/2.4.X/include/linux/ext3_fs_sb.h,v
retrieving revision 1.1.1.5
diff -u -r1.1.1.5 ext3_fs_sb.h
--- include/linux/ext3_fs_sb.h 29 Nov 2002 06:08:22 -0000 1.1.1.5
+++ include/linux/ext3_fs_sb.h 30 Nov 2002 05:03:28 -0000
@@ -62,6 +62,8 @@
int s_inode_size;
int s_first_ino;
u32 s_next_generation;
+ unsigned long s_dir_count;
+ u8 *s_debts;
/* Journaling */
struct inode * s_journal_inode;
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET] Linux 2.4.20-jam0
2002-11-30 0:47 ` Andrew Morton
@ 2002-11-30 14:45 ` J.A. Magallon
2002-11-30 14:58 ` Sean Neakums
2002-11-30 22:15 ` Andrew Morton
0 siblings, 2 replies; 12+ messages in thread
From: J.A. Magallon @ 2002-11-30 14:45 UTC (permalink / raw)
To: Andrew Morton; +Cc: Lista Linux-Kernel
On 2002.11.30 Andrew Morton wrote:
>"J.A. Magallon" wrote:
>>
>> - Orlov inode allocator for 2.4
>
>The Orlov allocator in 2.5 has caused a tremendous performance regression
>in dbench-on-ext3/ordered-on-scsi.
>
>I don't know why yet - I doubt if it's due to the allocator itself - more
>likely an IO scheduling bug in ext3, or a bug in the 2.5 elevator.
>
>There is no such regression on IDE - presumably write caching is covering
>up the problem.
>
Is there any way I can test that? All my drives are SCSI, and I can,
for example, remount with 'orlov' or 'oldalloc'...
>So that's something to watch out for.
>
>(Where did your Orlov patch come from? All the tabs are mangled.)
>
See the other reply to the previous message...
>You'll need to port this missing bit, which provides the `oldalloc'
>and `orlov' mount options.
>
Thanks, I will add it...
BTW, who chooses the option names? Wouldn't it be more intuitive to add
options like 'ialloc_std' or 'ialloc_orlov'? Is it too late to change this?
TIA
--
J.A. Magallon <jamagallon@able.es> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.1 (Cooker) for i586
Linux 2.4.20-jam0 (gcc 3.2 (Mandrake Linux 9.1 3.2-4mdk))
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET] Linux 2.4.20-jam0
2002-11-30 14:45 ` J.A. Magallon
@ 2002-11-30 14:58 ` Sean Neakums
2002-11-30 15:02 ` J.A. Magallon
2002-11-30 22:15 ` Andrew Morton
1 sibling, 1 reply; 12+ messages in thread
From: Sean Neakums @ 2002-11-30 14:58 UTC (permalink / raw)
To: linux-kernel
commence J.A. Magallon quotation:
> Thanks, I will add it...
> BTW, who chooses the option names? Wouldn't it be more intuitive to add
> options like 'ialloc_std' or 'ialloc_orlov'? Is it too late to change this?
There isn't exactly a whole lot of contention in the mount-options
namespace. And neither orlov nor ialloc_orlov is in any way
"intuitive". However, orlov is more guessable, to my mind, than
ialloc_orlov.
--
/ |
[|] Sean Neakums | Questions are a burden to others;
[|] <sneakums@zork.net> | answers a prison for oneself.
\ |
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET] Linux 2.4.20-jam0
2002-11-30 6:36 ` hugang
@ 2002-11-30 14:58 ` J.A. Magallon
2002-11-30 17:07 ` Mike Galbraith
0 siblings, 1 reply; 12+ messages in thread
From: J.A. Magallon @ 2002-11-30 14:58 UTC (permalink / raw)
To: hugang; +Cc: linux-kernel, conman
On 2002.11.30 hugang wrote:
>On Sat, 30 Nov 2002 00:38:07 +0100
>"J.A. Magallon" <jamagallon@able.es> wrote:
>
>> - Orlov inode allocator for 2.4
>
>- add Andrew Morton's super.c patch
>- change the indentation to the Linux standard.
>
Thanks, I will update it.
Just a note for future updates: could you make the patch from /usr/src
instead of /usr/src/linux, so that it can be applied with patch -p1?
That is the standard way...
TIA
--
J.A. Magallon <jamagallon@able.es> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.1 (Cooker) for i586
Linux 2.4.20-jam0 (gcc 3.2 (Mandrake Linux 9.1 3.2-4mdk))
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET] Linux 2.4.20-jam0
2002-11-30 14:58 ` Sean Neakums
@ 2002-11-30 15:02 ` J.A. Magallon
0 siblings, 0 replies; 12+ messages in thread
From: J.A. Magallon @ 2002-11-30 15:02 UTC (permalink / raw)
To: Sean Neakums; +Cc: linux-kernel
On 2002.11.30 Sean Neakums wrote:
>commence J.A. Magallon quotation:
>
>> Thanks, I will add it...
>> BTW, who chooses the option names? Wouldn't it be more intuitive to add
>> options like 'ialloc_std' or 'ialloc_orlov'? Is it too late to change this?
>
>There isn't exactly a whole lot of contention in the mount-options
>namespace. And neither orlov nor ialloc_orlov is in any way
>"intuitive". However, orlov is more guessable, to my mind, than
>ialloc_orlov.
>
Well, what I think would be more understandable when you read /etc/fstab
is something like 'inode_allocator=std', 'inode_allocator=orlov', or
'inode_allocator=xxxxx' if something new appears.
--
J.A. Magallon <jamagallon@able.es> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.1 (Cooker) for i586
Linux 2.4.20-jam0 (gcc 3.2 (Mandrake Linux 9.1 3.2-4mdk))
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET] Linux 2.4.20-jam0
2002-11-30 14:58 ` J.A. Magallon
@ 2002-11-30 17:07 ` Mike Galbraith
0 siblings, 0 replies; 12+ messages in thread
From: Mike Galbraith @ 2002-11-30 17:07 UTC (permalink / raw)
To: J.A. Magallon, hugang; +Cc: linux-kernel, conman
At 03:58 PM 11/30/2002 +0100, J.A. Magallon wrote:
>On 2002.11.30 hugang wrote:
> >On Sat, 30 Nov 2002 00:38:07 +0100
> >"J.A. Magallon" <jamagallon@able.es> wrote:
> >
> >> - Orlov inode allocator for 2.4
> >
> >- add Andrew Morton's super.c patch
> >- change the indentation to the Linux standard.
> >
>
>Thanks, I will update it.
>Just a note for future updates: could you make the patch from /usr/src
>instead of /usr/src/linux, so that it can be applied with patch -p1?
>That is the standard way...
Concur... easiest to read.
-Mike
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET] Linux 2.4.20-jam0
2002-11-29 23:38 [PATCHSET] Linux 2.4.20-jam0 J.A. Magallon
2002-11-30 0:47 ` Andrew Morton
2002-11-30 6:36 ` hugang
@ 2002-11-30 17:50 ` Andrea Arcangeli
2002-11-30 23:36 ` J.A. Magallon
2002-12-01 1:55 ` J.A. Magallon
2 siblings, 2 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2002-11-30 17:50 UTC (permalink / raw)
To: J.A. Magallon; +Cc: Lista Linux-Kernel, Con Kolivas, Srihari Vijayaraghavan
On Sat, Nov 30, 2002 at 12:38:07AM +0100, J.A. Magallon wrote:
> - reverted the fast-pte part of -aa. Still have to try again
> to see if it is more stable now.
AFAIK this was reproduced by Srihari on nohighmem, so it must be that
somebody is calling free_pgd_fast on a pgd that cannot be re-used.
Can you try this patch on top of 2.4.20rc2aa1? (Or on jam0 after backing
out the fast-pte removal, which would otherwise prevent the debugging
check from triggering.)
--- 2.4.20rc2aa1/include/asm-i386/pgalloc.h.~1~ 2002-11-27 10:09:30.000000000 +0100
+++ 2.4.20rc2aa1/include/asm-i386/pgalloc.h 2002-11-30 18:43:29.000000000 +0100
@@ -97,6 +97,20 @@ static inline pgd_t *get_pgd_fast(void)
static inline void free_pgd_fast(pgd_t *pgd)
{
+ {
+ int i;
+ for (i = 0; i < USER_PTRS_PER_PGD; i++)
+ if (pgd_val(pgd[i])) {
+ printk("non zero idx %d\n", i);
+ BUG();
+ }
+ for (i = USER_PTRS_PER_PGD; i < PTRS_PER_PGD - USER_PTRS_PER_PGD -
+ ((-VMALLOC_START + PGDIR_SIZE - 1) >> PGDIR_SHIFT); i++)
+ if (pgd_val(pgd[i]) != pgd_val(swapper_pg_dir[i])) {
+ printk("corrupted idx %d\n", i);
+ BUG();
+ }
+ }
*(unsigned long *)pgd = (unsigned long) pgd_quicklist;
pgd_quicklist = (unsigned long *) pgd;
pgtable_cache_size++;
The stack trace should tell us who is freeing an invalid pgd.
Without this check the crash happens in an innocent place and it's not
obvious why it breaks.
Andrea
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET] Linux 2.4.20-jam0
2002-11-30 14:45 ` J.A. Magallon
2002-11-30 14:58 ` Sean Neakums
@ 2002-11-30 22:15 ` Andrew Morton
1 sibling, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2002-11-30 22:15 UTC (permalink / raw)
To: J.A. Magallon; +Cc: Lista Linux-Kernel
"J.A. Magallon" wrote:
>
> On 2002.11.30 Andrew Morton wrote:
> >"J.A. Magallon" wrote:
> >>
> >> - Orlov inode allocator for 2.4
> >
> >The Orlov allocator in 2.5 has caused a tremendous performance regression
> >in dbench-on-ext3/ordered-on-scsi.
> >
> >I don't know why yet - I doubt if it's due to the allocator itself - more
> >likely an IO scheduling bug in ext3, or a bug in the 2.5 elevator.
> >
> >There is no such regression on IDE - presumably write caching is covering
> >up the problem.
> >
>
> Is there any way I can test that? All my drives are SCSI, and I can,
> for example, remount with 'orlov' or 'oldalloc'...
It is specific to SMP, and for some reason doesn't manifest with
IDE hardware.
See
http://sourceforge.net/mailarchive/forum.php?thread_id=1365460&forum_id=6379
for the analysis.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET] Linux 2.4.20-jam0
2002-11-30 17:50 ` Andrea Arcangeli
@ 2002-11-30 23:36 ` J.A. Magallon
2002-12-01 1:55 ` J.A. Magallon
1 sibling, 0 replies; 12+ messages in thread
From: J.A. Magallon @ 2002-11-30 23:36 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Lista Linux-Kernel
On 2002.11.30 Andrea Arcangeli wrote:
>On Sat, Nov 30, 2002 at 12:38:07AM +0100, J.A. Magallon wrote:
>> - reverted the fast-pte part of -aa. Still have to try again
>> to see if it is more stable now.
>
>AFAIK this was reproduced by Srihari on nohighmem, so it must be that
>somebody is calling free_pgd_fast on a pgd that cannot be re-used.
>Can you try this patch on top of 2.4.20rc2aa1? (Or on jam0 after backing
>out the fast-pte removal, which would otherwise prevent the debugging
>check from triggering.)
>
Yes, I will try. I hope I exercise the part of the kernel that triggers this.
--
J.A. Magallon <jamagallon@able.es> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.1 (Cooker) for i586
Linux 2.4.20-jam1 (gcc 3.2 (Mandrake Linux 9.1 3.2-4mdk))
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHSET] Linux 2.4.20-jam0
2002-11-30 17:50 ` Andrea Arcangeli
2002-11-30 23:36 ` J.A. Magallon
@ 2002-12-01 1:55 ` J.A. Magallon
1 sibling, 0 replies; 12+ messages in thread
From: J.A. Magallon @ 2002-12-01 1:55 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Lista Linux-Kernel
On 2002.11.30 Andrea Arcangeli wrote:
>On Sat, Nov 30, 2002 at 12:38:07AM +0100, J.A. Magallon wrote:
>> - reverted the fast-pte part of -aa. Still have to try again
>> to see if it is more stable now.
>
>AFAIK this was reproduced by Srihari on nohighmem, so it must be that
>somebody is calling free_pgd_fast on a pgd that cannot be re-used.
>Can you try this patch on top of 2.4.20rc2aa1? (Or on jam0 after backing
>out the fast-pte removal, which would otherwise prevent the debugging
>check from triggering.)
>
I suppose this will be useless (tainted ;)).
BTW, what does the symbol address mismatch mean?
ksymoops 2.4.8 on i686 2.4.20-jam1. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.20-jam1/ (default)
-m /boot/System.map-2.4.20-jam1 (default)
Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.
Warning (compare_maps): mismatch on symbol __nvsym03120 , nvdriver says 692dac20, /lib/modules/2.4.20-jam1/video/nvdriver.o says 692d3560. Ignoring /lib/modules/2.4.20-jam1/video/nvdriver.o entry
Dec 1 02:35:57 werewolf kernel: Unable to handle kernel paging request at virtual address 47000be8
Dec 1 02:35:57 werewolf kernel: 4012060d
Dec 1 02:35:57 werewolf kernel: *pde = 070001e3
Dec 1 02:35:57 werewolf kernel: Oops: 0000 2.4.20-jam1 #4 SMP dom dic 1 00:44:09 CET 2002
Dec 1 02:35:57 werewolf kernel: CPU: 0
Dec 1 02:35:57 werewolf kernel: EIP: 0010:[dup_mmap+285/458] Tainted: P
Dec 1 02:35:57 werewolf kernel: EIP: 0010:[<4012060d>] Tainted: P
Using defaults from ksymoops -t elf32-i386 -a i386
Dec 1 02:35:57 werewolf kernel: EFLAGS: 00010202
Dec 1 02:35:57 werewolf kernel: eax: 42527780 ebx: 4387b6e0 ecx: 00000000 edx: 47000be0
Dec 1 02:35:57 werewolf kernel: esi: 467a5544 edi: 4387b724 ebp: 467a5500 esp: 4504bf28
Dec 1 02:35:57 werewolf kernel: ds: 0018 es: 0018 ss: 0018
Dec 1 02:35:57 werewolf kernel: Process rc (pid: 3543, stackpage=4504b000)
Dec 1 02:35:57 werewolf kernel: Stack: 419afeac 000001f0 46c62a80 4504a000 46c62a8c 42527820 4252783c 4252780c
Dec 1 02:35:57 werewolf kernel: 42527780 4011f65c 42527780 000001f0 fffffff4 5046e000 43273a64 42cd0aa4
Dec 1 02:35:57 werewolf kernel: 00000011 4011fdfb 00000011 5046e000 4504bf98 4504bf98 4504bfa8 00000000
Dec 1 02:35:57 werewolf kernel: Call Trace: [copy_mm+252/352] [do_fork+843/2272] [sys_fork+39/48] [system_call+51/56]
Dec 1 02:35:57 werewolf kernel: Call Trace: [<4011f65c>] [<4011fdfb>] [<40107d07>] [<40109777>]
Dec 1 02:35:57 werewolf kernel: Code: 8b 42 08 8b 48 08 f0 ff 42 14 f6 43 15 08 74 07 f0 ff 89 18
>>EIP; 4012060d <dup_mmap+11d/1ca> <=====
>>eax; 42527780 <[videodev].data.end+c8341/110c21>
>>ebx; 4387b6e0 <[8390].rodata.end+d00291/34bcc11>
>>edx; 47000be0 <[mii].text.end+c8e404/122d884>
>>esi; 467a5544 <[mii].text.end+432d68/122d884>
>>edi; 4387b724 <[8390].rodata.end+d002d5/34bcc11>
>>ebp; 467a5500 <[mii].text.end+432d24/122d884>
>>esp; 4504bf28 <[8390].rodata.end+24d0ad9/34bcc11>
Trace; 4011f65c <copy_mm+fc/160>
Trace; 4011fdfb <do_fork+34b/8e0>
Trace; 40107d07 <sys_fork+27/30>
Trace; 40109777 <system_call+33/38>
Code; 4012060d <dup_mmap+11d/1ca>
00000000 <_EIP>:
Code; 4012060d <dup_mmap+11d/1ca> <=====
0: 8b 42 08 mov 0x8(%edx),%eax <=====
Code; 40120610 <dup_mmap+120/1ca>
3: 8b 48 08 mov 0x8(%eax),%ecx
Code; 40120613 <dup_mmap+123/1ca>
6: f0 ff 42 14 lock incl 0x14(%edx)
Code; 40120617 <dup_mmap+127/1ca>
a: f6 43 15 08 testb $0x8,0x15(%ebx)
Code; 4012061b <dup_mmap+12b/1ca>
e: 74 07 je 17 <_EIP+0x17> 40120624 <dup_mmap+134/1ca>
Code; 4012061d <dup_mmap+12d/1ca>
10: f0 ff 89 18 00 00 00 lock decl 0x18(%ecx)
2 warnings issued. Results may not be reliable.
--
J.A. Magallon <jamagallon@able.es> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.1 (Cooker) for i586
Linux 2.4.20-jam1 (gcc 3.2 (Mandrake Linux 9.1 3.2-4mdk))
^ permalink raw reply [flat|nested] 12+ messages in thread