* Re: [31/36] Large Blocksize: Core piece
[not found] ` <20070828190735.292638294@sgi.com>
@ 2007-08-30 0:11 ` Mingming Cao
2007-08-30 0:12 ` Christoph Lameter
` (2 more replies)
0 siblings, 3 replies; 56+ messages in thread
From: Mingming Cao @ 2007-08-30 0:11 UTC (permalink / raw)
To: clameter
Cc: torvalds, linux-fsdevel, linux-kernel, Christoph Hellwig,
Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On Tue, 2007-08-28 at 12:06 -0700, clameter@sgi.com wrote:
> plain text document attachment (0031-Large-Blocksize-Core-piece.patch)
> Provide an alternate definition for the page_cache_xxx(mapping, ...)
> functions that can determine the current page size from the mapping
> and generate the appropriate shifts, sizes and mask for the page cache
> operations. Change the basic functions that allocate pages for the
> page cache to be able to handle higher order allocations.
>
> Provide a new function
>
> 	mapping_setup(struct address_space *, gfp_t mask, int order)
>
> that allows the setup of a mapping of any compound page order.
>
> mapping_set_gfp_mask() is still provided but it sets mappings to order 0.
> Calls to mapping_set_gfp_mask() must be converted to mapping_setup() in
> order for the filesystem to be able to use larger pages. For some key block
> devices and filesystems the conversion is done here.
>
> mapping_setup() for higher order is only allowed if the mapping does not
> use DMA mappings or HIGHMEM since we do not support bouncing at the moment.
> Thus BUG() on DMA mappings and clear the highmem bit of higher order mappings.
>
> Modify the set_blocksize() function so that an arbitrary blocksize can be set.
> Blocksizes up to order MAX_ORDER - 1 can be set. With 4KB pages and the
> typical MAX_ORDER of 11 this is 4MB. Typically file systems are not only limited by the core
> VM but also by the structure of their internal data structures. The core VM
> limitations fall away with this patch. The functionality provided here
> can do nothing about the internal limitations of filesystems.
>
> Known internal limitations:
>
> Ext2 64k
> XFS 64k
> Reiserfs 8k
> Ext3 4k (rumor has it that changing a constant can remove the limit)
> Ext4 4k
>
There are patches, originally written by Takashi Sato, that support large
block sizes (up to 64KB) in ext2/3/4 and address the directory issue as
well. I just forward ported them and will post them in a separate thread.
I haven't had a chance to integrate them with your patch yet (next step).
thanks,
Mingming
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> ---
> block/Kconfig | 17 ++++++
> drivers/block/rd.c | 6 ++-
> fs/block_dev.c | 29 +++++++---
> fs/buffer.c | 4 +-
> fs/inode.c | 7 ++-
> fs/xfs/linux-2.6/xfs_buf.c | 3 +-
> include/linux/buffer_head.h | 12 ++++-
> include/linux/fs.h | 5 ++
> include/linux/pagemap.h | 121 ++++++++++++++++++++++++++++++++++++++++--
> mm/filemap.c | 17 ++++--
> 10 files changed, 192 insertions(+), 29 deletions(-)
>
> Index: linux-2.6/block/Kconfig
> ===================================================================
> --- linux-2.6.orig/block/Kconfig 2007-08-27 19:22:13.000000000 -0700
> +++ linux-2.6/block/Kconfig 2007-08-27 21:16:38.000000000 -0700
> @@ -62,6 +62,20 @@ config BLK_DEV_BSG
> protocols (e.g. Task Management Functions and SMP in Serial
> Attached SCSI).
>
> +#
> +# The functions to switch on larger pages in a filesystem will return an error
> +# if the gfp flags for a mapping require only DMA pages. Highmem will always
> +# be switched off for higher order mappings.
> +#
> +config LARGE_BLOCKSIZE
> + bool "Support blocksizes larger than page size"
> + default n
> + depends on EXPERIMENTAL
> + help
> + Allows the page cache to support higher orders of pages. Higher
> + order page cache pages may be useful to increase I/O performance
> +	  and support special devices like CDs, DVDs and Flash.
> +
> endif # BLOCK
>
> source block/Kconfig.iosched
> Index: linux-2.6/drivers/block/rd.c
> ===================================================================
> --- linux-2.6.orig/drivers/block/rd.c 2007-08-27 20:59:27.000000000 -0700
> +++ linux-2.6/drivers/block/rd.c 2007-08-27 21:10:38.000000000 -0700
> @@ -121,7 +121,8 @@ static void make_page_uptodate(struct pa
> }
> } while ((bh = bh->b_this_page) != head);
> } else {
> - memset(page_address(page), 0, page_cache_size(page_mapping(page)));
> + memset(page_address(page), 0,
> + page_cache_size(page_mapping(page)));
> }
> flush_dcache_page(page);
> SetPageUptodate(page);
> @@ -380,7 +381,8 @@ static int rd_open(struct inode *inode,
> gfp_mask = mapping_gfp_mask(mapping);
> gfp_mask &= ~(__GFP_FS|__GFP_IO);
> gfp_mask |= __GFP_HIGH;
> - mapping_set_gfp_mask(mapping, gfp_mask);
> + mapping_setup(mapping, gfp_mask,
> + page_cache_blkbits_to_order(inode->i_blkbits));
> }
>
> return 0;
> Index: linux-2.6/fs/block_dev.c
> ===================================================================
> --- linux-2.6.orig/fs/block_dev.c 2007-08-27 19:22:13.000000000 -0700
> +++ linux-2.6/fs/block_dev.c 2007-08-27 21:10:38.000000000 -0700
> @@ -63,36 +63,46 @@ static void kill_bdev(struct block_devic
> return;
> invalidate_bh_lrus();
> truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
> -}
> +}
>
> int set_blocksize(struct block_device *bdev, int size)
> {
> - /* Size must be a power of two, and between 512 and PAGE_SIZE */
> - if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
> + int order;
> +
> + if (size > (PAGE_SIZE << (MAX_ORDER - 1)) ||
> + size < 512 || !is_power_of_2(size))
> return -EINVAL;
>
> /* Size cannot be smaller than the size supported by the device */
> if (size < bdev_hardsect_size(bdev))
> return -EINVAL;
>
> + order = page_cache_blocksize_to_order(size);
> +
> /* Don't change the size if it is same as current */
> if (bdev->bd_block_size != size) {
> + int bits = blksize_bits(size);
> + struct address_space *mapping =
> + bdev->bd_inode->i_mapping;
> +
> sync_blockdev(bdev);
> - bdev->bd_block_size = size;
> - bdev->bd_inode->i_blkbits = blksize_bits(size);
> kill_bdev(bdev);
> + bdev->bd_block_size = size;
> + bdev->bd_inode->i_blkbits = bits;
> + mapping_setup(mapping, GFP_NOFS, order);
> }
> return 0;
> }
> -
> EXPORT_SYMBOL(set_blocksize);
>
> int sb_set_blocksize(struct super_block *sb, int size)
> {
> if (set_blocksize(sb->s_bdev, size))
> return 0;
> - /* If we get here, we know size is power of two
> - * and it's value is between 512 and PAGE_SIZE */
> + /*
> + * If we get here, we know size is power of two
> +	 * and its value is valid for the page cache
> + */
> sb->s_blocksize = size;
> sb->s_blocksize_bits = blksize_bits(size);
> return sb->s_blocksize;
> @@ -574,7 +584,8 @@ struct block_device *bdget(dev_t dev)
> inode->i_rdev = dev;
> inode->i_bdev = bdev;
> inode->i_data.a_ops = &def_blk_aops;
> - mapping_set_gfp_mask(&inode->i_data, GFP_USER);
> + mapping_setup(&inode->i_data, GFP_USER,
> + page_cache_blkbits_to_order(inode->i_blkbits));
> inode->i_data.backing_dev_info = &default_backing_dev_info;
> spin_lock(&bdev_lock);
> list_add(&bdev->bd_list, &all_bdevs);
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c 2007-08-27 21:09:19.000000000 -0700
> +++ linux-2.6/fs/buffer.c 2007-08-27 21:10:38.000000000 -0700
> @@ -1090,7 +1090,7 @@ __getblk_slow(struct block_device *bdev,
> {
> /* Size must be multiple of hard sectorsize */
> if (unlikely(size & (bdev_hardsect_size(bdev)-1) ||
> - (size < 512 || size > PAGE_SIZE))) {
> + size < 512 || size > (PAGE_SIZE << (MAX_ORDER - 1)))) {
> printk(KERN_ERR "getblk(): invalid block size %d requested\n",
> size);
> printk(KERN_ERR "hardsect size: %d\n",
> @@ -1811,7 +1811,7 @@ static int __block_prepare_write(struct
> if (block_end > to || block_start < from)
> zero_user_segments(page,
> to, block_end,
> - block_start, from)
> + block_start, from);
> continue;
> }
> }
> Index: linux-2.6/fs/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/inode.c 2007-08-27 19:22:13.000000000 -0700
> +++ linux-2.6/fs/inode.c 2007-08-27 21:10:38.000000000 -0700
> @@ -145,7 +145,8 @@ static struct inode *alloc_inode(struct
> mapping->a_ops = &empty_aops;
> mapping->host = inode;
> mapping->flags = 0;
> - mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE);
> + mapping_setup(mapping, GFP_HIGHUSER_PAGECACHE,
> + page_cache_blkbits_to_order(inode->i_blkbits));
> mapping->assoc_mapping = NULL;
> mapping->backing_dev_info = &default_backing_dev_info;
>
> @@ -243,7 +244,7 @@ void clear_inode(struct inode *inode)
> {
> might_sleep();
> invalidate_inode_buffers(inode);
> -
> +
> BUG_ON(inode->i_data.nrpages);
> BUG_ON(!(inode->i_state & I_FREEING));
> BUG_ON(inode->i_state & I_CLEAR);
> @@ -528,7 +529,7 @@ repeat:
> * for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE.
> * If HIGHMEM pages are unsuitable or it is known that pages allocated
> * for the page cache are not reclaimable or migratable,
> - * mapping_set_gfp_mask() must be called with suitable flags on the
> + * mapping_setup() must be called with suitable flags and bits on the
> * newly created inode's mapping
> *
> */
> Index: linux-2.6/fs/xfs/linux-2.6/xfs_buf.c
> ===================================================================
> --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_buf.c 2007-08-27 19:22:13.000000000 -0700
> +++ linux-2.6/fs/xfs/linux-2.6/xfs_buf.c 2007-08-27 21:10:38.000000000 -0700
> @@ -1547,7 +1547,8 @@ xfs_mapping_buftarg(
> mapping = &inode->i_data;
> mapping->a_ops = &mapping_aops;
> mapping->backing_dev_info = bdi;
> - mapping_set_gfp_mask(mapping, GFP_NOFS);
> + mapping_setup(mapping, GFP_NOFS,
> + page_cache_blkbits_to_order(inode->i_blkbits));
> btp->bt_mapping = mapping;
> return 0;
> }
> Index: linux-2.6/include/linux/buffer_head.h
> ===================================================================
> --- linux-2.6.orig/include/linux/buffer_head.h 2007-08-27 19:22:13.000000000 -0700
> +++ linux-2.6/include/linux/buffer_head.h 2007-08-27 21:10:38.000000000 -0700
> @@ -129,7 +129,17 @@ BUFFER_FNS(Ordered, ordered)
> BUFFER_FNS(Eopnotsupp, eopnotsupp)
> BUFFER_FNS(Unwritten, unwritten)
>
> -#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
> +static inline unsigned long bh_offset(struct buffer_head *bh)
> +{
> + /*
> + * No mapping available. Use page struct to obtain
> + * order.
> + */
> + unsigned long mask = compound_size(bh->b_page) - 1;
> +
> + return (unsigned long)bh->b_data & mask;
> +}
> +
> #define touch_buffer(bh) mark_page_accessed(bh->b_page)
>
> /* If we *know* page->private refers to buffer_heads */
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h 2007-08-27 19:22:13.000000000 -0700
> +++ linux-2.6/include/linux/fs.h 2007-08-27 21:10:38.000000000 -0700
> @@ -446,6 +446,11 @@ struct address_space {
> spinlock_t i_mmap_lock; /* protect tree, count, list */
> unsigned int truncate_count; /* Cover race condition with truncate */
> unsigned long nrpages; /* number of total pages */
> +#ifdef CONFIG_LARGE_BLOCKSIZE
> + loff_t offset_mask; /* Mask to get to offset bits */
> + unsigned int order; /* Page order of the pages in here */
> + unsigned int shift; /* Shift of index */
> +#endif
> pgoff_t writeback_index;/* writeback starts here */
> const struct address_space_operations *a_ops; /* methods */
> unsigned long flags; /* error bits/gfp mask */
> Index: linux-2.6/include/linux/pagemap.h
> ===================================================================
> --- linux-2.6.orig/include/linux/pagemap.h 2007-08-27 19:29:55.000000000 -0700
> +++ linux-2.6/include/linux/pagemap.h 2007-08-27 21:15:58.000000000 -0700
> @@ -39,10 +39,35 @@ static inline gfp_t mapping_gfp_mask(str
> * This is non-atomic. Only to be used before the mapping is activated.
> * Probably needs a barrier...
> */
> -static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
> +static inline void mapping_setup(struct address_space *m,
> + gfp_t mask, int order)
> {
> m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) |
> (__force unsigned long)mask;
> +
> +#ifdef CONFIG_LARGE_BLOCKSIZE
> + m->order = order;
> + m->shift = order + PAGE_SHIFT;
> + m->offset_mask = (PAGE_SIZE << order) - 1;
> + if (order) {
> + /*
> + * Bouncing is not supported. Requests for DMA
> + * memory will not work
> + */
> + BUG_ON(m->flags & (__GFP_DMA|__GFP_DMA32));
> + /*
> + * Bouncing not supported. We cannot use HIGHMEM
> + */
> + m->flags &= ~__GFP_HIGHMEM;
> + m->flags |= __GFP_COMP;
> + /*
> + * If we could raise the kswapd order then it should be
> + * done here.
> + *
> + * raise_kswapd_order(order);
> + */
> + }
> +#endif
> }
>
> /*
> @@ -62,6 +87,78 @@ static inline void mapping_set_gfp_mask(
> #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
>
> /*
> + * The next set of functions allow to write code that is capable of dealing
> + * with multiple page sizes.
> + */
> +#ifdef CONFIG_LARGE_BLOCKSIZE
> +/*
> + * Determine page order from the blkbits in the inode structure
> + */
> +static inline int page_cache_blkbits_to_order(int shift)
> +{
> + BUG_ON(shift < 9);
> +
> + if (shift < PAGE_SHIFT)
> + return 0;
> +
> + return shift - PAGE_SHIFT;
> +}
> +
> +/*
> + * Determine page order from a given blocksize
> + */
> +static inline int page_cache_blocksize_to_order(unsigned long size)
> +{
> + return page_cache_blkbits_to_order(ilog2(size));
> +}
> +
> +static inline int mapping_order(struct address_space *a)
> +{
> + return a->order;
> +}
> +
> +static inline int page_cache_shift(struct address_space *a)
> +{
> + return a->shift;
> +}
> +
> +static inline unsigned int page_cache_size(struct address_space *a)
> +{
> + return a->offset_mask + 1;
> +}
> +
> +static inline loff_t page_cache_mask(struct address_space *a)
> +{
> + return ~a->offset_mask;
> +}
> +
> +static inline unsigned int page_cache_offset(struct address_space *a,
> + loff_t pos)
> +{
> + return pos & a->offset_mask;
> +}
> +#else
> +/*
> + * Kernel configured for a fixed PAGE_SIZEd page cache
> + */
> +static inline int page_cache_blkbits_to_order(int shift)
> +{
> + if (shift < 9)
> + return -EINVAL;
> + if (shift > PAGE_SHIFT)
> + return -EINVAL;
> + return 0;
> +}
> +
> +static inline int page_cache_blocksize_to_order(unsigned long size)
> +{
> + if (size >= 512 && size <= PAGE_SIZE)
> + return 0;
> +
> + return -EINVAL;
> +}
> +
> +/*
> * Functions that are currently setup for a fixed PAGE_SIZEd. The use of
> * these will allow a variable page size pagecache in the future.
> */
> @@ -90,6 +187,7 @@ static inline unsigned int page_cache_of
> {
> return pos & ~PAGE_MASK;
> }
> +#endif
>
> static inline pgoff_t page_cache_index(struct address_space *a,
> loff_t pos)
> @@ -112,27 +210,37 @@ static inline loff_t page_cache_pos(stru
> return ((loff_t)index << page_cache_shift(a)) + offset;
> }
>
> +/*
> + * Legacy function. Only supports order 0 pages.
> + */
> +static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
> +{
> + BUG_ON(mapping_order(m));
> + mapping_setup(m, mask, 0);
> +}
> +
> #define page_cache_get(page) get_page(page)
> #define page_cache_release(page) put_page(page)
> void release_pages(struct page **pages, int nr, int cold);
>
> #ifdef CONFIG_NUMA
> -extern struct page *__page_cache_alloc(gfp_t gfp);
> +extern struct page *__page_cache_alloc(gfp_t gfp, int);
> #else
> -static inline struct page *__page_cache_alloc(gfp_t gfp)
> +static inline struct page *__page_cache_alloc(gfp_t gfp, int order)
> {
> - return alloc_pages(gfp, 0);
> + return alloc_pages(gfp, order);
> }
> #endif
>
> static inline struct page *page_cache_alloc(struct address_space *x)
> {
> - return __page_cache_alloc(mapping_gfp_mask(x));
> + return __page_cache_alloc(mapping_gfp_mask(x), mapping_order(x));
> }
>
> static inline struct page *page_cache_alloc_cold(struct address_space *x)
> {
> - return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
> + return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD,
> + mapping_order(x));
> }
>
> typedef int filler_t(void *, struct page *);
> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c 2007-08-27 21:09:19.000000000 -0700
> +++ linux-2.6/mm/filemap.c 2007-08-27 21:14:55.000000000 -0700
> @@ -471,13 +471,13 @@ int add_to_page_cache_lru(struct page *p
> }
>
> #ifdef CONFIG_NUMA
> -struct page *__page_cache_alloc(gfp_t gfp)
> +struct page *__page_cache_alloc(gfp_t gfp, int order)
> {
> if (cpuset_do_page_mem_spread()) {
> int n = cpuset_mem_spread_node();
> - return alloc_pages_node(n, gfp, 0);
> + return alloc_pages_node(n, gfp, order);
> }
> - return alloc_pages(gfp, 0);
> + return alloc_pages(gfp, order);
> }
> EXPORT_SYMBOL(__page_cache_alloc);
> #endif
> @@ -678,7 +678,7 @@ repeat:
> if (!page) {
> if (!cached_page) {
> cached_page =
> - __page_cache_alloc(gfp_mask);
> + __page_cache_alloc(gfp_mask, mapping_order(mapping));
> if (!cached_page)
> return NULL;
> }
> @@ -818,7 +818,8 @@ grab_cache_page_nowait(struct address_sp
> page_cache_release(page);
> return NULL;
> }
> - page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
> + page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS,
> + mapping_order(mapping));
> if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
> page_cache_release(page);
> page = NULL;
> @@ -1479,6 +1480,12 @@ int generic_file_mmap(struct file * file
> {
> struct address_space *mapping = file->f_mapping;
>
> + /*
> + * Forbid mmap access to higher order mappings.
> + */
> + if (mapping_order(mapping))
> + return -ENOSYS;
> +
> if (!mapping->a_ops->readpage)
> return -ENOEXEC;
> file_accessed(file);
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [31/36] Large Blocksize: Core piece
2007-08-30 0:11 ` [31/36] Large Blocksize: Core piece Mingming Cao
@ 2007-08-30 0:12 ` Christoph Lameter
2007-08-30 0:47 ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
2007-08-30 0:47 ` [RFC 2/4]ext2: fix rec_len overflow with 64KB block size Mingming Cao
2 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2007-08-30 0:12 UTC (permalink / raw)
To: Mingming Cao
Cc: torvalds, linux-fsdevel, linux-kernel, Christoph Hellwig,
Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On Wed, 29 Aug 2007, Mingming Cao wrote:
> > Known internal limitations:
> >
> > Ext2 64k
> > XFS 64k
> > Reiserfs 8k
> > Ext3 4k (rumor has it that changing a constant can remove the limit)
> > Ext4 4k
> >
>
> There are patches original worked by Takashi Sato to support large block
> size (up to 64k) in ext2/3/4, which addressed the directory issue as
> well. I just forward ported. Will posted it in a separate thread.
> Haven't get a chance to integrated with your patch yet (next step).
Ahh. Great. Keep me posted.
* [RFC 1/4] Large Blocksize support for Ext2/3/4
2007-08-30 0:11 ` [31/36] Large Blocksize: Core piece Mingming Cao
2007-08-30 0:12 ` Christoph Lameter
@ 2007-08-30 0:47 ` Mingming Cao
2007-08-30 0:59 ` Christoph Lameter
` (3 more replies)
2007-08-30 0:47 ` [RFC 2/4]ext2: fix rec_len overflow with 64KB block size Mingming Cao
2 siblings, 4 replies; 56+ messages in thread
From: Mingming Cao @ 2007-08-30 0:47 UTC (permalink / raw)
To: clameter, linux-fsdevel
Cc: adilger, sho, ext4 development, linux-fsdevel, linux-kernel
The next 4 patches support large block size (up to PAGESIZE, max 64KB)
for ext2/3/4, originally from Takashi Sato.
http://marc.info/?l=linux-ext4&m=115768873518400&w=2
It's quite simple to support large block sizes in ext2/3/4; mostly this
just means enlarging the block size limit. But it is NOT possible to have 64kB
blocksize on ext2/3/4 without some changes to the directory handling
code. The reason is that an empty 64kB directory block would have a
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem. The proposed solution is to put 2 empty records in such
a directory, or to special-case an impossible value like rec_len =
0xffff to handle this.
The Patch-set consists of the following 4 patches.
[1/4] ext2/3/4: enlarge blocksize
- Allow blocksize up to pagesize
[2/4] ext2: fix rec_len overflow
- prevent rec_len from overflow with 64KB blocksize
[3/4] ext3: fix rec_len overflow
- prevent rec_len from overflow with 64KB blocksize
[4/4] ext4: fix rec_len overflow
- prevent rec_len from overflow with 64KB blocksize
Just rebased to 2.6.23-rc4 and against the ext4 patch queue. Compile tested only.
Next steps:
e2fsprogs changes are needed to be able to test this feature, as mkfs needs
to be taught not to assume rec_len is always equal to the blocksize.
Will try it with Christoph Lameter's large block patch next.
Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
---
fs/ext2/super.c | 2 +-
fs/ext3/super.c | 5 ++++-
fs/ext4/super.c | 5 +++++
include/linux/ext2_fs.h | 4 ++--
include/linux/ext3_fs.h | 4 ++--
include/linux/ext4_fs.h | 4 ++--
6 files changed, 16 insertions(+), 8 deletions(-)
Index: linux-2.6.23-rc3/fs/ext2/super.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext2/super.c 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext2/super.c 2007-08-29 15:22:29.000000000 -0700
@@ -775,7 +775,7 @@ static int ext2_fill_super(struct super_
brelse(bh);
if (!sb_set_blocksize(sb, blocksize)) {
- printk(KERN_ERR "EXT2-fs: blocksize too small for device.\n");
+ printk(KERN_ERR "EXT2-fs: bad blocksize %d.\n", blocksize);
goto failed_sbi;
}
Index: linux-2.6.23-rc3/fs/ext3/super.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext3/super.c 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext3/super.c 2007-08-29 15:22:29.000000000 -0700
@@ -1549,7 +1549,10 @@ static int ext3_fill_super (struct super
}
brelse (bh);
- sb_set_blocksize(sb, blocksize);
+ if (!sb_set_blocksize(sb, blocksize)) {
+ printk(KERN_ERR "EXT3-fs: bad blocksize %d.\n", blocksize);
+ goto out_fail;
+ }
logic_sb_block = (sb_block * EXT3_MIN_BLOCK_SIZE) / blocksize;
offset = (sb_block * EXT3_MIN_BLOCK_SIZE) % blocksize;
bh = sb_bread(sb, logic_sb_block);
Index: linux-2.6.23-rc3/fs/ext4/super.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext4/super.c 2007-08-28 11:09:40.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext4/super.c 2007-08-29 15:24:08.000000000 -0700
@@ -1626,6 +1626,11 @@ static int ext4_fill_super (struct super
goto out_fail;
}
+ if (!sb_set_blocksize(sb, blocksize)) {
+ printk(KERN_ERR "EXT4-fs: bad blocksize %d.\n", blocksize);
+ goto out_fail;
+ }
+
/*
* The ext4 superblock will not be buffer aligned for other than 1kB
* block sizes. We need to calculate the offset from buffer start.
Index: linux-2.6.23-rc3/include/linux/ext2_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext2_fs.h 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext2_fs.h 2007-08-29 15:22:29.000000000 -0700
@@ -86,8 +86,8 @@ static inline struct ext2_sb_info *EXT2_
* Macro-instructions used to manage several block sizes
*/
#define EXT2_MIN_BLOCK_SIZE 1024
-#define EXT2_MAX_BLOCK_SIZE 4096
-#define EXT2_MIN_BLOCK_LOG_SIZE 10
+#define EXT2_MAX_BLOCK_SIZE 65536
+#define EXT2_MIN_BLOCK_LOG_SIZE 10
#ifdef __KERNEL__
# define EXT2_BLOCK_SIZE(s) ((s)->s_blocksize)
#else
Index: linux-2.6.23-rc3/include/linux/ext3_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext3_fs.h 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext3_fs.h 2007-08-29 15:22:29.000000000 -0700
@@ -76,8 +76,8 @@
* Macro-instructions used to manage several block sizes
*/
#define EXT3_MIN_BLOCK_SIZE 1024
-#define EXT3_MAX_BLOCK_SIZE 4096
-#define EXT3_MIN_BLOCK_LOG_SIZE 10
+#define EXT3_MAX_BLOCK_SIZE 65536
+#define EXT3_MIN_BLOCK_LOG_SIZE 10
#ifdef __KERNEL__
# define EXT3_BLOCK_SIZE(s) ((s)->s_blocksize)
#else
Index: linux-2.6.23-rc3/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext4_fs.h 2007-08-28 11:09:40.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext4_fs.h 2007-08-29 15:22:29.000000000 -0700
@@ -104,8 +104,8 @@ struct ext4_allocation_request {
* Macro-instructions used to manage several block sizes
*/
#define EXT4_MIN_BLOCK_SIZE 1024
-#define EXT4_MAX_BLOCK_SIZE 4096
-#define EXT4_MIN_BLOCK_LOG_SIZE 10
+#define EXT4_MAX_BLOCK_SIZE 65536
+#define EXT4_MIN_BLOCK_LOG_SIZE 10
#ifdef __KERNEL__
# define EXT4_BLOCK_SIZE(s) ((s)->s_blocksize)
#else
* [RFC 2/4]ext2: fix rec_len overflow with 64KB block size
2007-08-30 0:11 ` [31/36] Large Blocksize: Core piece Mingming Cao
2007-08-30 0:12 ` Christoph Lameter
2007-08-30 0:47 ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
@ 2007-08-30 0:47 ` Mingming Cao
2 siblings, 0 replies; 56+ messages in thread
From: Mingming Cao @ 2007-08-30 0:47 UTC (permalink / raw)
To: clameter, linux-fsdevel; +Cc: linux-kernel, adilger, sho, ext4 development
[2/4] ext2: fix rec_len overflow
- prevent rec_len from overflow with 64KB blocksize
Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/ext2/dir.c | 46 ++++++++++++++++++++++++++++++++++++----------
include/linux/ext2_fs.h | 13 +++++++++++++
2 files changed, 49 insertions(+), 10 deletions(-)
Index: linux-2.6.23-rc3/fs/ext2/dir.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext2/dir.c 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext2/dir.c 2007-08-29 15:29:51.000000000 -0700
@@ -94,9 +94,9 @@ static void ext2_check_page(struct page
goto out;
}
for (offs = 0; offs <= limit - EXT2_DIR_REC_LEN(1); offs += rec_len) {
+ offs = EXT2_DIR_ADJUST_TAIL_OFFS(offs, chunk_size);
p = (ext2_dirent *)(kaddr + offs);
rec_len = le16_to_cpu(p->rec_len);
-
if (rec_len < EXT2_DIR_REC_LEN(1))
goto Eshort;
if (rec_len & 3)
@@ -108,6 +108,7 @@ static void ext2_check_page(struct page
if (le32_to_cpu(p->inode) > max_inumber)
goto Einumber;
}
+ offs = EXT2_DIR_ADJUST_TAIL_OFFS(offs, chunk_size);
if (offs != limit)
goto Eend;
out:
@@ -283,6 +284,7 @@ ext2_readdir (struct file * filp, void *
de = (ext2_dirent *)(kaddr+offset);
limit = kaddr + ext2_last_byte(inode, n) - EXT2_DIR_REC_LEN(1);
for ( ;(char*)de <= limit; de = ext2_next_entry(de)) {
+ de = EXT2_DIR_ADJUST_TAIL_ADDR(kaddr, de, sb->s_blocksize);
if (de->rec_len == 0) {
ext2_error(sb, __FUNCTION__,
"zero-length directory entry");
@@ -305,8 +307,10 @@ ext2_readdir (struct file * filp, void *
return 0;
}
}
+ filp->f_pos = EXT2_DIR_ADJUST_TAIL_OFFS(filp->f_pos, sb->s_blocksize);
filp->f_pos += le16_to_cpu(de->rec_len);
}
+ filp->f_pos = EXT2_DIR_ADJUST_TAIL_OFFS(filp->f_pos, sb->s_blocksize);
ext2_put_page(page);
}
return 0;
@@ -343,13 +347,14 @@ struct ext2_dir_entry_2 * ext2_find_entr
start = 0;
n = start;
do {
- char *kaddr;
+ char *kaddr, *page_start;
page = ext2_get_page(dir, n);
if (!IS_ERR(page)) {
- kaddr = page_address(page);
+ kaddr = page_start = page_address(page);
de = (ext2_dirent *) kaddr;
kaddr += ext2_last_byte(dir, n) - reclen;
while ((char *) de <= kaddr) {
+ de = EXT2_DIR_ADJUST_TAIL_ADDR(page_start, de, dir->i_sb->s_blocksize);
if (de->rec_len == 0) {
ext2_error(dir->i_sb, __FUNCTION__,
"zero-length directory entry");
@@ -416,6 +421,7 @@ void ext2_set_link(struct inode *dir, st
unsigned to = from + le16_to_cpu(de->rec_len);
int err;
+ to = EXT2_DIR_ADJUST_TAIL_OFFS(to, inode->i_sb->s_blocksize);
lock_page(page);
err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
BUG_ON(err);
@@ -446,6 +452,7 @@ int ext2_add_link (struct dentry *dentry
char *kaddr;
unsigned from, to;
int err;
+ char *page_start = NULL;
/*
* We take care of directory expansion in the same loop.
@@ -460,16 +467,28 @@ int ext2_add_link (struct dentry *dentry
if (IS_ERR(page))
goto out;
lock_page(page);
- kaddr = page_address(page);
+ kaddr = page_start = page_address(page);
dir_end = kaddr + ext2_last_byte(dir, n);
de = (ext2_dirent *)kaddr;
- kaddr += PAGE_CACHE_SIZE - reclen;
+ if (chunk_size < EXT2_DIR_MAX_REC_LEN) {
+ kaddr += PAGE_CACHE_SIZE - reclen;
+ } else {
+ kaddr += PAGE_CACHE_SIZE -
+ (chunk_size - EXT2_DIR_MAX_REC_LEN) - reclen;
+ }
while ((char *)de <= kaddr) {
+ de = EXT2_DIR_ADJUST_TAIL_ADDR(page_start, de, chunk_size);
if ((char *)de == dir_end) {
/* We hit i_size */
name_len = 0;
- rec_len = chunk_size;
- de->rec_len = cpu_to_le16(chunk_size);
+ if (chunk_size < EXT2_DIR_MAX_REC_LEN) {
+ rec_len = chunk_size;
+ de->rec_len = cpu_to_le16(chunk_size);
+ } else {
+ rec_len = EXT2_DIR_MAX_REC_LEN;
+ de->rec_len =
+ cpu_to_le16(EXT2_DIR_MAX_REC_LEN);
+ }
de->inode = 0;
goto got_it;
}
@@ -499,6 +518,7 @@ int ext2_add_link (struct dentry *dentry
got_it:
from = (char*)de - (char*)page_address(page);
to = from + rec_len;
+ to = EXT2_DIR_ADJUST_TAIL_OFFS(to, chunk_size);
err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
if (err)
goto out_unlock;
@@ -541,6 +561,7 @@ int ext2_delete_entry (struct ext2_dir_e
ext2_dirent * de = (ext2_dirent *) (kaddr + from);
int err;
+ to = EXT2_DIR_ADJUST_TAIL_OFFS(to, inode->i_sb->s_blocksize);
while ((char*)de < (char*)dir) {
if (de->rec_len == 0) {
ext2_error(inode->i_sb, __FUNCTION__,
@@ -598,7 +619,11 @@ int ext2_make_empty(struct inode *inode,
de = (struct ext2_dir_entry_2 *)(kaddr + EXT2_DIR_REC_LEN(1));
de->name_len = 2;
- de->rec_len = cpu_to_le16(chunk_size - EXT2_DIR_REC_LEN(1));
+ if (chunk_size < EXT2_DIR_MAX_REC_LEN) {
+ de->rec_len = cpu_to_le16(chunk_size - EXT2_DIR_REC_LEN(1));
+ } else {
+ de->rec_len = cpu_to_le16(EXT2_DIR_MAX_REC_LEN - EXT2_DIR_REC_LEN(1));
+ }
de->inode = cpu_to_le32(parent->i_ino);
memcpy (de->name, "..\0", 4);
ext2_set_de_type (de, inode);
@@ -618,18 +643,19 @@ int ext2_empty_dir (struct inode * inode
unsigned long i, npages = dir_pages(inode);
for (i = 0; i < npages; i++) {
- char *kaddr;
+ char *kaddr, *page_start;
ext2_dirent * de;
page = ext2_get_page(inode, i);
if (IS_ERR(page))
continue;
- kaddr = page_address(page);
+ kaddr = page_start = page_address(page);
de = (ext2_dirent *)kaddr;
kaddr += ext2_last_byte(inode, i) - EXT2_DIR_REC_LEN(1);
while ((char *)de <= kaddr) {
+ de = EXT2_DIR_ADJUST_TAIL_ADDR(page_start, de, inode->i_sb->s_blocksize);
if (de->rec_len == 0) {
ext2_error(inode->i_sb, __FUNCTION__,
"zero-length directory entry");
Index: linux-2.6.23-rc3/include/linux/ext2_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext2_fs.h 2007-08-29 15:22:29.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext2_fs.h 2007-08-29 15:29:51.000000000 -0700
@@ -557,5 +557,18 @@ enum {
#define EXT2_DIR_ROUND (EXT2_DIR_PAD - 1)
#define EXT2_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT2_DIR_ROUND) & \
~EXT2_DIR_ROUND)
+#define EXT2_DIR_MAX_REC_LEN 65532
+
+/*
+ * Align a tail offset(address) to the end of a directory block
+ */
+#define EXT2_DIR_ADJUST_TAIL_OFFS(offs, bsize) \
+ ((((offs) & ((bsize) -1)) == EXT2_DIR_MAX_REC_LEN) ? \
+ ((offs) + (bsize) - EXT2_DIR_MAX_REC_LEN):(offs))
+
+#define EXT2_DIR_ADJUST_TAIL_ADDR(page, de, bsize) \
+ (((((char*)(de) - (page)) & ((bsize) - 1)) == EXT2_DIR_MAX_REC_LEN) ? \
+ ((ext2_dirent*)((char*)(de) + (bsize) - EXT2_DIR_MAX_REC_LEN)):(de))
#endif /* _LINUX_EXT2_FS_H */
+
* Re: [RFC 1/4] Large Blocksize support for Ext2/3/4
2007-08-30 0:47 ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
@ 2007-08-30 0:59 ` Christoph Lameter
2007-09-01 0:01 ` Mingming Cao
` (2 subsequent siblings)
3 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2007-08-30 0:59 UTC (permalink / raw)
To: Mingming Cao; +Cc: linux-fsdevel, adilger, sho, ext4 development, linux-kernel
On Wed, 29 Aug 2007, Mingming Cao wrote:
> It's quite simple to support large block size in ext2/3/4, mostly just
> enlarge the block size limit. But it is NOT possible to have 64kB
> blocksize on ext2/3/4 without some changes to the directory handling
> code. The reason is that an empty 64kB directory block would have a
> rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
> the filesystem. The proposed solution is to put 2 empty records in such
> a directory, or to special-case an impossible value like rec_len =
> 0xffff to handle this.
Ahh. Good.
I could add the patch to the large blocksize patchset?
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
[not found] ` <20070828190730.220393749@sgi.com>
@ 2007-08-30 9:20 ` Dmitry Monakhov
2007-08-30 18:14 ` Christoph Lameter
0 siblings, 1 reply; 56+ messages in thread
From: Dmitry Monakhov @ 2007-08-30 9:20 UTC (permalink / raw)
To: clameter
Cc: torvalds, linux-fsdevel, linux-kernel, Christoph Hellwig,
Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On 12:06 Tue 28 Aug , clameter@sgi.com wrote:
> Use page_cache_xxx in fs/buffer.c.
submit_bh() wasn't changed, which means that bio pages may be huge
without respecting the queue restrictions (q->max_hw_segments, etc.).
At least driver/md/raid0 will be broken by your patch.
>
> We have a special situation in set_bh_page() since reiserfs calls that
> function before setting up the mapping. So retrieve the page size
> from the page struct rather than the mapping.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> ---
> fs/buffer.c | 110 +++++++++++++++++++++++++++++++++---------------------------
> 1 file changed, 62 insertions(+), 48 deletions(-)
>
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c 2007-08-28 11:37:13.000000000 -0700
> +++ linux-2.6/fs/buffer.c 2007-08-28 11:37:58.000000000 -0700
> @@ -257,7 +257,7 @@ __find_get_block_slow(struct block_devic
> struct page *page;
> int all_mapped = 1;
>
> - index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits);
> + index = block >> (page_cache_shift(bd_mapping) - bd_inode->i_blkbits);
> page = find_get_page(bd_mapping, index);
> if (!page)
> goto out;
> @@ -697,7 +697,7 @@ static int __set_page_dirty(struct page
>
> if (mapping_cap_account_dirty(mapping)) {
> __inc_zone_page_state(page, NR_FILE_DIRTY);
> - task_io_account_write(PAGE_CACHE_SIZE);
> + task_io_account_write(page_cache_size(mapping));
> }
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);
> @@ -891,10 +891,11 @@ struct buffer_head *alloc_page_buffers(s
> {
> struct buffer_head *bh, *head;
> long offset;
> + unsigned int page_size = page_cache_size(page->mapping);
>
> try_again:
> head = NULL;
> - offset = PAGE_SIZE;
> + offset = page_size;
> while ((offset -= size) >= 0) {
> bh = alloc_buffer_head(GFP_NOFS);
> if (!bh)
> @@ -1426,7 +1427,7 @@ void set_bh_page(struct buffer_head *bh,
> struct page *page, unsigned long offset)
> {
> bh->b_page = page;
> - BUG_ON(offset >= PAGE_SIZE);
> + BUG_ON(offset >= compound_size(page));
> if (PageHighMem(page))
> /*
> * This catches illegal uses and preserves the offset:
> @@ -1605,6 +1606,7 @@ static int __block_write_full_page(struc
> struct buffer_head *bh, *head;
> const unsigned blocksize = 1 << inode->i_blkbits;
> int nr_underway = 0;
> + struct address_space *mapping = inode->i_mapping;
>
> BUG_ON(!PageLocked(page));
>
> @@ -1625,7 +1627,8 @@ static int __block_write_full_page(struc
> * handle that here by just cleaning them.
> */
>
> - block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
> + block = (sector_t)page->index <<
> + (page_cache_shift(mapping) - inode->i_blkbits);
> head = page_buffers(page);
> bh = head;
>
> @@ -1742,7 +1745,7 @@ recover:
> } while ((bh = bh->b_this_page) != head);
> SetPageError(page);
> BUG_ON(PageWriteback(page));
> - mapping_set_error(page->mapping, err);
> + mapping_set_error(mapping, err);
> set_page_writeback(page);
> do {
> struct buffer_head *next = bh->b_this_page;
> @@ -1767,8 +1770,8 @@ static int __block_prepare_write(struct
> struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
>
> BUG_ON(!PageLocked(page));
> - BUG_ON(from > PAGE_CACHE_SIZE);
> - BUG_ON(to > PAGE_CACHE_SIZE);
> + BUG_ON(from > page_cache_size(inode->i_mapping));
> + BUG_ON(to > page_cache_size(inode->i_mapping));
> BUG_ON(from > to);
>
> blocksize = 1 << inode->i_blkbits;
> @@ -1777,7 +1780,8 @@ static int __block_prepare_write(struct
> head = page_buffers(page);
>
> bbits = inode->i_blkbits;
> - block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> + block = (sector_t)page->index <<
> + (page_cache_shift(inode->i_mapping) - bbits);
>
> for(bh = head, block_start = 0; bh != head || !block_start;
> block++, block_start=block_end, bh = bh->b_this_page) {
> @@ -1921,7 +1925,8 @@ int block_read_full_page(struct page *pa
> create_empty_buffers(page, blocksize, 0);
> head = page_buffers(page);
>
> - iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
> + iblock = (sector_t)page->index <<
> + (page_cache_shift(page->mapping) - inode->i_blkbits);
> lblock = (i_size_read(inode)+blocksize-1) >> inode->i_blkbits;
> bh = head;
> nr = 0;
> @@ -2045,7 +2050,7 @@ int generic_cont_expand(struct inode *in
> pgoff_t index;
> unsigned int offset;
>
> - offset = (size & (PAGE_CACHE_SIZE - 1)); /* Within page */
> + offset = page_cache_offset(inode->i_mapping, size); /* Within page */
>
> /* ugh. in prepare/commit_write, if from==to==start of block, we
> ** skip the prepare. make sure we never send an offset for the start
> @@ -2055,7 +2060,7 @@ int generic_cont_expand(struct inode *in
> /* caller must handle this extra byte. */
> offset++;
> }
> - index = size >> PAGE_CACHE_SHIFT;
> + index = page_cache_index(inode->i_mapping, size);
>
> return __generic_cont_expand(inode, size, index, offset);
> }
> @@ -2063,8 +2068,8 @@ int generic_cont_expand(struct inode *in
> int generic_cont_expand_simple(struct inode *inode, loff_t size)
> {
> loff_t pos = size - 1;
> - pgoff_t index = pos >> PAGE_CACHE_SHIFT;
> - unsigned int offset = (pos & (PAGE_CACHE_SIZE - 1)) + 1;
> + pgoff_t index = page_cache_index(inode->i_mapping, pos);
> + unsigned int offset = page_cache_offset(inode->i_mapping, pos) + 1;
>
> /* prepare/commit_write can handle even if from==to==start of block. */
> return __generic_cont_expand(inode, size, index, offset);
> @@ -2086,28 +2091,28 @@ int cont_prepare_write(struct page *page
> unsigned zerofrom;
> unsigned blocksize = 1 << inode->i_blkbits;
>
> - while(page->index > (pgpos = *bytes>>PAGE_CACHE_SHIFT)) {
> + while (page->index > (pgpos = page_cache_index(mapping, *bytes))) {
> status = -ENOMEM;
> new_page = grab_cache_page(mapping, pgpos);
> if (!new_page)
> goto out;
> /* we might sleep */
> - if (*bytes>>PAGE_CACHE_SHIFT != pgpos) {
> + if (page_cache_index(mapping, *bytes) != pgpos) {
> unlock_page(new_page);
> page_cache_release(new_page);
> continue;
> }
> - zerofrom = *bytes & ~PAGE_CACHE_MASK;
> + zerofrom = page_cache_offset(mapping, *bytes);
> if (zerofrom & (blocksize-1)) {
> *bytes |= (blocksize-1);
> (*bytes)++;
> }
> status = __block_prepare_write(inode, new_page, zerofrom,
> - PAGE_CACHE_SIZE, get_block);
> + page_cache_size(mapping), get_block);
> if (status)
> goto out_unmap;
> - zero_user_segment(new_page, zerofrom, PAGE_CACHE_SIZE);
> - generic_commit_write(NULL, new_page, zerofrom, PAGE_CACHE_SIZE);
> + zero_user_segment(new_page, zerofrom, page_cache_size(mapping));
> + generic_commit_write(NULL, new_page, zerofrom, page_cache_size(mapping));
> unlock_page(new_page);
> page_cache_release(new_page);
> }
> @@ -2117,7 +2122,7 @@ int cont_prepare_write(struct page *page
> zerofrom = offset;
> } else {
> /* page covers the boundary, find the boundary offset */
> - zerofrom = *bytes & ~PAGE_CACHE_MASK;
> + zerofrom = page_cache_offset(mapping, *bytes);
>
> /* if we will expand the thing last block will be filled */
> if (to > zerofrom && (zerofrom & (blocksize-1))) {
> @@ -2169,8 +2174,9 @@ int block_commit_write(struct page *page
> int generic_commit_write(struct file *file, struct page *page,
> unsigned from, unsigned to)
> {
> - struct inode *inode = page->mapping->host;
> - loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> + struct address_space *mapping = page->mapping;
> + struct inode *inode = mapping->host;
> + loff_t pos = page_cache_pos(mapping, page->index, to);
> __block_commit_write(inode,page,from,to);
> /*
> * No need to use i_size_read() here, the i_size
> @@ -2206,20 +2212,22 @@ block_page_mkwrite(struct vm_area_struct
> unsigned long end;
> loff_t size;
> int ret = -EINVAL;
> + struct address_space *mapping;
>
> lock_page(page);
> + mapping = page->mapping;
> size = i_size_read(inode);
> - if ((page->mapping != inode->i_mapping) ||
> + if ((mapping != inode->i_mapping) ||
> (page_offset(page) > size)) {
> /* page got truncated out from underneath us */
> goto out_unlock;
> }
>
> /* page is wholly or partially inside EOF */
> - if (((page->index + 1) << PAGE_CACHE_SHIFT) > size)
> - end = size & ~PAGE_CACHE_MASK;
> + if (page_cache_pos(mapping, page->index + 1, 0) > size)
> + end = page_cache_offset(mapping, size);
> else
> - end = PAGE_CACHE_SIZE;
> + end = page_cache_size(mapping);
>
> ret = block_prepare_write(page, 0, end, get_block);
> if (!ret)
> @@ -2258,6 +2266,7 @@ static void end_buffer_read_nobh(struct
> int nobh_prepare_write(struct page *page, unsigned from, unsigned to,
> get_block_t *get_block)
> {
> + struct address_space *mapping = page->mapping;
> struct inode *inode = page->mapping->host;
> const unsigned blkbits = inode->i_blkbits;
> const unsigned blocksize = 1 << blkbits;
> @@ -2265,6 +2274,7 @@ int nobh_prepare_write(struct page *page
> struct buffer_head *read_bh[MAX_BUF_PER_PAGE];
> unsigned block_in_page;
> unsigned block_start;
> + unsigned page_size = page_cache_size(mapping);
> sector_t block_in_file;
> int nr_reads = 0;
> int i;
> @@ -2274,7 +2284,8 @@ int nobh_prepare_write(struct page *page
> if (PageMappedToDisk(page))
> return 0;
>
> - block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
> + block_in_file = (sector_t)page->index <<
> + (page_cache_shift(mapping) - blkbits);
> map_bh.b_page = page;
>
> /*
> @@ -2283,7 +2294,7 @@ int nobh_prepare_write(struct page *page
> * page is fully mapped-to-disk.
> */
> for (block_start = 0, block_in_page = 0;
> - block_start < PAGE_CACHE_SIZE;
> + block_start < page_size;
> block_in_page++, block_start += blocksize) {
> unsigned block_end = block_start + blocksize;
> int create;
> @@ -2372,7 +2383,7 @@ failed:
> * Error recovery is pretty slack. Clear the page and mark it dirty
> * so we'll later zero out any blocks which _were_ allocated.
> */
> - zero_user(page, 0, PAGE_CACHE_SIZE);
> + zero_user(page, 0, page_size);
> SetPageUptodate(page);
> set_page_dirty(page);
> return ret;
> @@ -2386,8 +2397,9 @@ EXPORT_SYMBOL(nobh_prepare_write);
> int nobh_commit_write(struct file *file, struct page *page,
> unsigned from, unsigned to)
> {
> - struct inode *inode = page->mapping->host;
> - loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> + struct address_space *mapping = page->mapping;
> + struct inode *inode = mapping->host;
> + loff_t pos = page_cache_pos(mapping, page->index, to);
>
> SetPageUptodate(page);
> set_page_dirty(page);
> @@ -2407,9 +2419,10 @@ EXPORT_SYMBOL(nobh_commit_write);
> int nobh_writepage(struct page *page, get_block_t *get_block,
> struct writeback_control *wbc)
> {
> - struct inode * const inode = page->mapping->host;
> + struct address_space *mapping = page->mapping;
> + struct inode * const inode = mapping->host;
> loff_t i_size = i_size_read(inode);
> - const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
> + const pgoff_t end_index = page_cache_index(mapping, i_size);
> unsigned offset;
> int ret;
>
> @@ -2418,7 +2431,7 @@ int nobh_writepage(struct page *page, ge
> goto out;
>
> /* Is the page fully outside i_size? (truncate in progress) */
> - offset = i_size & (PAGE_CACHE_SIZE-1);
> + offset = page_cache_offset(mapping, i_size);
> if (page->index >= end_index+1 || !offset) {
> /*
> * The page may have dirty, unmapped buffers. For example,
> @@ -2441,7 +2454,7 @@ int nobh_writepage(struct page *page, ge
> * the page size, the remaining memory is zeroed when mapped, and
> * writes to that region are not written out to the file."
> */
> - zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> + zero_user_segment(page, offset, page_cache_size(mapping));
> out:
> ret = mpage_writepage(page, get_block, wbc);
> if (ret == -EAGAIN)
> @@ -2457,8 +2470,8 @@ int nobh_truncate_page(struct address_sp
> {
> struct inode *inode = mapping->host;
> unsigned blocksize = 1 << inode->i_blkbits;
> - pgoff_t index = from >> PAGE_CACHE_SHIFT;
> - unsigned offset = from & (PAGE_CACHE_SIZE-1);
> + pgoff_t index = page_cache_index(mapping, from);
> + unsigned offset = page_cache_offset(mapping, from);
> unsigned to;
> struct page *page;
> const struct address_space_operations *a_ops = mapping->a_ops;
> @@ -2475,7 +2488,7 @@ int nobh_truncate_page(struct address_sp
> to = (offset + blocksize) & ~(blocksize - 1);
> ret = a_ops->prepare_write(NULL, page, offset, to);
> if (ret == 0) {
> - zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> + zero_user_segment(page, offset, page_cache_size(mapping));
> /*
> * It would be more correct to call aops->commit_write()
> * here, but this is more efficient.
> @@ -2493,8 +2506,8 @@ EXPORT_SYMBOL(nobh_truncate_page);
> int block_truncate_page(struct address_space *mapping,
> loff_t from, get_block_t *get_block)
> {
> - pgoff_t index = from >> PAGE_CACHE_SHIFT;
> - unsigned offset = from & (PAGE_CACHE_SIZE-1);
> + pgoff_t index = page_cache_index(mapping, from);
> + unsigned offset = page_cache_offset(mapping, from);
> unsigned blocksize;
> sector_t iblock;
> unsigned length, pos;
> @@ -2511,8 +2524,8 @@ int block_truncate_page(struct address_s
> return 0;
>
> length = blocksize - length;
> - iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
> -
> + iblock = (sector_t)index <<
> + (page_cache_shift(mapping) - inode->i_blkbits);
> page = grab_cache_page(mapping, index);
> err = -ENOMEM;
> if (!page)
> @@ -2571,9 +2584,10 @@ out:
> int block_write_full_page(struct page *page, get_block_t *get_block,
> struct writeback_control *wbc)
> {
> - struct inode * const inode = page->mapping->host;
> + struct address_space *mapping = page->mapping;
> + struct inode * const inode = mapping->host;
> loff_t i_size = i_size_read(inode);
> - const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
> + const pgoff_t end_index = page_cache_index(mapping, i_size);
> unsigned offset;
>
> /* Is the page fully inside i_size? */
> @@ -2581,7 +2595,7 @@ int block_write_full_page(struct page *p
> return __block_write_full_page(inode, page, get_block, wbc);
>
> /* Is the page fully outside i_size? (truncate in progress) */
> - offset = i_size & (PAGE_CACHE_SIZE-1);
> + offset = page_cache_offset(mapping, i_size);
> if (page->index >= end_index+1 || !offset) {
> /*
> * The page may have dirty, unmapped buffers. For example,
> @@ -2600,7 +2614,7 @@ int block_write_full_page(struct page *p
> * the page size, the remaining memory is zeroed when mapped, and
> * writes to that region are not written out to the file."
> */
> - zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> + zero_user_segment(page, offset, page_cache_size(mapping));
> return __block_write_full_page(inode, page, get_block, wbc);
> }
>
> @@ -2854,7 +2868,7 @@ int try_to_free_buffers(struct page *pag
> * dirty bit from being lost.
> */
> if (ret)
> - cancel_dirty_page(page, PAGE_CACHE_SIZE);
> + cancel_dirty_page(page, page_cache_size(mapping));
> spin_unlock(&mapping->private_lock);
> out:
> if (buffers_to_free) {
>
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-30 9:20 ` [11/36] Use page_cache_xxx in fs/buffer.c Dmitry Monakhov
@ 2007-08-30 18:14 ` Christoph Lameter
2007-08-31 1:47 ` Christoph Lameter
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2007-08-30 18:14 UTC (permalink / raw)
To: Dmitry Monakhov
Cc: torvalds, linux-fsdevel, linux-kernel, Christoph Hellwig,
Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On Thu, 30 Aug 2007, Dmitry Monakhov wrote:
> On 12:06 Tue 28 Aug , clameter@sgi.com wrote:
> > Use page_cache_xxx in fs/buffer.c.
> submit_bh() wasn't changed, which means that bio pages may be huge
> without respecting the queue restrictions (q->max_hw_segments, etc.).
> At least driver/md/raid0 will be broken by your patch.
Hmmm... So we need to check the page size and generate multiple requests
in submit_bh?
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-30 18:14 ` Christoph Lameter
@ 2007-08-31 1:47 ` Christoph Lameter
2007-08-31 6:56 ` Jens Axboe
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2007-08-31 1:47 UTC (permalink / raw)
To: Dmitry Monakhov
Cc: torvalds, linux-fsdevel, linux-kernel, Christoph Hellwig,
Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
totty.lu, H. Peter Anvin, joern, Eric W. Biederman
This may already be handled?
submit_bh() calls submit_bio() which calls __generic_make_request() and
there we do:
if (unlikely(bio_sectors(bio) > q->max_hw_sectors)) {
printk("bio too big device %s (%u > %u)\n",
bdevname(bio->bi_bdev, b),
bio_sectors(bio),
q->max_hw_sectors);
goto end_io;
}
So if we try to push a too large buffer down with submit_bh() we get a
failure.
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-31 1:47 ` Christoph Lameter
@ 2007-08-31 6:56 ` Jens Axboe
2007-08-31 7:03 ` Christoph Lameter
0 siblings, 1 reply; 56+ messages in thread
From: Jens Axboe @ 2007-08-31 6:56 UTC (permalink / raw)
To: Christoph Lameter
Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
Christoph Hellwig, Mel Gorman, William Lee Irwin III,
David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On Thu, Aug 30 2007, Christoph Lameter wrote:
> This may already be handled?
>
> submit_bh() calls submit_bio() which calls __generic_make_request() and
> there we do:
>
> if (unlikely(bio_sectors(bio) > q->max_hw_sectors)) {
> printk("bio too big device %s (%u > %u)\n",
> bdevname(bio->bi_bdev, b),
> bio_sectors(bio),
> q->max_hw_sectors);
> goto end_io;
> }
>
> So if we try to push a too large buffer down with submit_bh() we get a
> failure.
Only partly, you may be violating a number of other restrictions (size
is many things, not just length of the data).
--
Jens Axboe
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-31 6:56 ` Jens Axboe
@ 2007-08-31 7:03 ` Christoph Lameter
2007-08-31 7:11 ` Jens Axboe
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2007-08-31 7:03 UTC (permalink / raw)
To: Jens Axboe
Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
Christoph Hellwig, Mel Gorman, William Lee Irwin III,
David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On Fri, 31 Aug 2007, Jens Axboe wrote:
> > So if we try to push a too large buffer down with submit_bh() we get a
> > failure.
>
> Only partly, you may be violating a number of other restrictions (size
> is many things, not just length of the data).
Could you be more specific?
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-31 7:03 ` Christoph Lameter
@ 2007-08-31 7:11 ` Jens Axboe
2007-08-31 7:17 ` Christoph Lameter
0 siblings, 1 reply; 56+ messages in thread
From: Jens Axboe @ 2007-08-31 7:11 UTC (permalink / raw)
To: Christoph Lameter
Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
Christoph Hellwig, Mel Gorman, William Lee Irwin III,
David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On Fri, Aug 31 2007, Christoph Lameter wrote:
> On Fri, 31 Aug 2007, Jens Axboe wrote:
>
> > > So if we try to push a too large buffer down with submit_bh() we get a
> > > failure.
> >
> > Only partly, you may be violating a number of other restrictions (size
> > is many things, not just length of the data).
>
> Could you be more specific?
Size of a single segment, for instance. Or if the bio crosses a dma
boundary. If your block is 64kb and the maximum segment size is 32kb,
then you would need to clone the bio and split it into two.
Things like that. This isn't a problem with single page requests, as we
based the lower possible boundaries on that.
--
Jens Axboe
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-31 7:11 ` Jens Axboe
@ 2007-08-31 7:17 ` Christoph Lameter
2007-08-31 7:26 ` Jens Axboe
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2007-08-31 7:17 UTC (permalink / raw)
To: Jens Axboe
Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
Christoph Hellwig, Mel Gorman, William Lee Irwin III,
David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On Fri, 31 Aug 2007, Jens Axboe wrote:
> > Could you be more specific?
>
> Size of a single segment, for instance. Or if the bio crosses a dma
> boundary. If your block is 64kb and the maximum segment size is 32kb,
> then you would need to clone the bio and split it into two.
A DMA boundary cannot be crossed AFAIK. The compound pages are aligned to
the power of two boundaries and the page allocator will not create pages
that cross the zone boundaries.
It looks like the code will correctly signal a failure if you try to write
a 64k block on a device with a maximum segment size of 32k. Isn't this
okay? One would not want to use a larger block size than supported by the
underlying hardware?
> Things like that. This isn't a problem with single page requests, as we
> based the lower possible boundaries on that.
submit_bh() is used to submit a single buffer and I think that was our
main concern here.
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-31 7:17 ` Christoph Lameter
@ 2007-08-31 7:26 ` Jens Axboe
2007-08-31 7:33 ` Christoph Lameter
0 siblings, 1 reply; 56+ messages in thread
From: Jens Axboe @ 2007-08-31 7:26 UTC (permalink / raw)
To: Christoph Lameter
Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
Christoph Hellwig, Mel Gorman, William Lee Irwin III,
David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On Fri, Aug 31 2007, Christoph Lameter wrote:
> On Fri, 31 Aug 2007, Jens Axboe wrote:
>
> > > Could you be more specific?
> >
> > Size of a single segment, for instance. Or if the bio crosses a dma
> > boundary. If your block is 64kb and the maximum segment size is 32kb,
> > then you would need to clone the bio and split it into two.
>
> A DMA boundary cannot be crossed AFAIK. The compound pages are aligned to
> the power of two boundaries and the page allocator will not create pages
> that cross the zone boundaries.
With a 64k page and a dma boundary of 0x7fff, that's two segments.
> It looks like the code will correctly signal a failure if you try to write
> a 64k block on a device with a maximum segment size of 32k. Isnt this
> okay? One would not want to use a larger block size than supported by the
> underlying hardware?
That's just the size in sectors limitation again. And that also needs to
be handled; the fact that it currently errors out is reassuring, but it is
definitely not a long-term solution. You don't want to knowingly set up such
a system where the fs block size is larger than what the hardware would
want, but it should work. You could be moving hardware around, for
recovery or otherwise.
> > Things like that. This isn't a problem with single page requests, as we
> > based the lower possible boundaries on that.
>
> submit_bh() is used to submit a single buffer and I think that was our
> main concern here.
And how large can that be?
--
Jens Axboe
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-31 7:26 ` Jens Axboe
@ 2007-08-31 7:33 ` Christoph Lameter
2007-08-31 7:43 ` Jens Axboe
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2007-08-31 7:33 UTC (permalink / raw)
To: Jens Axboe
Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
Christoph Hellwig, Mel Gorman, William Lee Irwin III,
David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On Fri, 31 Aug 2007, Jens Axboe wrote:
> > A DMA boundary cannot be crossed AFAIK. The compound pages are aligned to
> > the power of two boundaries and the page allocator will not create pages
> > that cross the zone boundaries.
>
> With a 64k page and a dma boundary of 0x7fff, that's two segments.
Ok, so DMA memory restrictions that don't conform to the DMA zones? The
example is a bit weird. DMA only to the first 32k of memory? If the limit
would be higher like 16MB then we would not have an issue. Is there really
a device that can only do I/O to the first 32k of memory?
How do we split that up today? We could add processing to submit_bio to
check for the boundary and create two bios.
> > submit_bh() is used to submit a single buffer and I think that was our
> > main concern here.
>
> And how large can that be?
As large as mkxxxfs allowed it to be. For XFS and extX with the current
patchset 32k is the limit (64k with the fixes to ext2) but a new
filesystem could theoretically use a larger blocksize.
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-31 7:33 ` Christoph Lameter
@ 2007-08-31 7:43 ` Jens Axboe
2007-08-31 7:52 ` Christoph Lameter
0 siblings, 1 reply; 56+ messages in thread
From: Jens Axboe @ 2007-08-31 7:43 UTC (permalink / raw)
To: Christoph Lameter
Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
Christoph Hellwig, Mel Gorman, William Lee Irwin III,
David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On Fri, Aug 31 2007, Christoph Lameter wrote:
> On Fri, 31 Aug 2007, Jens Axboe wrote:
>
> > > A DMA boundary cannot be crossed AFAIK. The compound pages are aligned to
> > > the power of two boundaries and the page allocator will not create pages
> > > that cross the zone boundaries.
> >
> > With a 64k page and a dma boundary of 0x7fff, that's two segments.
>
> Ok so DMA memory restrictions not conforming to the DMA zones? The
> example is a bit weird. DMA only to the first 32k of memory? If the limit
> would be higher like 16MB then we would not have an issue. Is there really
> a device that can only do I/O to the first 32k of memory?
They have nothing to do with each other, you are mixing things up. It
has nothing to do with the device being able to dma into that memory or
not, we have fine existing infrastructure to handle that. But different
hardware have different characteristics on what a single segment is. You
can say "a single segment cannot cross a 32kb boundary". So from the
example above, your single 64k page may need to be split into two
segments. Or it could have a maximum segment size of 32k, in which case
it would have to be split as well.
Do you see what I mean now?
> How do we split that up today? We could add processing to submit_bio
> to check for the boundary and create two bios.
But we do not split them up today - see what I wrote! Today we impose
the restriction that a device must be able to handle a single "normal"
page, and if it can't do that, it has to split it up itself.
But yes, you would have to create some out-of-line function to use
bio_split() until you have chopped things down enough. It's not a good
thing for performance naturally, but if we consider this a "just make it
work" fallback, I don't think it's too bad. You want to log when it
happens, though, so people realize that it is happening.
> > > submit_bh() is used to submit a single buffer and I think that was
> > > our main concern here.
> >
> > And how large can that be?
>
> As large as mkxxxfs allowed it to be. For XFS and extX with the
> current patchset 32k is the limit (64k with the fixes to ext2) but a
> new filesystem could theoretically use a larger blocksize.
OK, since it goes direct to bio anyway, it can be handled there.
--
Jens Axboe
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-31 7:43 ` Jens Axboe
@ 2007-08-31 7:52 ` Christoph Lameter
2007-08-31 8:12 ` Jens Axboe
2007-08-31 8:36 ` Dmitry Monakhov
0 siblings, 2 replies; 56+ messages in thread
From: Christoph Lameter @ 2007-08-31 7:52 UTC (permalink / raw)
To: Jens Axboe
Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
Christoph Hellwig, Mel Gorman, William Lee Irwin III,
David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On Fri, 31 Aug 2007, Jens Axboe wrote:
> They have nothing to do with each other, you are mixing things up. It
> has nothing to do with the device being able to dma into that memory or
> not, we have fine existing infrastructure to handle that. But different
> hardware have different characteristics on what a single segment is. You
> can say "a single segment cannot cross a 32kb boundary". So from the
> example above, your single 64k page may need to be split into two
> segments. Or it could have a maximum segment size of 32k, in which case
> it would have to be split as well.
>
> Do you see what I mean now?
Ok. So another solution may be to limit the blocksizes that can be used
with a device?
> > How do we split that up today? We could add processing to submit_bio
> > to check for the boundary and create two bios.
>
> But we do not split them up today - see what I wrote! Today we impose
> the restriction that a device must be able to handle a single "normal"
> page, and if it can't do that, it has to split it up itself.
>
> But yes, you would have to create some out-of-line function to use
> bio_split() until you have chopped things down enough. It's not a good
> thing for performance naturally, but if we consider this a "just make it
> work" fallback, I don't think it's too bad. You want to make a note of
> that it is happening though, so people realize that it is happening.
Hmmmm.. We could keep the existing scheme too and check that device
drivers split things up if they are too large? Isn't it possible today
to create a huge bio of 2M for huge pages and send it to a device?
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-31 7:52 ` Christoph Lameter
@ 2007-08-31 8:12 ` Jens Axboe
2007-08-31 15:22 ` Christoph Lameter
2007-08-31 8:36 ` Dmitry Monakhov
1 sibling, 1 reply; 56+ messages in thread
From: Jens Axboe @ 2007-08-31 8:12 UTC (permalink / raw)
To: Christoph Lameter
Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
Christoph Hellwig, Mel Gorman, William Lee Irwin III,
David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On Fri, Aug 31 2007, Christoph Lameter wrote:
> On Fri, 31 Aug 2007, Jens Axboe wrote:
>
> > They have nothing to do with each other, you are mixing things up. It
> > has nothing to do with the device being able to dma into that memory or
> > not, we have fine existing infrastructure to handle that. But different
> > hardware have different characteristics on what a single segment is. You
> > can say "a single segment cannot cross a 32kb boundary". So from the
> > example above, your single 64k page may need to be split into two
> > segments. Or it could have a maximum segment size of 32k, in which case
> > it would have to be split as well.
> >
> > Do you see what I mean now?
>
> Ok. So another solution maybe to limit the blocksizes that can be used
> with a device?
That'd work for creation, but not for moving things around.
> > > How do we split that up today? We could add processing to submit_bio
> > > to check for the boundary and create two bios.
> >
> > But we do not split them up today - see what I wrote! Today we impose
> > the restriction that a device must be able to handle a single "normal"
> > page, and if it can't do that, it has to split it up itself.
> >
> > But yes, you would have to create some out-of-line function to use
> > bio_split() until you have chopped things down enough. It's not a good
> > thing for performance naturally, but if we consider this a "just make it
> > work" fallback, I don't think it's too bad. You want to make a note of
> > that it is happening though, so people realize that it is happening.
>
> Hmmmm.. We could keep the existing scheme too and check that device
> drivers split things up if they are too large? Isnt it possible today
> to create a huge bio of 2M for huge pages and send it to a device?
Not sure, aren't the constituents of compound pages the basis for IO?
--
Jens Axboe
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-31 7:52 ` Christoph Lameter
2007-08-31 8:12 ` Jens Axboe
@ 2007-08-31 8:36 ` Dmitry Monakhov
2007-08-31 15:28 ` Christoph Lameter
1 sibling, 1 reply; 56+ messages in thread
From: Dmitry Monakhov @ 2007-08-31 8:36 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jens Axboe, torvalds, linux-fsdevel, linux-kernel,
Christoph Hellwig, Mel Gorman, William Lee Irwin III,
David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On 00:52 Fri 31 Aug , Christoph Lameter wrote:
> On Fri, 31 Aug 2007, Jens Axboe wrote:
>
> > They have nothing to do with each other, you are mixing things up. It
> > has nothing to do with the device being able to dma into that memory or
> > not, we have fine existing infrastructure to handle that. But different
> > hardware have different characteristics on what a single segment is. You
> > can say "a single segment cannot cross a 32kb boundary". So from the
> > example above, your single 64k page may need to be split into two
> > segments. Or it could have a maximum segment size of 32k, in which case
> > it would have to be split as well.
> >
> > Do you see what I mean now?
>
> Ok. So another solution maybe to limit the blocksizes that can be used
> with a device?
IMHO it is not good, because after an fs is created with a big blocksize
its image can't be used on other devices.
We may just rewrite submit_bh similar to drivers/md/dm-io.c:do_region
with the following pseudocode:
remaining = super_page_size();
while (remaining) {
	init_bio(bio);
	/* try to add as many pages as possible */
	while (remaining) {
		dp->get_page(dp, &page, &len, &offset);
		len = min(len, to_bytes(remaining));
		if (!bio_add_page(bio, page, len, offset))
			break;
		offset = 0;
		remaining -= to_sector(len);
		dp->next_page(dp);
	}
	atomic_inc(&io->count);
	submit_bio(rw, bio);
}
> > > How do we split that up today? We could add processing to submit_bio
> > > to check for the boundary and create two bios.
> >
> > But we do not split them up today - see what I wrote! Today we impose
> > the restriction that a device must be able to handle a single "normal"
> > page, and if it can't do that, it has to split it up itself.
> >
> > But yes, you would have to create some out-of-line function to use
> > bio_split() until you have chopped things down enough. It's not a good
> > thing for performance naturally, but if we consider this a "just make it
> > work" fallback, I don't think it's too bad. You want to make a note of
> > that it is happening though, so people realize that it is happening.
>
> Hmmmm.. We could keep the existing scheme too and check that device
> drivers split things up if they are too large? Isnt it possible today
> to create a huge bio of 2M for huge pages and send it to a device?
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-31 8:12 ` Jens Axboe
@ 2007-08-31 15:22 ` Christoph Lameter
2007-08-31 16:35 ` Jörn Engel
2007-08-31 19:00 ` Jens Axboe
0 siblings, 2 replies; 56+ messages in thread
From: Christoph Lameter @ 2007-08-31 15:22 UTC (permalink / raw)
To: Jens Axboe
Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
Christoph Hellwig, Mel Gorman, William Lee Irwin III,
David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On Fri, 31 Aug 2007, Jens Axboe wrote:
> > Ok. So another solution maybe to limit the blocksizes that can be used
> > with a device?
>
> That'd work for creation, but not for moving things around.
What do you mean by moving things around? Creation binds a filesystem to a
device.
> > Hmmmm.. We could keep the existing scheme too and check that device
> > drivers split things up if they are too large? Isnt it possible today
> > to create a huge bio of 2M for huge pages and send it to a device?
>
> Not sure, aren't the constituents of compound pages the basis for IO?
get_user_pages() serializes compound pages into the base pages. But doesn't
the I/O layer coalesce these later into 2M chunks again?
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-31 8:36 ` Dmitry Monakhov
@ 2007-08-31 15:28 ` Christoph Lameter
0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2007-08-31 15:28 UTC (permalink / raw)
To: Dmitry Monakhov
Cc: Jens Axboe, torvalds, linux-fsdevel, linux-kernel,
Christoph Hellwig, Mel Gorman, William Lee Irwin III,
David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On Fri, 31 Aug 2007, Dmitry Monakhov wrote:
> > Ok. So another solution maybe to limit the blocksizes that can be used
> > with a device?
> IMHO It is not good because after fs was created with big blksize it's image
> cant be used on other devices.
Ok so a raw copy of the partition would do this?
> We may just rewrite submit_bh simular to drivers/md/dm-io.c:do_region
> with following pseudocode:
>
> remaining = super_page_size();
That would be compound_size(page)
> while (remaining) {
> init_bio(bio);
> /*Try and add as many pages as possible*/
This seems to be doing the same as get_user_pages() serializing the
compound page.
> while (remaining) {
> dp->get_page(dp, &page, &len, &offset);
> len = min(len,
> to_bytes(remaining));
> if(!bio_add_page(bio, page, len, offset))
> break;
> offset = 0;
> remaining -= to_sector(len);
> dp->next_page(dp);
> }
> atomic_inc(&io->count);
> submit_bio(rw, bio);
> }
Another solution may be to not serialize but instead determine the maximum
segment length and generate bios that reference various subsections of the
compound page of that length? That way you do not serialize and later
coalesce again.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-31 15:22 ` Christoph Lameter
@ 2007-08-31 16:35 ` Jörn Engel
2007-08-31 19:00 ` Jens Axboe
1 sibling, 0 replies; 56+ messages in thread
From: Jörn Engel @ 2007-08-31 16:35 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jens Axboe, Dmitry Monakhov, torvalds, linux-fsdevel,
linux-kernel, Christoph Hellwig, Mel Gorman,
William Lee Irwin III, David Chinner, Badari Pulavarty,
Maxim Levitsky, Fengguang Wu, swin wang, totty.lu, H. Peter Anvin,
Eric W. Biederman
On Fri, 31 August 2007 08:22:45 -0700, Christoph Lameter wrote:
>
> What do you mean by moving things around? Creation binds a filesystem to a
> device.
Create the filesystem on a USB key, then move it to the next machine,
I suppose.
Or on any other movable medium, including disks, nbd, iSCSI,...
Jörn
--
Doubt is not a pleasant condition, but certainty is an absurd one.
-- Voltaire
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [11/36] Use page_cache_xxx in fs/buffer.c
2007-08-31 15:22 ` Christoph Lameter
2007-08-31 16:35 ` Jörn Engel
@ 2007-08-31 19:00 ` Jens Axboe
1 sibling, 0 replies; 56+ messages in thread
From: Jens Axboe @ 2007-08-31 19:00 UTC (permalink / raw)
To: Christoph Lameter
Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
Christoph Hellwig, Mel Gorman, William Lee Irwin III,
David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman
On Fri, Aug 31 2007, Christoph Lameter wrote:
> On Fri, 31 Aug 2007, Jens Axboe wrote:
>
> > > Ok. So another solution maybe to limit the blocksizes that can be used
> > > with a device?
> >
> > That'd work for creation, but not for moving things around.
>
> What do you mean by moving things around? Creation binds a filesystem to a
> device.
Only the bottom part. Change controller, move disk, whatever. There are
lots of ways to change part of the IO path.
> > > Hmmmm.. We could keep the existing scheme too and check that device
> > > drivers split things up if they are too large? Isnt it possible today
> > > to create a huge bio of 2M for huge pages and send it to a device?
> >
> > Not sure, aren't the constituents of compound pages the basis for IO?
>
> get_user_pages() serializes compound pages into the base pages. But doesnt
> the I/O layer coalesce these later into 2M chunks again?
You pretty much hit the nail on the head there yourself. The io layer
_may_ coalesce them all together, but it may also stop at an arbitrary
point and put the remainder in another request.
This situation is different from submitting one huge piece; the above is
what has always happened regardless of whether the origin happens to be
a compound page or not.
--
Jens Axboe
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC 1/4] Large Blocksize support for Ext2/3/4
2007-08-30 0:47 ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
2007-08-30 0:59 ` Christoph Lameter
@ 2007-09-01 0:01 ` Mingming Cao
2007-09-01 0:12 ` [RFC 1/2] JBD: slab management support for large block(>8k) Mingming Cao
2007-09-01 0:12 ` [RFC 2/2] JBD: blocks reservation fix for large block support Mingming Cao
3 siblings, 0 replies; 56+ messages in thread
From: Mingming Cao @ 2007-09-01 0:01 UTC (permalink / raw)
To: clameter; +Cc: linux-fsdevel, adilger, sho, ext4 development, linux-kernel
On Wed, 2007-08-29 at 17:47 -0700, Mingming Cao wrote:
> Just rebase to 2.6.23-rc4 and against the ext4 patch queue. Compile tested only.
>
> Next steps:
> Need a e2fsprogs changes to able test this feature. As mkfs needs to be
> educated not assuming rec_len to be blocksize all the time.
> Will try it with Christoph Lameter's large block patch next.
>
Two problems were found when testing largeblock on ext3. Patches to
follow.
Good news is, with your changes, plus all these extN changes, I am able
to run ext2/3/4 with 64k block size, tested on x86 and ppc64 with 4k
page size. fsx test runs fine for an hour on ext3 with 16k blocksize on
x86:-)
Mingming
^ permalink raw reply [flat|nested] 56+ messages in thread
* [RFC 1/2] JBD: slab management support for large block(>8k)
2007-08-30 0:47 ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
2007-08-30 0:59 ` Christoph Lameter
2007-09-01 0:01 ` Mingming Cao
@ 2007-09-01 0:12 ` Mingming Cao
2007-09-01 18:39 ` Christoph Hellwig
2007-09-01 0:12 ` [RFC 2/2] JBD: blocks reservation fix for large block support Mingming Cao
3 siblings, 1 reply; 56+ messages in thread
From: Mingming Cao @ 2007-09-01 0:12 UTC (permalink / raw)
To: linux-fsdevel; +Cc: adilger, sho, ext4 development, linux-kernel, clameter
From clameter:
Teach jbd/jbd2 slab management to support >8k block size. Without this, it refused to mount on >8k ext3.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Index: my2.6/fs/jbd/journal.c
===================================================================
--- my2.6.orig/fs/jbd/journal.c 2007-08-30 18:40:02.000000000 -0700
+++ my2.6/fs/jbd/journal.c 2007-08-31 11:01:18.000000000 -0700
@@ -1627,16 +1627,17 @@ void * __jbd_kmalloc (const char *where,
* jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
* and allocate frozen and commit buffers from these slabs.
*
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
+ * (Note: We only seem to need the definitions here for the SLAB_DEBUG
+ * case. In non debug operations SLUB will find the corresponding kmalloc
+ * cache and create an alias. --clameter)
*/
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
+#define JBD_MAX_SLABS 7
+#define JBD_SLAB_INDEX(size) get_order((size) << (PAGE_SHIFT - 10))
static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
static const char *jbd_slab_names[JBD_MAX_SLABS] = {
- "jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
+ "jbd_1k", "jbd_2k", "jbd_4k", "jbd_8k",
+ "jbd_16k", "jbd_32k", "jbd_64k"
};
static void journal_destroy_jbd_slabs(void)
Index: my2.6/fs/jbd2/journal.c
===================================================================
--- my2.6.orig/fs/jbd2/journal.c 2007-08-30 18:40:02.000000000 -0700
+++ my2.6/fs/jbd2/journal.c 2007-08-31 11:04:37.000000000 -0700
@@ -1639,16 +1639,18 @@ void * __jbd2_kmalloc (const char *where
* jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
* and allocate frozen and commit buffers from these slabs.
*
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
+ * (Note: We only seem to need the definitions here for the SLAB_DEBUG
+ * case. In non debug operations SLUB will find the corresponding kmalloc
+ * cache and create an alias. --clameter)
*/
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
+#define JBD_MAX_SLABS 7
+#define JBD_SLAB_INDEX(size) get_order((size) << (PAGE_SHIFT - 10))
static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
static const char *jbd_slab_names[JBD_MAX_SLABS] = {
- "jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
+ "jbd2_1k", "jbd2_2k", "jbd2_4k", "jbd2_8k",
+ "jbd2_16k", "jbd2_32k", "jbd2_64k"
};
static void jbd2_journal_destroy_jbd_slabs(void)
^ permalink raw reply [flat|nested] 56+ messages in thread
* [RFC 2/2] JBD: blocks reservation fix for large block support
2007-08-30 0:47 ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
` (2 preceding siblings ...)
2007-09-01 0:12 ` [RFC 1/2] JBD: slab management support for large block(>8k) Mingming Cao
@ 2007-09-01 0:12 ` Mingming Cao
3 siblings, 0 replies; 56+ messages in thread
From: Mingming Cao @ 2007-09-01 0:12 UTC (permalink / raw)
To: ext4 development; +Cc: adilger, sho, linux-kernel, clameter, linux-fsdevel
The blocks per page could be less than or equal to 1 with the large block
support in the VM. The patch fixes the calculation of the number of blocks
to reserve in the journal for the case blocksize > pagesize.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Index: my2.6/fs/jbd/journal.c
===================================================================
--- my2.6.orig/fs/jbd/journal.c 2007-08-31 13:27:16.000000000 -0700
+++ my2.6/fs/jbd/journal.c 2007-08-31 13:28:18.000000000 -0700
@@ -1611,7 +1611,12 @@ void journal_ack_err(journal_t *journal)
int journal_blocks_per_page(struct inode *inode)
{
- return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+ int bits = PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+ if (bits > 0)
+ return 1 << bits;
+ else
+ return 1;
}
/*
Index: my2.6/fs/jbd2/journal.c
===================================================================
--- my2.6.orig/fs/jbd2/journal.c 2007-08-31 13:32:21.000000000 -0700
+++ my2.6/fs/jbd2/journal.c 2007-08-31 13:32:30.000000000 -0700
@@ -1612,7 +1612,12 @@ void jbd2_journal_ack_err(journal_t *jou
int jbd2_journal_blocks_per_page(struct inode *inode)
{
- return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+ int bits = PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+ if (bits > 0)
+ return 1 << bits;
+ else
+ return 1;
}
/*
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [00/36] Large Blocksize Support V6
[not found] ` <Pine.LNX.4.64.0708281254070.16473@schroedinger.engr.sgi.com>
@ 2007-09-01 1:11 ` Christoph Lameter
0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2007-09-01 1:11 UTC (permalink / raw)
To: Christoph Hellwig
Cc: torvalds, linux-fsdevel, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
totty.lu, H. Peter Anvin, joern, Eric W. Biederman, Mingming Cao
Thanks to some help from Mingming Cao we now have support for extX with up
to 64k blocksize. There were several issues in the jbd layer. (The ext2
patch that Christoph complained about was dropped.)
The patchset can be tested (assuming one has a current git tree)
git checkout -b largeblock
git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git largeblock
... Fiddle around with large blocksize functionality....
git checkout master
... Back to Linus' tree.
git branch -D largeblock
... Get rid of it.
commit ed541c23b8e71a0217fd96d1b421992fdd7519df
Author: Mingming Cao <cmm@us.ibm.com>
JBD: blocks reservation fix for large block support
commit a1eaa33cf1600f18e961f1cf5c87820bca44df08
Author: Christoph Lameter <clameter@sgi.com>
Teach jbd/jbd2 slab management to support >8k block size.
commit 8199976e04333d66202edcaec6cef46771ed194e
Author: Christoph Lameter <clameter@sgi.com>
Do not use f_mapping in simple_prepare_write()
commit ac4d742ff3b3526d4c22d5b42e9f9fcc99881a8c
Author: Mingming Cao <cmm@us.ibm.com>
ext4: fix rec_len overflow with 64KB block size
commit f336a2d00e7c79500ff30fad40f6e3090319cbe7
Author: Mingming Cao <cmm@us.ibm.com>
ext3: fix rec_len overflow with 64KB block size
commit b0c1b74d42cce96c592f8d13b7b842a3e07b0273
Author: Christoph Lameter <christoph@qirst.com>
ext2: fix rec_len overflow with 64KB block size
commit 01229e6a2e84178a8b8467930c113a0096c069f2
Author: Mingming Cao <cmm@us.ibm.com>
Large Blocksize support for Ext2/3/4
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC 1/2] JBD: slab management support for large block(>8k)
2007-09-01 0:12 ` [RFC 1/2] JBD: slab management support for large block(>8k) Mingming Cao
@ 2007-09-01 18:39 ` Christoph Hellwig
2007-09-02 11:40 ` Christoph Lameter
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Hellwig @ 2007-09-01 18:39 UTC (permalink / raw)
To: Mingming Cao
Cc: linux-fsdevel, adilger, sho, ext4 development, linux-kernel,
clameter
On Fri, Aug 31, 2007 at 05:12:18PM -0700, Mingming Cao wrote:
> >From clameter:
> Teach jbd/jbd2 slab management to support >8k block size. Without this, it refused to mount on >8k ext3.
But the real fix is to kill this code. We can't send down slab pages
down the block layer without breaking iscsi or aoe. And this code is
only used in so rare cases that all the normal testing won't hit it.
Very bad combination.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [00/36] Large Blocksize Support V6
[not found] <20070828190551.415127746@sgi.com>
` (2 preceding siblings ...)
[not found] ` <20070828192034.GA13883@lst.de>
@ 2007-09-01 19:17 ` Peter Zijlstra
2007-09-02 11:44 ` Christoph Lameter
3 siblings, 1 reply; 56+ messages in thread
From: Peter Zijlstra @ 2007-09-01 19:17 UTC (permalink / raw)
To: clameter; +Cc: torvalds, linux-fsdevel, linux-kernel, Andrea Arcangeli
On Tue, 2007-08-28 at 12:05 -0700, clameter@sgi.com wrote:
> Todo/Issues:
- reclaim
by mixing large order pages into the regular lru, page aging gets rather
unfair.
One possible solution to this is address_space based reclaim, something
which might also be beneficial for containers.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC 1/2] JBD: slab management support for large block(>8k)
2007-09-01 18:39 ` Christoph Hellwig
@ 2007-09-02 11:40 ` Christoph Lameter
2007-09-02 15:28 ` Christoph Hellwig
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2007-09-02 11:40 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Mingming Cao, linux-fsdevel, adilger, sho, ext4 development,
linux-kernel
On Sat, 1 Sep 2007, Christoph Hellwig wrote:
> On Fri, Aug 31, 2007 at 05:12:18PM -0700, Mingming Cao wrote:
> > >From clameter:
> > Teach jbd/jbd2 slab management to support >8k block size. Without this, it refused to mount on >8k ext3.
>
>
> But the real fix is to kill this code. We can't send down slab pages
> down the block layer without breaking iscsi or aoe. And this code is
> only used in so rare cases that all the normal testing won't hit it.
> Very bad combination.
We are doing what you describe right now. So the current code is broken?
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [00/36] Large Blocksize Support V6
2007-09-01 19:17 ` Peter Zijlstra
@ 2007-09-02 11:44 ` Christoph Lameter
0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2007-09-02 11:44 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: torvalds, linux-fsdevel, linux-kernel, Andrea Arcangeli
On Sat, 1 Sep 2007, Peter Zijlstra wrote:
> by mixing large order pages into the regular lru, page aging gets rather
> unfair.
Not in general, only for particular loads. On average this is okay. It is
consistent to age the whole block and not just a part of it.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC 1/2] JBD: slab management support for large block(>8k)
2007-09-02 11:40 ` Christoph Lameter
@ 2007-09-02 15:28 ` Christoph Hellwig
2007-09-03 7:55 ` Christoph Lameter
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Hellwig @ 2007-09-02 15:28 UTC (permalink / raw)
To: Christoph Lameter
Cc: Christoph Hellwig, Mingming Cao, linux-fsdevel, adilger, sho,
ext4 development, linux-kernel
On Sun, Sep 02, 2007 at 04:40:21AM -0700, Christoph Lameter wrote:
> On Sat, 1 Sep 2007, Christoph Hellwig wrote:
>
> > On Fri, Aug 31, 2007 at 05:12:18PM -0700, Mingming Cao wrote:
> > > >From clameter:
> > > Teach jbd/jbd2 slab management to support >8k block size. Without this, it refused to mount on >8k ext3.
> >
> >
> > But the real fix is to kill this code. We can't send down slab pages
> > down the block layer without breaking iscsi or aoe. And this code is
> > only used in so rare cases that all the normal testing won't hit it.
> > Very bad combination.
>
> We are doing what you describe right now. So the current code is broken?
Yes.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC 1/2] JBD: slab management support for large block(>8k)
2007-09-02 15:28 ` Christoph Hellwig
@ 2007-09-03 7:55 ` Christoph Lameter
2007-09-03 13:40 ` Christoph Hellwig
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2007-09-03 7:55 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Mingming Cao, linux-fsdevel, adilger, sho, ext4 development,
linux-kernel
On Sun, 2 Sep 2007, Christoph Hellwig wrote:
> > We are doing what you describe right now. So the current code is broken?
> Yes.
How about getting rid of the slabs there and using kmalloc? Kmalloc in mm
(and therefore hopefully 2.6.24) will convert kmallocs > PAGE_SIZE to page
allocator calls. Not sure what to do about the 1k and 2k requests though.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC 1/2] JBD: slab management support for large block(>8k)
2007-09-03 7:55 ` Christoph Lameter
@ 2007-09-03 13:40 ` Christoph Hellwig
2007-09-03 19:31 ` Christoph Lameter
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Hellwig @ 2007-09-03 13:40 UTC (permalink / raw)
To: Christoph Lameter
Cc: Christoph Hellwig, Mingming Cao, linux-fsdevel, adilger, sho,
ext4 development, linux-kernel
On Mon, Sep 03, 2007 at 12:55:04AM -0700, Christoph Lameter wrote:
> On Sun, 2 Sep 2007, Christoph Hellwig wrote:
>
> > > We are doing what you describe right now. So the current code is broken?
> > Yes.
>
> How about getting rid of the slabs there and use kmalloc? Kmalloc in mm
> (and therfore hopefully 2.6.24) will convert kmallocs > PAGE_SIZE to page
> allocator calls. Not sure what to do about the 1k and 2k requests though.
The problem is that we must never use kmalloc pages, so we always need
to request a page or more for these. Better to use get_free_page directly;
that's how I fixed it in XFS a while ago.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC 1/2] JBD: slab management support for large block(>8k)
2007-09-03 13:40 ` Christoph Hellwig
@ 2007-09-03 19:31 ` Christoph Lameter
2007-09-03 19:33 ` Christoph Hellwig
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2007-09-03 19:31 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Mingming Cao, linux-fsdevel, adilger, sho, ext4 development,
linux-kernel
On Mon, 3 Sep 2007, Christoph Hellwig wrote:
> > How about getting rid of the slabs there and use kmalloc? Kmalloc in mm
> > (and therfore hopefully 2.6.24) will convert kmallocs > PAGE_SIZE to page
> > allocator calls. Not sure what to do about the 1k and 2k requests though.
>
> The problem is that we must never use kmalloc pages, so we always need
> to request a page or more for these. Better to use get_free_page directly,
> that's how I fixed it in XFS a while ago.
So you'd be fine with replacing the allocs with
get_free_pages(GFP_xxx, get_order(size)) ?
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC 1/2] JBD: slab management support for large block(>8k)
2007-09-03 19:31 ` Christoph Lameter
@ 2007-09-03 19:33 ` Christoph Hellwig
2007-09-14 18:53 ` [PATCH] JBD slab cleanups Mingming Cao
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Hellwig @ 2007-09-03 19:33 UTC (permalink / raw)
To: Christoph Lameter
Cc: Christoph Hellwig, Mingming Cao, linux-fsdevel, adilger, sho,
ext4 development, linux-kernel
On Mon, Sep 03, 2007 at 12:31:49PM -0700, Christoph Lameter wrote:
> So you'd be fine with replacing the allocs with
>
> get_free_pages(GFP_xxx, get_order(size)) ?
Yes. And rip out all that code related to setting up the slabs. I plan
to add WARN_ONs to bio_add_page and friends to detect further usage of
slab pages if there is any.
^ permalink raw reply [flat|nested] 56+ messages in thread
* [PATCH] JBD slab cleanups
2007-09-03 19:33 ` Christoph Hellwig
@ 2007-09-14 18:53 ` Mingming Cao
2007-09-14 18:58 ` Christoph Lameter
2007-09-17 19:29 ` Mingming Cao
0 siblings, 2 replies; 56+ messages in thread
From: Mingming Cao @ 2007-09-14 18:53 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Christoph Lameter, linux-fsdevel, ext4 development, linux-kernel
jbd/jbd2: Replace slab allocations with page cache allocations
From: Christoph Lameter <clameter@sgi.com>
JBD should not pass slab pages down to the block layer.
Use page allocator pages instead. This will also prepare
JBD for the large blocksize patchset.
Tested on 2.6.23-rc6; fsx runs fine.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/jbd/checkpoint.c | 2
fs/jbd/commit.c | 6 +-
fs/jbd/journal.c | 107 ++++---------------------------------------------
fs/jbd/transaction.c | 10 ++--
fs/jbd2/checkpoint.c | 2
fs/jbd2/commit.c | 6 +-
fs/jbd2/journal.c | 109 ++++----------------------------------------------
fs/jbd2/transaction.c | 18 ++++----
include/linux/jbd.h | 23 +++++++++-
include/linux/jbd2.h | 28 ++++++++++--
10 files changed, 83 insertions(+), 228 deletions(-)
Index: linux-2.6.23-rc5/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd/journal.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd/journal.c 2007-09-13 13:45:39.000000000 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);
/*
* Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
char *tmp;
jbd_unlock_bh_state(bh_in);
- tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
+ tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
jbd_lock_bh_state(bh_in);
if (jh_in->b_frozen_data) {
- jbd_slab_free(tmp, bh_in->b_size);
+ jbd_free(tmp, bh_in->b_size);
goto repeat;
}
@@ -679,7 +678,7 @@ static journal_t * journal_init_common (
/* Set up a default-sized revoke table for the new mount. */
err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
if (err) {
- kfree(journal);
+ jbd_kfree(journal);
goto fail;
}
return journal;
@@ -728,7 +727,7 @@ journal_t * journal_init_dev(struct bloc
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
- kfree(journal);
+ jbd_kfree(journal);
journal = NULL;
goto out;
}
@@ -782,7 +781,7 @@ journal_t * journal_init_inode (struct i
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
- kfree(journal);
+ jbd_kfree(journal);
return NULL;
}
@@ -791,7 +790,7 @@ journal_t * journal_init_inode (struct i
if (err) {
printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
__FUNCTION__);
- kfree(journal);
+ jbd_kfree(journal);
return NULL;
}
@@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
}
}
- /*
- * Create a slab for this blocksize
- */
- err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
- if (err)
- return err;
-
/* Let the recovery code check whether it needs to recover any
* data from the journal. */
if (journal_recover(journal))
@@ -1166,7 +1158,7 @@ void journal_destroy(journal_t *journal)
if (journal->j_revoke)
journal_destroy_revoke(journal);
kfree(journal->j_wbuf);
- kfree(journal);
+ jbd_kfree(journal);
}
@@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
}
/*
- * Simple support for retrying memory allocations. Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
- return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
- "jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
-};
-
-static void journal_destroy_jbd_slabs(void)
-{
- int i;
-
- for (i = 0; i < JBD_MAX_SLABS; i++) {
- if (jbd_slab[i])
- kmem_cache_destroy(jbd_slab[i]);
- jbd_slab[i] = NULL;
- }
-}
-
-static int journal_create_jbd_slab(size_t slab_size)
-{
- int i = JBD_SLAB_INDEX(slab_size);
-
- BUG_ON(i >= JBD_MAX_SLABS);
-
- /*
- * Check if we already have a slab created for this size
- */
- if (jbd_slab[i])
- return 0;
-
- /*
- * Create a slab and force alignment to be same as slabsize -
- * this will make sure that allocations won't cross the page
- * boundary.
- */
- jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
- slab_size, slab_size, 0, NULL);
- if (!jbd_slab[i]) {
- printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
- return -ENOMEM;
- }
- return 0;
-}
-
-void * jbd_slab_alloc(size_t size, gfp_t flags)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
-}
-
-void jbd_slab_free(void *ptr, size_t size)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- kmem_cache_free(jbd_slab[idx], ptr);
-}
-
-/*
* Journal_head storage management
*/
static struct kmem_cache *journal_head_cache;
@@ -1881,13 +1793,13 @@ static void __journal_remove_journal_hea
printk(KERN_WARNING "%s: freeing "
"b_frozen_data\n",
__FUNCTION__);
- jbd_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd_free(jh->b_frozen_data, bh->b_size);
}
if (jh->b_committed_data) {
printk(KERN_WARNING "%s: freeing "
"b_committed_data\n",
__FUNCTION__);
- jbd_slab_free(jh->b_committed_data, bh->b_size);
+ jbd_free(jh->b_committed_data, bh->b_size);
}
bh->b_private = NULL;
jh->b_bh = NULL; /* debug, really */
@@ -2042,7 +1954,6 @@ static void journal_destroy_caches(void)
journal_destroy_revoke_caches();
journal_destroy_journal_head_cache();
journal_destroy_handle_cache();
- journal_destroy_jbd_slabs();
}
static int __init journal_init(void)
Index: linux-2.6.23-rc5/include/linux/jbd.h
===================================================================
--- linux-2.6.23-rc5.orig/include/linux/jbd.h 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/include/linux/jbd.h 2007-09-13 13:42:27.000000000 -0700
@@ -71,9 +71,26 @@ extern int journal_enable_debug;
#define jbd_debug(f, a...) /**/
#endif
-extern void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-extern void * jbd_slab_alloc(size_t size, gfp_t flags);
-extern void jbd_slab_free(void *ptr, size_t size);
+static inline void *__jbd_kmalloc(const char *where, size_t size,
+ gfp_t flags, int retry)
+{
+ return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
+}
+
+static inline void jbd_kfree(void *ptr)
+{
+ return kfree(ptr);
+}
+
+static inline void *jbd_alloc(size_t size, gfp_t flags)
+{
+ return (void *)__get_free_pages(flags, get_order(size));
+}
+
+static inline void jbd_free(void *ptr, size_t size)
+{
+ free_pages((unsigned long)ptr, get_order(size));
+};
#define jbd_kmalloc(size, flags) \
__jbd_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
Index: linux-2.6.23-rc5/include/linux/jbd2.h
===================================================================
--- linux-2.6.23-rc5.orig/include/linux/jbd2.h 2007-09-13 13:37:58.000000000 -0700
+++ linux-2.6.23-rc5/include/linux/jbd2.h 2007-09-13 13:51:49.000000000 -0700
@@ -71,11 +71,27 @@ extern u8 jbd2_journal_enable_debug;
#define jbd_debug(f, a...) /**/
#endif
-extern void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-extern void * jbd2_slab_alloc(size_t size, gfp_t flags);
-extern void jbd2_slab_free(void *ptr, size_t size);
+static inline void *__jbd2_kmalloc(const char *where, size_t size,
+ gfp_t flags, int retry)
+{
+ return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
+}
+static inline void jbd2_kfree(void *ptr)
+{
+ return kfree(ptr);
+}
+
+static inline void *jbd2_alloc(size_t size, gfp_t flags)
+{
+ return (void *)__get_free_pages(flags, get_order(size));
+}
+
+static inline void jbd2_free(void *ptr, size_t size)
+{
+ free_pages((unsigned long)ptr, get_order(size));
+};
-#define jbd_kmalloc(size, flags) \
+#define jbd2_kmalloc(size, flags) \
__jbd2_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
#define jbd_rep_kmalloc(size, flags) \
__jbd2_kmalloc(__FUNCTION__, (size), (flags), 1)
@@ -959,12 +975,12 @@ void jbd2_journal_put_journal_head(struc
*/
extern struct kmem_cache *jbd2_handle_cache;
-static inline handle_t *jbd_alloc_handle(gfp_t gfp_flags)
+static inline handle_t *jbd2_alloc_handle(gfp_t gfp_flags)
{
return kmem_cache_alloc(jbd2_handle_cache, gfp_flags);
}
-static inline void jbd_free_handle(handle_t *handle)
+static inline void jbd2_free_handle(handle_t *handle)
{
kmem_cache_free(jbd2_handle_cache, handle);
}
Index: linux-2.6.23-rc5/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd2/journal.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd2/journal.c 2007-09-13 14:00:17.000000000 -0700
@@ -84,7 +84,6 @@ EXPORT_SYMBOL(jbd2_journal_force_commit)
static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
static void __journal_abort_soft (journal_t *journal, int errno);
-static int jbd2_journal_create_jbd_slab(size_t slab_size);
/*
* Helper function used to manage commit timeouts
@@ -335,10 +334,10 @@ repeat:
char *tmp;
jbd_unlock_bh_state(bh_in);
- tmp = jbd2_slab_alloc(bh_in->b_size, GFP_NOFS);
+ tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
jbd_lock_bh_state(bh_in);
if (jh_in->b_frozen_data) {
- jbd2_slab_free(tmp, bh_in->b_size);
+ jbd2_free(tmp, bh_in->b_size);
goto repeat;
}
@@ -655,7 +654,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
- journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+ journal = jbd2_kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -680,7 +679,7 @@ static journal_t * journal_init_common (
/* Set up a default-sized revoke table for the new mount. */
err = jbd2_journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
if (err) {
- kfree(journal);
+ jbd2_kfree(journal);
goto fail;
}
return journal;
@@ -729,7 +728,7 @@ journal_t * jbd2_journal_init_dev(struct
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
- kfree(journal);
+ jbd2_kfree(journal);
journal = NULL;
goto out;
}
@@ -783,7 +782,7 @@ journal_t * jbd2_journal_init_inode (str
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
- kfree(journal);
+ jbd2_kfree(journal);
return NULL;
}
@@ -792,7 +791,7 @@ journal_t * jbd2_journal_init_inode (str
if (err) {
printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
__FUNCTION__);
- kfree(journal);
+ jbd2_kfree(journal);
return NULL;
}
@@ -1096,13 +1095,6 @@ int jbd2_journal_load(journal_t *journal
}
}
- /*
- * Create a slab for this blocksize
- */
- err = jbd2_journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
- if (err)
- return err;
-
/* Let the recovery code check whether it needs to recover any
* data from the journal. */
if (jbd2_journal_recover(journal))
@@ -1167,7 +1159,7 @@ void jbd2_journal_destroy(journal_t *jou
if (journal->j_revoke)
jbd2_journal_destroy_revoke(journal);
kfree(journal->j_wbuf);
- kfree(journal);
+ jbd2_kfree(journal);
}
@@ -1627,86 +1619,6 @@ size_t journal_tag_bytes(journal_t *jour
}
/*
- * Simple support for retrying memory allocations. Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
- return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
- "jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
-};
-
-static void jbd2_journal_destroy_jbd_slabs(void)
-{
- int i;
-
- for (i = 0; i < JBD_MAX_SLABS; i++) {
- if (jbd_slab[i])
- kmem_cache_destroy(jbd_slab[i]);
- jbd_slab[i] = NULL;
- }
-}
-
-static int jbd2_journal_create_jbd_slab(size_t slab_size)
-{
- int i = JBD_SLAB_INDEX(slab_size);
-
- BUG_ON(i >= JBD_MAX_SLABS);
-
- /*
- * Check if we already have a slab created for this size
- */
- if (jbd_slab[i])
- return 0;
-
- /*
- * Create a slab and force alignment to be same as slabsize -
- * this will make sure that allocations won't cross the page
- * boundary.
- */
- jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
- slab_size, slab_size, 0, NULL);
- if (!jbd_slab[i]) {
- printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
- return -ENOMEM;
- }
- return 0;
-}
-
-void * jbd2_slab_alloc(size_t size, gfp_t flags)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
-}
-
-void jbd2_slab_free(void *ptr, size_t size)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- kmem_cache_free(jbd_slab[idx], ptr);
-}
-
-/*
* Journal_head storage management
*/
static struct kmem_cache *jbd2_journal_head_cache;
@@ -1893,13 +1805,13 @@ static void __journal_remove_journal_hea
printk(KERN_WARNING "%s: freeing "
"b_frozen_data\n",
__FUNCTION__);
- jbd2_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd2_free(jh->b_frozen_data, bh->b_size);
}
if (jh->b_committed_data) {
printk(KERN_WARNING "%s: freeing "
"b_committed_data\n",
__FUNCTION__);
- jbd2_slab_free(jh->b_committed_data, bh->b_size);
+ jbd2_free(jh->b_committed_data, bh->b_size);
}
bh->b_private = NULL;
jh->b_bh = NULL; /* debug, really */
@@ -2040,7 +1952,6 @@ static void jbd2_journal_destroy_caches(
jbd2_journal_destroy_revoke_caches();
jbd2_journal_destroy_jbd2_journal_head_cache();
jbd2_journal_destroy_handle_cache();
- jbd2_journal_destroy_jbd_slabs();
}
static int __init journal_init(void)
Index: linux-2.6.23-rc5/fs/jbd/commit.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd/commit.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd/commit.c 2007-09-13 13:40:03.000000000 -0700
@@ -375,7 +375,7 @@ void journal_commit_transaction(journal_
struct buffer_head *bh = jh2bh(jh);
jbd_lock_bh_state(bh);
- jbd_slab_free(jh->b_committed_data, bh->b_size);
+ jbd_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
jbd_unlock_bh_state(bh);
}
@@ -792,14 +792,14 @@ restart_loop:
* Otherwise, we can just throw away the frozen data now.
*/
if (jh->b_committed_data) {
- jbd_slab_free(jh->b_committed_data, bh->b_size);
+ jbd_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
if (jh->b_frozen_data) {
jh->b_committed_data = jh->b_frozen_data;
jh->b_frozen_data = NULL;
}
} else if (jh->b_frozen_data) {
- jbd_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd_free(jh->b_frozen_data, bh->b_size);
jh->b_frozen_data = NULL;
}
Index: linux-2.6.23-rc5/fs/jbd2/commit.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd2/commit.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd2/commit.c 2007-09-13 13:40:03.000000000 -0700
@@ -384,7 +384,7 @@ void jbd2_journal_commit_transaction(jou
struct buffer_head *bh = jh2bh(jh);
jbd_lock_bh_state(bh);
- jbd2_slab_free(jh->b_committed_data, bh->b_size);
+ jbd2_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
jbd_unlock_bh_state(bh);
}
@@ -801,14 +801,14 @@ restart_loop:
* Otherwise, we can just throw away the frozen data now.
*/
if (jh->b_committed_data) {
- jbd2_slab_free(jh->b_committed_data, bh->b_size);
+ jbd2_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
if (jh->b_frozen_data) {
jh->b_committed_data = jh->b_frozen_data;
jh->b_frozen_data = NULL;
}
} else if (jh->b_frozen_data) {
- jbd2_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd2_free(jh->b_frozen_data, bh->b_size);
jh->b_frozen_data = NULL;
}
Index: linux-2.6.23-rc5/fs/jbd/transaction.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd/transaction.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd/transaction.c 2007-09-13 13:46:23.000000000 -0700
@@ -229,7 +229,7 @@ repeat_locked:
spin_unlock(&journal->j_state_lock);
out:
if (unlikely(new_transaction)) /* It's usually NULL */
- kfree(new_transaction);
+ jbd_kfree(new_transaction);
return ret;
}
@@ -668,7 +668,7 @@ repeat:
JBUFFER_TRACE(jh, "allocate memory for buffer");
jbd_unlock_bh_state(bh);
frozen_buffer =
- jbd_slab_alloc(jh2bh(jh)->b_size,
+ jbd_alloc(jh2bh(jh)->b_size,
GFP_NOFS);
if (!frozen_buffer) {
printk(KERN_EMERG
@@ -728,7 +728,7 @@ done:
out:
if (unlikely(frozen_buffer)) /* It's usually NULL */
- jbd_slab_free(frozen_buffer, bh->b_size);
+ jbd_free(frozen_buffer, bh->b_size);
JBUFFER_TRACE(jh, "exit");
return error;
@@ -881,7 +881,7 @@ int journal_get_undo_access(handle_t *ha
repeat:
if (!jh->b_committed_data) {
- committed_data = jbd_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+ committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
if (!committed_data) {
printk(KERN_EMERG "%s: No memory for committed data\n",
__FUNCTION__);
@@ -908,7 +908,7 @@ repeat:
out:
journal_put_journal_head(jh);
if (unlikely(committed_data))
- jbd_slab_free(committed_data, bh->b_size);
+ jbd_free(committed_data, bh->b_size);
return err;
}
Index: linux-2.6.23-rc5/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd2/transaction.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd2/transaction.c 2007-09-13 13:59:20.000000000 -0700
@@ -96,7 +96,7 @@ static int start_this_handle(journal_t *
alloc_transaction:
if (!journal->j_running_transaction) {
- new_transaction = jbd_kmalloc(sizeof(*new_transaction),
+ new_transaction = jbd2_kmalloc(sizeof(*new_transaction),
GFP_NOFS);
if (!new_transaction) {
ret = -ENOMEM;
@@ -229,14 +229,14 @@ repeat_locked:
spin_unlock(&journal->j_state_lock);
out:
if (unlikely(new_transaction)) /* It's usually NULL */
- kfree(new_transaction);
+ jbd2_kfree(new_transaction);
return ret;
}
/* Allocate a new handle. This should probably be in a slab... */
static handle_t *new_handle(int nblocks)
{
- handle_t *handle = jbd_alloc_handle(GFP_NOFS);
+ handle_t *handle = jbd2_alloc_handle(GFP_NOFS);
if (!handle)
return NULL;
memset(handle, 0, sizeof(*handle));
@@ -282,7 +282,7 @@ handle_t *jbd2_journal_start(journal_t *
err = start_this_handle(journal, handle);
if (err < 0) {
- jbd_free_handle(handle);
+ jbd2_free_handle(handle);
current->journal_info = NULL;
handle = ERR_PTR(err);
}
@@ -668,7 +668,7 @@ repeat:
JBUFFER_TRACE(jh, "allocate memory for buffer");
jbd_unlock_bh_state(bh);
frozen_buffer =
- jbd2_slab_alloc(jh2bh(jh)->b_size,
+ jbd2_alloc(jh2bh(jh)->b_size,
GFP_NOFS);
if (!frozen_buffer) {
printk(KERN_EMERG
@@ -728,7 +728,7 @@ done:
out:
if (unlikely(frozen_buffer)) /* It's usually NULL */
- jbd2_slab_free(frozen_buffer, bh->b_size);
+ jbd2_free(frozen_buffer, bh->b_size);
JBUFFER_TRACE(jh, "exit");
return error;
@@ -881,7 +881,7 @@ int jbd2_journal_get_undo_access(handle_
repeat:
if (!jh->b_committed_data) {
- committed_data = jbd2_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+ committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
if (!committed_data) {
printk(KERN_EMERG "%s: No memory for committed data\n",
__FUNCTION__);
@@ -908,7 +908,7 @@ repeat:
out:
jbd2_journal_put_journal_head(jh);
if (unlikely(committed_data))
- jbd2_slab_free(committed_data, bh->b_size);
+ jbd2_free(committed_data, bh->b_size);
return err;
}
@@ -1411,7 +1411,7 @@ int jbd2_journal_stop(handle_t *handle)
spin_unlock(&journal->j_state_lock);
}
- jbd_free_handle(handle);
+ jbd2_free_handle(handle);
return err;
}
Index: linux-2.6.23-rc5/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd/checkpoint.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd/checkpoint.c 2007-09-14 09:57:21.000000000 -0700
@@ -693,5 +693,5 @@ void __journal_drop_transaction(journal_
J_ASSERT(journal->j_running_transaction != transaction);
jbd_debug(1, "Dropping transaction %d, all done\n", transaction->t_tid);
- kfree(transaction);
+ jbd_kfree(transaction);
}
Index: linux-2.6.23-rc5/fs/jbd2/checkpoint.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd2/checkpoint.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd2/checkpoint.c 2007-09-14 09:57:03.000000000 -0700
@@ -693,5 +693,5 @@ void __jbd2_journal_drop_transaction(jou
J_ASSERT(journal->j_running_transaction != transaction);
jbd_debug(1, "Dropping transaction %d, all done\n", transaction->t_tid);
- kfree(transaction);
+ jbd2_kfree(transaction);
}
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] JBD slab cleanups
2007-09-14 18:53 ` [PATCH] JBD slab cleanups Mingming Cao
@ 2007-09-14 18:58 ` Christoph Lameter
2007-09-17 19:29 ` Mingming Cao
1 sibling, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2007-09-14 18:58 UTC (permalink / raw)
To: Mingming Cao
Cc: Christoph Hellwig, linux-fsdevel, ext4 development, linux-kernel
Thanks Mingming.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] JBD slab cleanups
2007-09-14 18:53 ` [PATCH] JBD slab cleanups Mingming Cao
2007-09-14 18:58 ` Christoph Lameter
@ 2007-09-17 19:29 ` Mingming Cao
2007-09-17 19:34 ` Christoph Hellwig
2007-09-17 22:01 ` Badari Pulavarty
1 sibling, 2 replies; 56+ messages in thread
From: Mingming Cao @ 2007-09-17 19:29 UTC (permalink / raw)
To: Christoph Hellwig, pbadari
Cc: Christoph Lameter, linux-fsdevel, ext4 development, linux-kernel
On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
> jbd/jbd2: Replace slab allocations with page cache allocations
>
> From: Christoph Lameter <clameter@sgi.com>
>
> JBD should not pass slab pages down to the block layer.
> Use page allocator pages instead. This will also prepare
> JBD for the large blocksize patchset.
>
Currently, memory allocation for committed_data (and frozen_buffer) in the
buffer head is done through JBD's slab management. As Christoph Hellwig
pointed out, this is broken because JBD should not pass slab pages down to
the IO layer; he suggested using get_free_pages() directly instead.
The problem with this patch, as Andreas Dilger pointed out in today's ext4
interlock call, is that for 1k and 2k block sizes on ext2/3/4,
get_free_pages() wastes half to three-quarters of each page.
What was the original intention behind setting up slabs for committed_data
(and frozen_buffer) in JBD? Why not use kmalloc()?
Mingming
> Tested on 2.6.23-rc6 with fsx runs fine.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: Mingming Cao <cmm@us.ibm.com>
> ---
> fs/jbd/checkpoint.c | 2
> fs/jbd/commit.c | 6 +-
> fs/jbd/journal.c | 107 ++++---------------------------------------------
> fs/jbd/transaction.c | 10 ++--
> fs/jbd2/checkpoint.c | 2
> fs/jbd2/commit.c | 6 +-
> fs/jbd2/journal.c | 109 ++++----------------------------------------------
> fs/jbd2/transaction.c | 18 ++++----
> include/linux/jbd.h | 23 +++++++++-
> include/linux/jbd2.h | 28 ++++++++++--
> 10 files changed, 83 insertions(+), 228 deletions(-)
>
> Index: linux-2.6.23-rc5/fs/jbd/journal.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd/journal.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd/journal.c 2007-09-13 13:45:39.000000000 -0700
> @@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
>
> static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
> static void __journal_abort_soft (journal_t *journal, int errno);
> -static int journal_create_jbd_slab(size_t slab_size);
>
> /*
> * Helper function used to manage commit timeouts
> @@ -334,10 +333,10 @@ repeat:
> char *tmp;
>
> jbd_unlock_bh_state(bh_in);
> - tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
> + tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
> jbd_lock_bh_state(bh_in);
> if (jh_in->b_frozen_data) {
> - jbd_slab_free(tmp, bh_in->b_size);
> + jbd_free(tmp, bh_in->b_size);
> goto repeat;
> }
>
> @@ -679,7 +678,7 @@ static journal_t * journal_init_common (
> /* Set up a default-sized revoke table for the new mount. */
> err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
> if (err) {
> - kfree(journal);
> + jbd_kfree(journal);
> goto fail;
> }
> return journal;
> @@ -728,7 +727,7 @@ journal_t * journal_init_dev(struct bloc
> if (!journal->j_wbuf) {
> printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> __FUNCTION__);
> - kfree(journal);
> + jbd_kfree(journal);
> journal = NULL;
> goto out;
> }
> @@ -782,7 +781,7 @@ journal_t * journal_init_inode (struct i
> if (!journal->j_wbuf) {
> printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> __FUNCTION__);
> - kfree(journal);
> + jbd_kfree(journal);
> return NULL;
> }
>
> @@ -791,7 +790,7 @@ journal_t * journal_init_inode (struct i
> if (err) {
> printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
> __FUNCTION__);
> - kfree(journal);
> + jbd_kfree(journal);
> return NULL;
> }
>
> @@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
> }
> }
>
> - /*
> - * Create a slab for this blocksize
> - */
> - err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
> - if (err)
> - return err;
> -
> /* Let the recovery code check whether it needs to recover any
> * data from the journal. */
> if (journal_recover(journal))
> @@ -1166,7 +1158,7 @@ void journal_destroy(journal_t *journal)
> if (journal->j_revoke)
> journal_destroy_revoke(journal);
> kfree(journal->j_wbuf);
> - kfree(journal);
> + jbd_kfree(journal);
> }
>
>
> @@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
> }
>
> /*
> - * Simple support for retrying memory allocations. Introduced to help to
> - * debug different VM deadlock avoidance strategies.
> - */
> -void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
> -{
> - return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
> -}
> -
> -/*
> - * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
> - * and allocate frozen and commit buffers from these slabs.
> - *
> - * Reason for doing this is to avoid, SLAB_DEBUG - since it could
> - * cause bh to cross page boundary.
> - */
> -
> -#define JBD_MAX_SLABS 5
> -#define JBD_SLAB_INDEX(size) (size >> 11)
> -
> -static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
> -static const char *jbd_slab_names[JBD_MAX_SLABS] = {
> - "jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
> -};
> -
> -static void journal_destroy_jbd_slabs(void)
> -{
> - int i;
> -
> - for (i = 0; i < JBD_MAX_SLABS; i++) {
> - if (jbd_slab[i])
> - kmem_cache_destroy(jbd_slab[i]);
> - jbd_slab[i] = NULL;
> - }
> -}
> -
> -static int journal_create_jbd_slab(size_t slab_size)
> -{
> - int i = JBD_SLAB_INDEX(slab_size);
> -
> - BUG_ON(i >= JBD_MAX_SLABS);
> -
> - /*
> - * Check if we already have a slab created for this size
> - */
> - if (jbd_slab[i])
> - return 0;
> -
> - /*
> - * Create a slab and force alignment to be same as slabsize -
> - * this will make sure that allocations won't cross the page
> - * boundary.
> - */
> - jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
> - slab_size, slab_size, 0, NULL);
> - if (!jbd_slab[i]) {
> - printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
> - return -ENOMEM;
> - }
> - return 0;
> -}
> -
> -void * jbd_slab_alloc(size_t size, gfp_t flags)
> -{
> - int idx;
> -
> - idx = JBD_SLAB_INDEX(size);
> - BUG_ON(jbd_slab[idx] == NULL);
> - return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
> -}
> -
> -void jbd_slab_free(void *ptr, size_t size)
> -{
> - int idx;
> -
> - idx = JBD_SLAB_INDEX(size);
> - BUG_ON(jbd_slab[idx] == NULL);
> - kmem_cache_free(jbd_slab[idx], ptr);
> -}
> -
> -/*
> * Journal_head storage management
> */
> static struct kmem_cache *journal_head_cache;
> @@ -1881,13 +1793,13 @@ static void __journal_remove_journal_hea
> printk(KERN_WARNING "%s: freeing "
> "b_frozen_data\n",
> __FUNCTION__);
> - jbd_slab_free(jh->b_frozen_data, bh->b_size);
> + jbd_free(jh->b_frozen_data, bh->b_size);
> }
> if (jh->b_committed_data) {
> printk(KERN_WARNING "%s: freeing "
> "b_committed_data\n",
> __FUNCTION__);
> - jbd_slab_free(jh->b_committed_data, bh->b_size);
> + jbd_free(jh->b_committed_data, bh->b_size);
> }
> bh->b_private = NULL;
> jh->b_bh = NULL; /* debug, really */
> @@ -2042,7 +1954,6 @@ static void journal_destroy_caches(void)
> journal_destroy_revoke_caches();
> journal_destroy_journal_head_cache();
> journal_destroy_handle_cache();
> - journal_destroy_jbd_slabs();
> }
>
> static int __init journal_init(void)
> Index: linux-2.6.23-rc5/include/linux/jbd.h
> ===================================================================
> --- linux-2.6.23-rc5.orig/include/linux/jbd.h 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/include/linux/jbd.h 2007-09-13 13:42:27.000000000 -0700
> @@ -71,9 +71,26 @@ extern int journal_enable_debug;
> #define jbd_debug(f, a...) /**/
> #endif
>
> -extern void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
> -extern void * jbd_slab_alloc(size_t size, gfp_t flags);
> -extern void jbd_slab_free(void *ptr, size_t size);
> +static inline void *__jbd_kmalloc(const char *where, size_t size,
> + gfp_t flags, int retry)
> +{
> + return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
> +}
> +
> +static inline void jbd_kfree(void *ptr)
> +{
> + return kfree(ptr);
> +}
> +
> +static inline void *jbd_alloc(size_t size, gfp_t flags)
> +{
> + return (void *)__get_free_pages(flags, get_order(size));
> +}
> +
> +static inline void jbd_free(void *ptr, size_t size)
> +{
> + free_pages((unsigned long)ptr, get_order(size));
> +};
>
> #define jbd_kmalloc(size, flags) \
> __jbd_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
> Index: linux-2.6.23-rc5/include/linux/jbd2.h
> ===================================================================
> --- linux-2.6.23-rc5.orig/include/linux/jbd2.h 2007-09-13 13:37:58.000000000 -0700
> +++ linux-2.6.23-rc5/include/linux/jbd2.h 2007-09-13 13:51:49.000000000 -0700
> @@ -71,11 +71,27 @@ extern u8 jbd2_journal_enable_debug;
> #define jbd_debug(f, a...) /**/
> #endif
>
> -extern void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
> -extern void * jbd2_slab_alloc(size_t size, gfp_t flags);
> -extern void jbd2_slab_free(void *ptr, size_t size);
> +static inline void *__jbd2_kmalloc(const char *where, size_t size,
> + gfp_t flags, int retry)
> +{
> + return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
> +}
> +static inline void jbd2_kfree(void *ptr)
> +{
> + return kfree(ptr);
> +}
> +
> +static inline void *jbd2_alloc(size_t size, gfp_t flags)
> +{
> + return (void *)__get_free_pages(flags, get_order(size));
> +}
> +
> +static inline void jbd2_free(void *ptr, size_t size)
> +{
> + free_pages((unsigned long)ptr, get_order(size));
> +};
>
> -#define jbd_kmalloc(size, flags) \
> +#define jbd2_kmalloc(size, flags) \
> __jbd2_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
> #define jbd_rep_kmalloc(size, flags) \
> __jbd2_kmalloc(__FUNCTION__, (size), (flags), 1)
> @@ -959,12 +975,12 @@ void jbd2_journal_put_journal_head(struc
> */
> extern struct kmem_cache *jbd2_handle_cache;
>
> -static inline handle_t *jbd_alloc_handle(gfp_t gfp_flags)
> +static inline handle_t *jbd2_alloc_handle(gfp_t gfp_flags)
> {
> return kmem_cache_alloc(jbd2_handle_cache, gfp_flags);
> }
>
> -static inline void jbd_free_handle(handle_t *handle)
> +static inline void jbd2_free_handle(handle_t *handle)
> {
> kmem_cache_free(jbd2_handle_cache, handle);
> }
> Index: linux-2.6.23-rc5/fs/jbd2/journal.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd2/journal.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd2/journal.c 2007-09-13 14:00:17.000000000 -0700
> @@ -84,7 +84,6 @@ EXPORT_SYMBOL(jbd2_journal_force_commit)
>
> static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
> static void __journal_abort_soft (journal_t *journal, int errno);
> -static int jbd2_journal_create_jbd_slab(size_t slab_size);
>
> /*
> * Helper function used to manage commit timeouts
> @@ -335,10 +334,10 @@ repeat:
> char *tmp;
>
> jbd_unlock_bh_state(bh_in);
> - tmp = jbd2_slab_alloc(bh_in->b_size, GFP_NOFS);
> + tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
> jbd_lock_bh_state(bh_in);
> if (jh_in->b_frozen_data) {
> - jbd2_slab_free(tmp, bh_in->b_size);
> + jbd2_free(tmp, bh_in->b_size);
> goto repeat;
> }
>
> @@ -655,7 +654,7 @@ static journal_t * journal_init_common (
> journal_t *journal;
> int err;
>
> - journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
> + journal = jbd2_kmalloc(sizeof(*journal), GFP_KERNEL);
> if (!journal)
> goto fail;
> memset(journal, 0, sizeof(*journal));
> @@ -680,7 +679,7 @@ static journal_t * journal_init_common (
> /* Set up a default-sized revoke table for the new mount. */
> err = jbd2_journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
> if (err) {
> - kfree(journal);
> + jbd2_kfree(journal);
> goto fail;
> }
> return journal;
> @@ -729,7 +728,7 @@ journal_t * jbd2_journal_init_dev(struct
> if (!journal->j_wbuf) {
> printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> __FUNCTION__);
> - kfree(journal);
> + jbd2_kfree(journal);
> journal = NULL;
> goto out;
> }
> @@ -783,7 +782,7 @@ journal_t * jbd2_journal_init_inode (str
> if (!journal->j_wbuf) {
> printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> __FUNCTION__);
> - kfree(journal);
> + jbd2_kfree(journal);
> return NULL;
> }
>
> @@ -792,7 +791,7 @@ journal_t * jbd2_journal_init_inode (str
> if (err) {
> printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
> __FUNCTION__);
> - kfree(journal);
> + jbd2_kfree(journal);
> return NULL;
> }
>
> @@ -1096,13 +1095,6 @@ int jbd2_journal_load(journal_t *journal
> }
> }
>
> - /*
> - * Create a slab for this blocksize
> - */
> - err = jbd2_journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
> - if (err)
> - return err;
> -
> /* Let the recovery code check whether it needs to recover any
> * data from the journal. */
> if (jbd2_journal_recover(journal))
> @@ -1167,7 +1159,7 @@ void jbd2_journal_destroy(journal_t *jou
> if (journal->j_revoke)
> jbd2_journal_destroy_revoke(journal);
> kfree(journal->j_wbuf);
> - kfree(journal);
> + jbd2_kfree(journal);
> }
>
>
> @@ -1627,86 +1619,6 @@ size_t journal_tag_bytes(journal_t *jour
> }
>
> /*
> - * Simple support for retrying memory allocations. Introduced to help to
> - * debug different VM deadlock avoidance strategies.
> - */
> -void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
> -{
> - return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
> -}
> -
> -/*
> - * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
> - * and allocate frozen and commit buffers from these slabs.
> - *
> - * Reason for doing this is to avoid, SLAB_DEBUG - since it could
> - * cause bh to cross page boundary.
> - */
> -
> -#define JBD_MAX_SLABS 5
> -#define JBD_SLAB_INDEX(size) (size >> 11)
> -
> -static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
> -static const char *jbd_slab_names[JBD_MAX_SLABS] = {
> - "jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
> -};
> -
> -static void jbd2_journal_destroy_jbd_slabs(void)
> -{
> - int i;
> -
> - for (i = 0; i < JBD_MAX_SLABS; i++) {
> - if (jbd_slab[i])
> - kmem_cache_destroy(jbd_slab[i]);
> - jbd_slab[i] = NULL;
> - }
> -}
> -
> -static int jbd2_journal_create_jbd_slab(size_t slab_size)
> -{
> - int i = JBD_SLAB_INDEX(slab_size);
> -
> - BUG_ON(i >= JBD_MAX_SLABS);
> -
> - /*
> - * Check if we already have a slab created for this size
> - */
> - if (jbd_slab[i])
> - return 0;
> -
> - /*
> - * Create a slab and force alignment to be same as slabsize -
> - * this will make sure that allocations won't cross the page
> - * boundary.
> - */
> - jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
> - slab_size, slab_size, 0, NULL);
> - if (!jbd_slab[i]) {
> - printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
> - return -ENOMEM;
> - }
> - return 0;
> -}
> -
> -void * jbd2_slab_alloc(size_t size, gfp_t flags)
> -{
> - int idx;
> -
> - idx = JBD_SLAB_INDEX(size);
> - BUG_ON(jbd_slab[idx] == NULL);
> - return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
> -}
> -
> -void jbd2_slab_free(void *ptr, size_t size)
> -{
> - int idx;
> -
> - idx = JBD_SLAB_INDEX(size);
> - BUG_ON(jbd_slab[idx] == NULL);
> - kmem_cache_free(jbd_slab[idx], ptr);
> -}
> -
> -/*
> * Journal_head storage management
> */
> static struct kmem_cache *jbd2_journal_head_cache;
> @@ -1893,13 +1805,13 @@ static void __journal_remove_journal_hea
> printk(KERN_WARNING "%s: freeing "
> "b_frozen_data\n",
> __FUNCTION__);
> - jbd2_slab_free(jh->b_frozen_data, bh->b_size);
> + jbd2_free(jh->b_frozen_data, bh->b_size);
> }
> if (jh->b_committed_data) {
> printk(KERN_WARNING "%s: freeing "
> "b_committed_data\n",
> __FUNCTION__);
> - jbd2_slab_free(jh->b_committed_data, bh->b_size);
> + jbd2_free(jh->b_committed_data, bh->b_size);
> }
> bh->b_private = NULL;
> jh->b_bh = NULL; /* debug, really */
> @@ -2040,7 +1952,6 @@ static void jbd2_journal_destroy_caches(
> jbd2_journal_destroy_revoke_caches();
> jbd2_journal_destroy_jbd2_journal_head_cache();
> jbd2_journal_destroy_handle_cache();
> - jbd2_journal_destroy_jbd_slabs();
> }
>
> static int __init journal_init(void)
> Index: linux-2.6.23-rc5/fs/jbd/commit.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd/commit.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd/commit.c 2007-09-13 13:40:03.000000000 -0700
> @@ -375,7 +375,7 @@ void journal_commit_transaction(journal_
> struct buffer_head *bh = jh2bh(jh);
>
> jbd_lock_bh_state(bh);
> - jbd_slab_free(jh->b_committed_data, bh->b_size);
> + jbd_free(jh->b_committed_data, bh->b_size);
> jh->b_committed_data = NULL;
> jbd_unlock_bh_state(bh);
> }
> @@ -792,14 +792,14 @@ restart_loop:
> * Otherwise, we can just throw away the frozen data now.
> */
> if (jh->b_committed_data) {
> - jbd_slab_free(jh->b_committed_data, bh->b_size);
> + jbd_free(jh->b_committed_data, bh->b_size);
> jh->b_committed_data = NULL;
> if (jh->b_frozen_data) {
> jh->b_committed_data = jh->b_frozen_data;
> jh->b_frozen_data = NULL;
> }
> } else if (jh->b_frozen_data) {
> - jbd_slab_free(jh->b_frozen_data, bh->b_size);
> + jbd_free(jh->b_frozen_data, bh->b_size);
> jh->b_frozen_data = NULL;
> }
>
> Index: linux-2.6.23-rc5/fs/jbd2/commit.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd2/commit.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd2/commit.c 2007-09-13 13:40:03.000000000 -0700
> @@ -384,7 +384,7 @@ void jbd2_journal_commit_transaction(jou
> struct buffer_head *bh = jh2bh(jh);
>
> jbd_lock_bh_state(bh);
> - jbd2_slab_free(jh->b_committed_data, bh->b_size);
> + jbd2_free(jh->b_committed_data, bh->b_size);
> jh->b_committed_data = NULL;
> jbd_unlock_bh_state(bh);
> }
> @@ -801,14 +801,14 @@ restart_loop:
> * Otherwise, we can just throw away the frozen data now.
> */
> if (jh->b_committed_data) {
> - jbd2_slab_free(jh->b_committed_data, bh->b_size);
> + jbd2_free(jh->b_committed_data, bh->b_size);
> jh->b_committed_data = NULL;
> if (jh->b_frozen_data) {
> jh->b_committed_data = jh->b_frozen_data;
> jh->b_frozen_data = NULL;
> }
> } else if (jh->b_frozen_data) {
> - jbd2_slab_free(jh->b_frozen_data, bh->b_size);
> + jbd2_free(jh->b_frozen_data, bh->b_size);
> jh->b_frozen_data = NULL;
> }
>
> Index: linux-2.6.23-rc5/fs/jbd/transaction.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd/transaction.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd/transaction.c 2007-09-13 13:46:23.000000000 -0700
> @@ -229,7 +229,7 @@ repeat_locked:
> spin_unlock(&journal->j_state_lock);
> out:
> if (unlikely(new_transaction)) /* It's usually NULL */
> - kfree(new_transaction);
> + jbd_kfree(new_transaction);
> return ret;
> }
>
> @@ -668,7 +668,7 @@ repeat:
> JBUFFER_TRACE(jh, "allocate memory for buffer");
> jbd_unlock_bh_state(bh);
> frozen_buffer =
> - jbd_slab_alloc(jh2bh(jh)->b_size,
> + jbd_alloc(jh2bh(jh)->b_size,
> GFP_NOFS);
> if (!frozen_buffer) {
> printk(KERN_EMERG
> @@ -728,7 +728,7 @@ done:
>
> out:
> if (unlikely(frozen_buffer)) /* It's usually NULL */
> - jbd_slab_free(frozen_buffer, bh->b_size);
> + jbd_free(frozen_buffer, bh->b_size);
>
> JBUFFER_TRACE(jh, "exit");
> return error;
> @@ -881,7 +881,7 @@ int journal_get_undo_access(handle_t *ha
>
> repeat:
> if (!jh->b_committed_data) {
> - committed_data = jbd_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> + committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> if (!committed_data) {
> printk(KERN_EMERG "%s: No memory for committed data\n",
> __FUNCTION__);
> @@ -908,7 +908,7 @@ repeat:
> out:
> journal_put_journal_head(jh);
> if (unlikely(committed_data))
> - jbd_slab_free(committed_data, bh->b_size);
> + jbd_free(committed_data, bh->b_size);
> return err;
> }
>
> Index: linux-2.6.23-rc5/fs/jbd2/transaction.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd2/transaction.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd2/transaction.c 2007-09-13 13:59:20.000000000 -0700
> @@ -96,7 +96,7 @@ static int start_this_handle(journal_t *
>
> alloc_transaction:
> if (!journal->j_running_transaction) {
> - new_transaction = jbd_kmalloc(sizeof(*new_transaction),
> + new_transaction = jbd2_kmalloc(sizeof(*new_transaction),
> GFP_NOFS);
> if (!new_transaction) {
> ret = -ENOMEM;
> @@ -229,14 +229,14 @@ repeat_locked:
> spin_unlock(&journal->j_state_lock);
> out:
> if (unlikely(new_transaction)) /* It's usually NULL */
> - kfree(new_transaction);
> + jbd2_kfree(new_transaction);
> return ret;
> }
>
> /* Allocate a new handle. This should probably be in a slab... */
> static handle_t *new_handle(int nblocks)
> {
> - handle_t *handle = jbd_alloc_handle(GFP_NOFS);
> + handle_t *handle = jbd2_alloc_handle(GFP_NOFS);
> if (!handle)
> return NULL;
> memset(handle, 0, sizeof(*handle));
> @@ -282,7 +282,7 @@ handle_t *jbd2_journal_start(journal_t *
>
> err = start_this_handle(journal, handle);
> if (err < 0) {
> - jbd_free_handle(handle);
> + jbd2_free_handle(handle);
> current->journal_info = NULL;
> handle = ERR_PTR(err);
> }
> @@ -668,7 +668,7 @@ repeat:
> JBUFFER_TRACE(jh, "allocate memory for buffer");
> jbd_unlock_bh_state(bh);
> frozen_buffer =
> - jbd2_slab_alloc(jh2bh(jh)->b_size,
> + jbd2_alloc(jh2bh(jh)->b_size,
> GFP_NOFS);
> if (!frozen_buffer) {
> printk(KERN_EMERG
> @@ -728,7 +728,7 @@ done:
>
> out:
> if (unlikely(frozen_buffer)) /* It's usually NULL */
> - jbd2_slab_free(frozen_buffer, bh->b_size);
> + jbd2_free(frozen_buffer, bh->b_size);
>
> JBUFFER_TRACE(jh, "exit");
> return error;
> @@ -881,7 +881,7 @@ int jbd2_journal_get_undo_access(handle_
>
> repeat:
> if (!jh->b_committed_data) {
> - committed_data = jbd2_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> + committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> if (!committed_data) {
> printk(KERN_EMERG "%s: No memory for committed data\n",
> __FUNCTION__);
> @@ -908,7 +908,7 @@ repeat:
> out:
> jbd2_journal_put_journal_head(jh);
> if (unlikely(committed_data))
> - jbd2_slab_free(committed_data, bh->b_size);
> + jbd2_free(committed_data, bh->b_size);
> return err;
> }
>
> @@ -1411,7 +1411,7 @@ int jbd2_journal_stop(handle_t *handle)
> spin_unlock(&journal->j_state_lock);
> }
>
> - jbd_free_handle(handle);
> + jbd2_free_handle(handle);
> return err;
> }
>
> Index: linux-2.6.23-rc5/fs/jbd/checkpoint.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd/checkpoint.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd/checkpoint.c 2007-09-14 09:57:21.000000000 -0700
> @@ -693,5 +693,5 @@ void __journal_drop_transaction(journal_
> J_ASSERT(journal->j_running_transaction != transaction);
>
> jbd_debug(1, "Dropping transaction %d, all done\n", transaction->t_tid);
> - kfree(transaction);
> + jbd_kfree(transaction);
> }
> Index: linux-2.6.23-rc5/fs/jbd2/checkpoint.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd2/checkpoint.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd2/checkpoint.c 2007-09-14 09:57:03.000000000 -0700
> @@ -693,5 +693,5 @@ void __jbd2_journal_drop_transaction(jou
> J_ASSERT(journal->j_running_transaction != transaction);
>
> jbd_debug(1, "Dropping transaction %d, all done\n", transaction->t_tid);
> - kfree(transaction);
> + jbd2_kfree(transaction);
> }
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] JBD slab cleanups
2007-09-17 19:29 ` Mingming Cao
@ 2007-09-17 19:34 ` Christoph Hellwig
2007-09-17 22:01 ` Badari Pulavarty
1 sibling, 0 replies; 56+ messages in thread
From: Christoph Hellwig @ 2007-09-17 19:34 UTC (permalink / raw)
To: Mingming Cao
Cc: Christoph Hellwig, pbadari, Christoph Lameter, linux-fsdevel,
ext4 development, linux-kernel
On Mon, Sep 17, 2007 at 12:29:51PM -0700, Mingming Cao wrote:
> The problem with this patch, as Andreas Dilger pointed out today in the
> ext4 interlock call, is that for 1k/2k block size ext2/3/4,
> get_free_pages() wastes 1/3-1/2 of the page space.
>
> What was the original intention in setting up slabs for committed_data (and
> frozen_buffer) in JBD? Why not use kmalloc?
kmalloc is using slabs :)
The intent was to avoid the wasted memory, but as we've repeated a gazillion
times, wasted memory on a rather rare codepath doesn't really matter when
you just crash random storage drivers otherwise.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] JBD slab cleanups
2007-09-17 19:29 ` Mingming Cao
2007-09-17 19:34 ` Christoph Hellwig
@ 2007-09-17 22:01 ` Badari Pulavarty
2007-09-17 22:57 ` Mingming Cao
1 sibling, 1 reply; 56+ messages in thread
From: Badari Pulavarty @ 2007-09-17 22:01 UTC (permalink / raw)
To: cmm
Cc: Christoph Hellwig, Christoph Lameter, linux-fsdevel,
ext4 development, lkml
On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote:
> On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
> > jbd/jbd2: Replace slab allocations with page cache allocations
> >
> > From: Christoph Lameter <clameter@sgi.com>
> >
> > JBD should not pass slab pages down to the block layer.
> > Use page allocator pages instead. This will also prepare
> > JBD for the large blocksize patchset.
> >
>
> Currently, memory allocation for committed_data (and frozen_buffer) for
> the buffer head is done through jbd slab management. As Christoph Hellwig
> pointed out, this is broken, as jbd should not pass slab pages down
> to the IO layer, and he suggested using get_free_pages() directly.
>
> The problem with this patch, as Andreas Dilger pointed out today in the
> ext4 interlock call, is that for 1k/2k block size ext2/3/4,
> get_free_pages() wastes 1/3-1/2 of the page space.
>
> What was the original intention in setting up slabs for committed_data (and
> frozen_buffer) in JBD? Why not use kmalloc?
>
> Mingming
Looks good. Small suggestion is to get rid of all kmalloc() usages and
consistently use jbd_kmalloc() or jbd2_kmalloc().
Thanks,
Badari
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] JBD slab cleanups
2007-09-17 22:01 ` Badari Pulavarty
@ 2007-09-17 22:57 ` Mingming Cao
2007-09-18 9:04 ` Christoph Hellwig
0 siblings, 1 reply; 56+ messages in thread
From: Mingming Cao @ 2007-09-17 22:57 UTC (permalink / raw)
To: Badari Pulavarty
Cc: Christoph Hellwig, Christoph Lameter, linux-fsdevel,
ext4 development, lkml
On Mon, 2007-09-17 at 15:01 -0700, Badari Pulavarty wrote:
> On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote:
> > On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
> > > jbd/jbd2: Replace slab allocations with page cache allocations
> > >
> > > From: Christoph Lameter <clameter@sgi.com>
> > >
> > > JBD should not pass slab pages down to the block layer.
> > > Use page allocator pages instead. This will also prepare
> > > JBD for the large blocksize patchset.
> > >
> >
> > Currently, memory allocation for committed_data (and frozen_buffer) for
> > the buffer head is done through jbd slab management. As Christoph Hellwig
> > pointed out, this is broken, as jbd should not pass slab pages down
> > to the IO layer, and he suggested using get_free_pages() directly.
> >
> > The problem with this patch, as Andreas Dilger pointed out today in the
> > ext4 interlock call, is that for 1k/2k block size ext2/3/4,
> > get_free_pages() wastes 1/3-1/2 of the page space.
> >
> > What was the original intention in setting up slabs for committed_data (and
> > frozen_buffer) in JBD? Why not use kmalloc?
> >
> > Mingming
>
> Looks good. Small suggestion is to get rid of all kmalloc() usages and
> consistently use jbd_kmalloc() or jbd2_kmalloc().
>
> Thanks,
> Badari
>
Here is the incremental small cleanup patch.
Remove kmalloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_kmalloc.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/jbd/journal.c | 8 +++++---
fs/jbd/revoke.c | 12 ++++++------
fs/jbd2/journal.c | 8 +++++---
fs/jbd2/revoke.c | 12 ++++++------
4 files changed, 22 insertions(+), 18 deletions(-)
Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-17 14:32:16.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-17 14:33:59.000000000 -0700
@@ -723,7 +723,8 @@ journal_t * journal_init_dev(struct bloc
journal->j_blocksize = blocksize;
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+ GFP_KERNEL);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -777,7 +778,8 @@ journal_t * journal_init_inode (struct i
/* journal descriptor can store up to n blocks -bzzz */
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+ GFP_KERNEL);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -1157,7 +1159,7 @@ void journal_destroy(journal_t *journal)
iput(journal->j_inode);
if (journal->j_revoke)
journal_destroy_revoke(journal);
- kfree(journal->j_wbuf);
+ jbd_kfree(journal->j_wbuf);
jbd_kfree(journal);
}
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-17 14:32:22.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/revoke.c 2007-09-17 14:35:13.000000000 -0700
@@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
journal->j_revoke->hash_shift = shift;
journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
if (!journal->j_revoke->hash_table) {
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
journal->j_revoke = NULL;
@@ -231,7 +231,7 @@ int journal_init_revoke(journal_t *journ
journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
if (!journal->j_revoke_table[1]) {
- kfree(journal->j_revoke_table[0]->hash_table);
+ jbd_kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
return -ENOMEM;
}
@@ -246,9 +246,9 @@ int journal_init_revoke(journal_t *journ
journal->j_revoke->hash_shift = shift;
journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
if (!journal->j_revoke->hash_table) {
- kfree(journal->j_revoke_table[0]->hash_table);
+ jbd_kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[1]);
journal->j_revoke = NULL;
@@ -280,7 +280,7 @@ void journal_destroy_revoke(journal_t *j
J_ASSERT (list_empty(hash_list));
}
- kfree(table->hash_table);
+ jbd_kfree(table->hash_table);
kmem_cache_free(revoke_table_cache, table);
journal->j_revoke = NULL;
@@ -293,7 +293,7 @@ void journal_destroy_revoke(journal_t *j
J_ASSERT (list_empty(hash_list));
}
- kfree(table->hash_table);
+ jbd_kfree(table->hash_table);
kmem_cache_free(revoke_table_cache, table);
journal->j_revoke = NULL;
}
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-17 14:32:39.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-17 14:53:15.000000000 -0700
@@ -724,7 +724,8 @@ journal_t * jbd2_journal_init_dev(struct
journal->j_blocksize = blocksize;
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = jbd2_kmalloc(n * sizeof(struct buffer_head*),
+ GFP_KERNEL);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -778,7 +779,8 @@ journal_t * jbd2_journal_init_inode (str
/* journal descriptor can store up to n blocks -bzzz */
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = jbd2_kmalloc(n * sizeof(struct buffer_head*),
+ GFP_KERNEL);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -1158,7 +1160,7 @@ void jbd2_journal_destroy(journal_t *jou
iput(journal->j_inode);
if (journal->j_revoke)
jbd2_journal_destroy_revoke(journal);
- kfree(journal->j_wbuf);
+ jbd2_kfree(journal->j_wbuf);
jbd2_kfree(journal);
}
Index: linux-2.6.23-rc6/fs/jbd2/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/revoke.c 2007-09-17 14:32:34.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/revoke.c 2007-09-17 14:55:35.000000000 -0700
@@ -220,7 +220,7 @@ int jbd2_journal_init_revoke(journal_t *
journal->j_revoke->hash_shift = shift;
journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ jbd2_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
if (!journal->j_revoke->hash_table) {
kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
journal->j_revoke = NULL;
@@ -232,7 +232,7 @@ int jbd2_journal_init_revoke(journal_t *
journal->j_revoke_table[1] = kmem_cache_alloc(jbd2_revoke_table_cache, GFP_KERNEL);
if (!journal->j_revoke_table[1]) {
- kfree(journal->j_revoke_table[0]->hash_table);
+ jbd2_kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
return -ENOMEM;
}
@@ -247,9 +247,9 @@ int jbd2_journal_init_revoke(journal_t *
journal->j_revoke->hash_shift = shift;
journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ jbd2_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
if (!journal->j_revoke->hash_table) {
- kfree(journal->j_revoke_table[0]->hash_table);
+ jbd2_kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[1]);
journal->j_revoke = NULL;
@@ -281,7 +281,7 @@ void jbd2_journal_destroy_revoke(journal
J_ASSERT (list_empty(hash_list));
}
- kfree(table->hash_table);
+ jbd2_kfree(table->hash_table);
kmem_cache_free(jbd2_revoke_table_cache, table);
journal->j_revoke = NULL;
@@ -294,7 +294,7 @@ void jbd2_journal_destroy_revoke(journal
J_ASSERT (list_empty(hash_list));
}
- kfree(table->hash_table);
+ jbd2_kfree(table->hash_table);
kmem_cache_free(jbd2_revoke_table_cache, table);
journal->j_revoke = NULL;
}
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] JBD slab cleanups
2007-09-17 22:57 ` Mingming Cao
@ 2007-09-18 9:04 ` Christoph Hellwig
2007-09-18 16:35 ` Mingming Cao
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Hellwig @ 2007-09-18 9:04 UTC (permalink / raw)
To: Mingming Cao
Cc: Badari Pulavarty, Christoph Hellwig, Christoph Lameter,
linux-fsdevel, ext4 development, lkml
On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
> Here is the incremental small cleanup patch.
>
> Remove kmalloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_kmalloc.
Shouldn't we kill jbd_kmalloc instead?
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] JBD slab cleanups
2007-09-18 9:04 ` Christoph Hellwig
@ 2007-09-18 16:35 ` Mingming Cao
2007-09-18 18:04 ` Dave Kleikamp
0 siblings, 1 reply; 56+ messages in thread
From: Mingming Cao @ 2007-09-18 16:35 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Badari Pulavarty, Christoph Lameter, linux-fsdevel,
ext4 development, lkml
On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
> On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
> > Here is the incremental small cleanup patch.
> >
> > Remove kmalloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_kmalloc.
>
> Shouldn't we kill jbd_kmalloc instead?
>
It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
places to handle memory (de)allocation (< page size) via kmalloc/kfree, so
in the future, if we need to change memory allocation in jbd (e.g. not
using kmalloc or using a different flag), we don't need to touch every
place in the jbd code that calls jbd_kmalloc.
Regards,
Mingming
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] JBD slab cleanups
2007-09-18 16:35 ` Mingming Cao
@ 2007-09-18 18:04 ` Dave Kleikamp
2007-09-19 1:00 ` Mingming Cao
0 siblings, 1 reply; 56+ messages in thread
From: Dave Kleikamp @ 2007-09-18 18:04 UTC (permalink / raw)
To: cmm
Cc: Christoph Hellwig, Badari Pulavarty, Christoph Lameter,
linux-fsdevel, ext4 development, lkml
On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote:
> On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
> > On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
> > > Here is the incremental small cleanup patch.
> > >
> > > Remove kmalloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_kmalloc.
> >
> > Shouldn't we kill jbd_kmalloc instead?
> >
>
> It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
> places to handle memory (de)allocation (< page size) via kmalloc/kfree, so
> in the future, if we need to change memory allocation in jbd (e.g. not
> using kmalloc or using a different flag), we don't need to touch every
> place in the jbd code that calls jbd_kmalloc.
I disagree. Why would jbd need to globally change the way it allocates
memory? It currently uses kmalloc (and jbd_kmalloc) for allocating a
variety of structures. Having to change one particular instance won't
necessarily mean we want to change all of them. Adding unnecessary
wrappers only obfuscates the code making it harder to understand. You
wouldn't want every subsystem to have its own *_kmalloc() that took
different arguments. Besides, there aren't that many calls to kmalloc
and kfree in the jbd code, so there wouldn't be much pain in changing
GFP flags or whatever, if it ever needed to be done.
Shaggy
--
David Kleikamp
IBM Linux Technology Center
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] JBD slab cleanups
2007-09-18 18:04 ` Dave Kleikamp
@ 2007-09-19 1:00 ` Mingming Cao
2007-09-19 2:19 ` Andrew Morton
0 siblings, 1 reply; 56+ messages in thread
From: Mingming Cao @ 2007-09-19 1:00 UTC (permalink / raw)
To: Dave Kleikamp, Andrew Morton
Cc: Christoph Hellwig, Badari Pulavarty, Christoph Lameter,
linux-fsdevel, ext4 development, lkml
On Tue, 2007-09-18 at 13:04 -0500, Dave Kleikamp wrote:
> On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote:
> > On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
> > > On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
> > > > Here is the incremental small cleanup patch.
> > > >
> > > > Remove kmalloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_kmalloc.
> > >
> > > Shouldn't we kill jbd_kmalloc instead?
> > >
> >
> > It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
> > places to handle memory (de)allocation (< page size) via kmalloc/kfree, so
> > in the future, if we need to change memory allocation in jbd (e.g. not
> > using kmalloc or using a different flag), we don't need to touch every
> > place in the jbd code that calls jbd_kmalloc.
>
> I disagree. Why would jbd need to globally change the way it allocates
> memory? It currently uses kmalloc (and jbd_kmalloc) for allocating a
> variety of structures. Having to change one particular instance won't
> necessarily mean we want to change all of them. Adding unnecessary
> wrappers only obfuscates the code making it harder to understand. You
> wouldn't want every subsystem to have its own *_kmalloc() that took
> different arguments. Besides, there aren't that many calls to kmalloc
> and kfree in the jbd code, so there wouldn't be much pain in changing
> GFP flags or whatever, if it ever needed to be done.
>
> Shaggy
Okay, points taken. Here is the updated patch to get rid of slab
management and jbd_kmalloc from jbd entirely. This patch is intended to
replace the patch in the mm tree. Andrew, could you pick up this one
instead?
Thanks,
Mingming
jbd/jbd2: JBD memory allocation cleanups
From: Christoph Lameter <clameter@sgi.com>
JBD: Replace slab allocations with page cache allocations
JBD allocates memory for committed_data and frozen_data from slab. However,
JBD should not pass slab pages down to the block layer. Use page allocator
pages instead. This will also prepare JBD for the large blocksize patchset.

This patch also cleans up jbd_kmalloc and replaces it with kmalloc directly.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/jbd/commit.c | 6 +--
fs/jbd/journal.c | 99 ++------------------------------------------------
fs/jbd/transaction.c | 12 +++---
fs/jbd2/commit.c | 6 +--
fs/jbd2/journal.c | 99 ++------------------------------------------------
fs/jbd2/transaction.c | 18 ++++-----
include/linux/jbd.h | 18 +++++----
include/linux/jbd2.h | 21 +++++-----
8 files changed, 52 insertions(+), 227 deletions(-)
Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-18 17:51:21.000000000 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);
/*
* Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
char *tmp;
jbd_unlock_bh_state(bh_in);
- tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
+ tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
jbd_lock_bh_state(bh_in);
if (jh_in->b_frozen_data) {
- jbd_slab_free(tmp, bh_in->b_size);
+ jbd_free(tmp, bh_in->b_size);
goto repeat;
}
@@ -654,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
- journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+ journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
}
}
- /*
- * Create a slab for this blocksize
- */
- err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
- if (err)
- return err;
-
/* Let the recovery code check whether it needs to recover any
* data from the journal. */
if (journal_recover(journal))
@@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
}
/*
- * Simple support for retrying memory allocations. Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
- return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
- "jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
-};
-
-static void journal_destroy_jbd_slabs(void)
-{
- int i;
-
- for (i = 0; i < JBD_MAX_SLABS; i++) {
- if (jbd_slab[i])
- kmem_cache_destroy(jbd_slab[i]);
- jbd_slab[i] = NULL;
- }
-}
-
-static int journal_create_jbd_slab(size_t slab_size)
-{
- int i = JBD_SLAB_INDEX(slab_size);
-
- BUG_ON(i >= JBD_MAX_SLABS);
-
- /*
- * Check if we already have a slab created for this size
- */
- if (jbd_slab[i])
- return 0;
-
- /*
- * Create a slab and force alignment to be same as slabsize -
- * this will make sure that allocations won't cross the page
- * boundary.
- */
- jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
- slab_size, slab_size, 0, NULL);
- if (!jbd_slab[i]) {
- printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
- return -ENOMEM;
- }
- return 0;
-}
-
-void * jbd_slab_alloc(size_t size, gfp_t flags)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
-}
-
-void jbd_slab_free(void *ptr, size_t size)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- kmem_cache_free(jbd_slab[idx], ptr);
-}
-
-/*
* Journal_head storage management
*/
static struct kmem_cache *journal_head_cache;
@@ -1881,13 +1793,13 @@ static void __journal_remove_journal_hea
printk(KERN_WARNING "%s: freeing "
"b_frozen_data\n",
__FUNCTION__);
- jbd_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd_free(jh->b_frozen_data, bh->b_size);
}
if (jh->b_committed_data) {
printk(KERN_WARNING "%s: freeing "
"b_committed_data\n",
__FUNCTION__);
- jbd_slab_free(jh->b_committed_data, bh->b_size);
+ jbd_free(jh->b_committed_data, bh->b_size);
}
bh->b_private = NULL;
jh->b_bh = NULL; /* debug, really */
@@ -2042,7 +1954,6 @@ static void journal_destroy_caches(void)
journal_destroy_revoke_caches();
journal_destroy_journal_head_cache();
journal_destroy_handle_cache();
- journal_destroy_jbd_slabs();
}
static int __init journal_init(void)
Index: linux-2.6.23-rc6/include/linux/jbd.h
===================================================================
--- linux-2.6.23-rc6.orig/include/linux/jbd.h 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/include/linux/jbd.h 2007-09-18 17:51:21.000000000 -0700
@@ -71,14 +71,16 @@ extern int journal_enable_debug;
#define jbd_debug(f, a...) /**/
#endif
-extern void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-extern void * jbd_slab_alloc(size_t size, gfp_t flags);
-extern void jbd_slab_free(void *ptr, size_t size);
-
-#define jbd_kmalloc(size, flags) \
- __jbd_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
-#define jbd_rep_kmalloc(size, flags) \
- __jbd_kmalloc(__FUNCTION__, (size), (flags), 1)
+
+static inline void *jbd_alloc(size_t size, gfp_t flags)
+{
+ return (void *)__get_free_pages(flags, get_order(size));
+}
+
+static inline void jbd_free(void *ptr, size_t size)
+{
+ free_pages((unsigned long)ptr, get_order(size));
+}
#define JFS_MIN_JOURNAL_BLOCKS 1024
Index: linux-2.6.23-rc6/include/linux/jbd2.h
===================================================================
--- linux-2.6.23-rc6.orig/include/linux/jbd2.h 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/include/linux/jbd2.h 2007-09-18 17:51:21.000000000 -0700
@@ -71,14 +71,15 @@ extern u8 jbd2_journal_enable_debug;
#define jbd_debug(f, a...) /**/
#endif
-extern void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-extern void * jbd2_slab_alloc(size_t size, gfp_t flags);
-extern void jbd2_slab_free(void *ptr, size_t size);
-
-#define jbd_kmalloc(size, flags) \
- __jbd2_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
-#define jbd_rep_kmalloc(size, flags) \
- __jbd2_kmalloc(__FUNCTION__, (size), (flags), 1)
+static inline void *jbd2_alloc(size_t size, gfp_t flags)
+{
+ return (void *)__get_free_pages(flags, get_order(size));
+}
+
+static inline void jbd2_free(void *ptr, size_t size)
+{
+ free_pages((unsigned long)ptr, get_order(size));
+}
#define JBD2_MIN_JOURNAL_BLOCKS 1024
@@ -959,12 +960,12 @@ void jbd2_journal_put_journal_head(struc
*/
extern struct kmem_cache *jbd2_handle_cache;
-static inline handle_t *jbd_alloc_handle(gfp_t gfp_flags)
+static inline handle_t *jbd2_alloc_handle(gfp_t gfp_flags)
{
return kmem_cache_alloc(jbd2_handle_cache, gfp_flags);
}
-static inline void jbd_free_handle(handle_t *handle)
+static inline void jbd2_free_handle(handle_t *handle)
{
kmem_cache_free(jbd2_handle_cache, handle);
}
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-18 17:51:21.000000000 -0700
@@ -84,7 +84,6 @@ EXPORT_SYMBOL(jbd2_journal_force_commit)
static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
static void __journal_abort_soft (journal_t *journal, int errno);
-static int jbd2_journal_create_jbd_slab(size_t slab_size);
/*
* Helper function used to manage commit timeouts
@@ -335,10 +334,10 @@ repeat:
char *tmp;
jbd_unlock_bh_state(bh_in);
- tmp = jbd2_slab_alloc(bh_in->b_size, GFP_NOFS);
+ tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
jbd_lock_bh_state(bh_in);
if (jh_in->b_frozen_data) {
- jbd2_slab_free(tmp, bh_in->b_size);
+ jbd2_free(tmp, bh_in->b_size);
goto repeat;
}
@@ -655,7 +654,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
- journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+ journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -1096,13 +1095,6 @@ int jbd2_journal_load(journal_t *journal
}
}
- /*
- * Create a slab for this blocksize
- */
- err = jbd2_journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
- if (err)
- return err;
-
/* Let the recovery code check whether it needs to recover any
* data from the journal. */
if (jbd2_journal_recover(journal))
@@ -1627,86 +1619,6 @@ size_t journal_tag_bytes(journal_t *jour
}
/*
- * Simple support for retrying memory allocations. Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
- return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
- "jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
-};
-
-static void jbd2_journal_destroy_jbd_slabs(void)
-{
- int i;
-
- for (i = 0; i < JBD_MAX_SLABS; i++) {
- if (jbd_slab[i])
- kmem_cache_destroy(jbd_slab[i]);
- jbd_slab[i] = NULL;
- }
-}
-
-static int jbd2_journal_create_jbd_slab(size_t slab_size)
-{
- int i = JBD_SLAB_INDEX(slab_size);
-
- BUG_ON(i >= JBD_MAX_SLABS);
-
- /*
- * Check if we already have a slab created for this size
- */
- if (jbd_slab[i])
- return 0;
-
- /*
- * Create a slab and force alignment to be same as slabsize -
- * this will make sure that allocations won't cross the page
- * boundary.
- */
- jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
- slab_size, slab_size, 0, NULL);
- if (!jbd_slab[i]) {
- printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
- return -ENOMEM;
- }
- return 0;
-}
-
-void * jbd2_slab_alloc(size_t size, gfp_t flags)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
-}
-
-void jbd2_slab_free(void *ptr, size_t size)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- kmem_cache_free(jbd_slab[idx], ptr);
-}
-
-/*
* Journal_head storage management
*/
static struct kmem_cache *jbd2_journal_head_cache;
@@ -1893,13 +1805,13 @@ static void __journal_remove_journal_hea
printk(KERN_WARNING "%s: freeing "
"b_frozen_data\n",
__FUNCTION__);
- jbd2_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd2_free(jh->b_frozen_data, bh->b_size);
}
if (jh->b_committed_data) {
printk(KERN_WARNING "%s: freeing "
"b_committed_data\n",
__FUNCTION__);
- jbd2_slab_free(jh->b_committed_data, bh->b_size);
+ jbd2_free(jh->b_committed_data, bh->b_size);
}
bh->b_private = NULL;
jh->b_bh = NULL; /* debug, really */
@@ -2040,7 +1952,6 @@ static void jbd2_journal_destroy_caches(
jbd2_journal_destroy_revoke_caches();
jbd2_journal_destroy_jbd2_journal_head_cache();
jbd2_journal_destroy_handle_cache();
- jbd2_journal_destroy_jbd_slabs();
}
static int __init journal_init(void)
Index: linux-2.6.23-rc6/fs/jbd/commit.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/commit.c 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/commit.c 2007-09-18 17:23:26.000000000 -0700
@@ -375,7 +375,7 @@ void journal_commit_transaction(journal_
struct buffer_head *bh = jh2bh(jh);
jbd_lock_bh_state(bh);
- jbd_slab_free(jh->b_committed_data, bh->b_size);
+ jbd_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
jbd_unlock_bh_state(bh);
}
@@ -792,14 +792,14 @@ restart_loop:
* Otherwise, we can just throw away the frozen data now.
*/
if (jh->b_committed_data) {
- jbd_slab_free(jh->b_committed_data, bh->b_size);
+ jbd_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
if (jh->b_frozen_data) {
jh->b_committed_data = jh->b_frozen_data;
jh->b_frozen_data = NULL;
}
} else if (jh->b_frozen_data) {
- jbd_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd_free(jh->b_frozen_data, bh->b_size);
jh->b_frozen_data = NULL;
}
Index: linux-2.6.23-rc6/fs/jbd2/commit.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/commit.c 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/commit.c 2007-09-18 17:23:26.000000000 -0700
@@ -384,7 +384,7 @@ void jbd2_journal_commit_transaction(jou
struct buffer_head *bh = jh2bh(jh);
jbd_lock_bh_state(bh);
- jbd2_slab_free(jh->b_committed_data, bh->b_size);
+ jbd2_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
jbd_unlock_bh_state(bh);
}
@@ -801,14 +801,14 @@ restart_loop:
* Otherwise, we can just throw away the frozen data now.
*/
if (jh->b_committed_data) {
- jbd2_slab_free(jh->b_committed_data, bh->b_size);
+ jbd2_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
if (jh->b_frozen_data) {
jh->b_committed_data = jh->b_frozen_data;
jh->b_frozen_data = NULL;
}
} else if (jh->b_frozen_data) {
- jbd2_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd2_free(jh->b_frozen_data, bh->b_size);
jh->b_frozen_data = NULL;
}
Index: linux-2.6.23-rc6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/transaction.c 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/transaction.c 2007-09-18 17:51:21.000000000 -0700
@@ -96,8 +96,8 @@ static int start_this_handle(journal_t *
alloc_transaction:
if (!journal->j_running_transaction) {
- new_transaction = jbd_kmalloc(sizeof(*new_transaction),
- GFP_NOFS);
+ new_transaction = kmalloc(sizeof(*new_transaction),
+ GFP_NOFS|__GFP_NOFAIL);
if (!new_transaction) {
ret = -ENOMEM;
goto out;
@@ -668,7 +668,7 @@ repeat:
JBUFFER_TRACE(jh, "allocate memory for buffer");
jbd_unlock_bh_state(bh);
frozen_buffer =
- jbd_slab_alloc(jh2bh(jh)->b_size,
+ jbd_alloc(jh2bh(jh)->b_size,
GFP_NOFS);
if (!frozen_buffer) {
printk(KERN_EMERG
@@ -728,7 +728,7 @@ done:
out:
if (unlikely(frozen_buffer)) /* It's usually NULL */
- jbd_slab_free(frozen_buffer, bh->b_size);
+ jbd_free(frozen_buffer, bh->b_size);
JBUFFER_TRACE(jh, "exit");
return error;
@@ -881,7 +881,7 @@ int journal_get_undo_access(handle_t *ha
repeat:
if (!jh->b_committed_data) {
- committed_data = jbd_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+ committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
if (!committed_data) {
printk(KERN_EMERG "%s: No memory for committed data\n",
__FUNCTION__);
@@ -908,7 +908,7 @@ repeat:
out:
journal_put_journal_head(jh);
if (unlikely(committed_data))
- jbd_slab_free(committed_data, bh->b_size);
+ jbd_free(committed_data, bh->b_size);
return err;
}
Index: linux-2.6.23-rc6/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/transaction.c 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/transaction.c 2007-09-18 17:51:21.000000000 -0700
@@ -96,8 +96,8 @@ static int start_this_handle(journal_t *
alloc_transaction:
if (!journal->j_running_transaction) {
- new_transaction = jbd_kmalloc(sizeof(*new_transaction),
- GFP_NOFS);
+ new_transaction = kmalloc(sizeof(*new_transaction),
+ GFP_NOFS|__GFP_NOFAIL);
if (!new_transaction) {
ret = -ENOMEM;
goto out;
@@ -236,7 +236,7 @@ out:
/* Allocate a new handle. This should probably be in a slab... */
static handle_t *new_handle(int nblocks)
{
- handle_t *handle = jbd_alloc_handle(GFP_NOFS);
+ handle_t *handle = jbd2_alloc_handle(GFP_NOFS);
if (!handle)
return NULL;
memset(handle, 0, sizeof(*handle));
@@ -282,7 +282,7 @@ handle_t *jbd2_journal_start(journal_t *
err = start_this_handle(journal, handle);
if (err < 0) {
- jbd_free_handle(handle);
+ jbd2_free_handle(handle);
current->journal_info = NULL;
handle = ERR_PTR(err);
}
@@ -668,7 +668,7 @@ repeat:
JBUFFER_TRACE(jh, "allocate memory for buffer");
jbd_unlock_bh_state(bh);
frozen_buffer =
- jbd2_slab_alloc(jh2bh(jh)->b_size,
+ jbd2_alloc(jh2bh(jh)->b_size,
GFP_NOFS);
if (!frozen_buffer) {
printk(KERN_EMERG
@@ -728,7 +728,7 @@ done:
out:
if (unlikely(frozen_buffer)) /* It's usually NULL */
- jbd2_slab_free(frozen_buffer, bh->b_size);
+ jbd2_free(frozen_buffer, bh->b_size);
JBUFFER_TRACE(jh, "exit");
return error;
@@ -881,7 +881,7 @@ int jbd2_journal_get_undo_access(handle_
repeat:
if (!jh->b_committed_data) {
- committed_data = jbd2_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+ committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
if (!committed_data) {
printk(KERN_EMERG "%s: No memory for committed data\n",
__FUNCTION__);
@@ -908,7 +908,7 @@ repeat:
out:
jbd2_journal_put_journal_head(jh);
if (unlikely(committed_data))
- jbd2_slab_free(committed_data, bh->b_size);
+ jbd2_free(committed_data, bh->b_size);
return err;
}
@@ -1411,7 +1411,7 @@ int jbd2_journal_stop(handle_t *handle)
spin_unlock(&journal->j_state_lock);
}
- jbd_free_handle(handle);
+ jbd2_free_handle(handle);
return err;
}
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] JBD slab cleanups
2007-09-19 1:00 ` Mingming Cao
@ 2007-09-19 2:19 ` Andrew Morton
2007-09-19 19:15 ` Mingming Cao
0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2007-09-19 2:19 UTC (permalink / raw)
To: cmm
Cc: Dave Kleikamp, Christoph Hellwig, Badari Pulavarty,
Christoph Lameter, linux-fsdevel, ext4 development, lkml
On Tue, 18 Sep 2007 18:00:01 -0700 Mingming Cao <cmm@us.ibm.com> wrote:
> JBD: Replace slab allocations with page cache allocations
>
> JBD allocates memory for committed_data and frozen_data from slab. However,
> JBD should not pass slab pages down to the block layer. Use page allocator
> pages instead. This will also prepare JBD for the large blocksize patchset.
>
>
> Also, this patch cleans up jbd_kmalloc and replaces it with kmalloc directly.
__GFP_NOFAIL should only be used when we have no way of recovering
from failure. The allocation in journal_init_common() (at least)
_can_ recover and hence really shouldn't be using __GFP_NOFAIL.
(Actually, nothing in the kernel should be using __GFP_NOFAIL. It is
there as a marker which says "we really shouldn't be doing this but
we don't know how to fix it").
So sometime it'd be good if you could review all the __GFP_NOFAILs in
there and see if we can remove some, thanks.
* Re: [PATCH] JBD slab cleanups
2007-09-19 2:19 ` Andrew Morton
@ 2007-09-19 19:15 ` Mingming Cao
2007-09-19 19:22 ` [PATCH] JBD: use GFP_NOFS in kmalloc Mingming Cao
` (2 more replies)
0 siblings, 3 replies; 56+ messages in thread
From: Mingming Cao @ 2007-09-19 19:15 UTC (permalink / raw)
To: Andrew Morton
Cc: Dave Kleikamp, Christoph Hellwig, Badari Pulavarty,
Christoph Lameter, linux-fsdevel, ext4 development, lkml
On Tue, 2007-09-18 at 19:19 -0700, Andrew Morton wrote:
> On Tue, 18 Sep 2007 18:00:01 -0700 Mingming Cao <cmm@us.ibm.com> wrote:
>
> > JBD: Replace slab allocations with page cache allocations
> >
> > JBD allocates memory for committed_data and frozen_data from slab. However,
> > JBD should not pass slab pages down to the block layer. Use page allocator
> > pages instead. This will also prepare JBD for the large blocksize patchset.
> >
> >
> > Also, this patch cleans up jbd_kmalloc and replaces it with kmalloc directly.
>
> __GFP_NOFAIL should only be used when we have no way of recovering
> from failure. The allocation in journal_init_common() (at least)
> _can_ recover and hence really shouldn't be using __GFP_NOFAIL.
>
> (Actually, nothing in the kernel should be using __GFP_NOFAIL. It is
> there as a marker which says "we really shouldn't be doing this but
> we don't know how to fix it").
>
> So sometime it'd be good if you could review all the __GFP_NOFAILs in
> there and see if we can remove some, thanks.
Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2. In all
cases except one, the caller handles memory allocation failure, so I got
rid of those __GFP_NOFAIL flags.
Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
in jbd/jbd2? I will send a separate patch to cleanup that.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/jbd/journal.c | 2 +-
fs/jbd/transaction.c | 3 +--
fs/jbd2/journal.c | 2 +-
fs/jbd2/transaction.c | 3 +--
4 files changed, 4 insertions(+), 6 deletions(-)
Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:47:58.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:48:40.000000000 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
- journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+ journal = kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/transaction.c 2007-09-19 11:48:05.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/transaction.c 2007-09-19 11:49:10.000000000 -0700
@@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
alloc_transaction:
if (!journal->j_running_transaction) {
- new_transaction = kmalloc(sizeof(*new_transaction),
- GFP_NOFS|__GFP_NOFAIL);
+ new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
if (!new_transaction) {
ret = -ENOMEM;
goto out;
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:48:14.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-19 11:49:46.000000000 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
- journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+ journal = kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/transaction.c 2007-09-19 11:48:08.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/transaction.c 2007-09-19 11:50:12.000000000 -0700
@@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
alloc_transaction:
if (!journal->j_running_transaction) {
- new_transaction = kmalloc(sizeof(*new_transaction),
- GFP_NOFS|__GFP_NOFAIL);
+ new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
if (!new_transaction) {
ret = -ENOMEM;
goto out;
* [PATCH] JBD: use GFP_NOFS in kmalloc
2007-09-19 19:15 ` Mingming Cao
@ 2007-09-19 19:22 ` Mingming Cao
2007-09-19 21:34 ` Andrew Morton
2007-09-20 4:25 ` Andreas Dilger
2007-09-19 19:26 ` [PATCH] JBD slab cleanups Dave Kleikamp
2007-09-19 19:48 ` Andreas Dilger
2 siblings, 2 replies; 56+ messages in thread
From: Mingming Cao @ 2007-09-19 19:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, ext4 development, lkml
Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
with the rest of kmalloc flag used in the JBD/JBD2 layer.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/jbd/journal.c | 6 +++---
fs/jbd/revoke.c | 8 ++++----
fs/jbd2/journal.c | 6 +++---
fs/jbd2/revoke.c | 8 ++++----
4 files changed, 14 insertions(+), 14 deletions(-)
Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:51:10.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.000000000 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
- journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+ journal = kmalloc(sizeof(*journal), GFP_NOFS);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
journal->j_blocksize = blocksize;
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
/* journal descriptor can store up to n blocks -bzzz */
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/revoke.c 2007-09-19 11:52:34.000000000 -0700
@@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
while((tmp >>= 1UL) != 0UL)
shift++;
- journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+ journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
if (!journal->j_revoke_table[0])
return -ENOMEM;
journal->j_revoke = journal->j_revoke_table[0];
@@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
journal->j_revoke->hash_shift = shift;
journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
if (!journal->j_revoke->hash_table) {
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
journal->j_revoke = NULL;
@@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
for (tmp = 0; tmp < hash_size; tmp++)
INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
- journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+ journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
if (!journal->j_revoke_table[1]) {
kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
@@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
journal->j_revoke->hash_shift = shift;
journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
if (!journal->j_revoke->hash_table) {
kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:52:48.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-19 11:53:12.000000000 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
- journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+ journal = kmalloc(sizeof(*journal), GFP_NOFS);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -724,7 +724,7 @@ journal_t * jbd2_journal_init_dev(struct
journal->j_blocksize = blocksize;
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -778,7 +778,7 @@ journal_t * jbd2_journal_init_inode (str
/* journal descriptor can store up to n blocks -bzzz */
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
Index: linux-2.6.23-rc6/fs/jbd2/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/revoke.c 2007-09-19 11:52:53.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/revoke.c 2007-09-19 11:53:32.000000000 -0700
@@ -207,7 +207,7 @@ int jbd2_journal_init_revoke(journal_t *
while((tmp >>= 1UL) != 0UL)
shift++;
- journal->j_revoke_table[0] = kmem_cache_alloc(jbd2_revoke_table_cache, GFP_KERNEL);
+ journal->j_revoke_table[0] = kmem_cache_alloc(jbd2_revoke_table_cache, GFP_NOFS);
if (!journal->j_revoke_table[0])
return -ENOMEM;
journal->j_revoke = journal->j_revoke_table[0];
@@ -220,7 +220,7 @@ int jbd2_journal_init_revoke(journal_t *
journal->j_revoke->hash_shift = shift;
journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
if (!journal->j_revoke->hash_table) {
kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
journal->j_revoke = NULL;
@@ -230,7 +230,7 @@ int jbd2_journal_init_revoke(journal_t *
for (tmp = 0; tmp < hash_size; tmp++)
INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
- journal->j_revoke_table[1] = kmem_cache_alloc(jbd2_revoke_table_cache, GFP_KERNEL);
+ journal->j_revoke_table[1] = kmem_cache_alloc(jbd2_revoke_table_cache, GFP_NOFS);
if (!journal->j_revoke_table[1]) {
kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
@@ -247,7 +247,7 @@ int jbd2_journal_init_revoke(journal_t *
journal->j_revoke->hash_shift = shift;
journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
if (!journal->j_revoke->hash_table) {
kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
* Re: [PATCH] JBD slab cleanups
2007-09-19 19:15 ` Mingming Cao
2007-09-19 19:22 ` [PATCH] JBD: use GFP_NOFS in kmalloc Mingming Cao
@ 2007-09-19 19:26 ` Dave Kleikamp
2007-09-19 19:28 ` Dave Kleikamp
2007-09-19 19:48 ` Andreas Dilger
2 siblings, 1 reply; 56+ messages in thread
From: Dave Kleikamp @ 2007-09-19 19:26 UTC (permalink / raw)
To: cmm
Cc: Andrew Morton, Christoph Hellwig, Badari Pulavarty,
Christoph Lameter, linux-fsdevel, ext4 development, lkml
On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:
> Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2. In all
> cases except one, the caller handles memory allocation failure, so I got
> rid of those __GFP_NOFAIL flags.
>
> Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
> in jbd/jbd2? I will send a separate patch to cleanup that.
No. GFP_NOFS avoids deadlock. It prevents the allocation from making
recursive calls back into the file system that could end up blocking on
jbd code.
Shaggy
--
David Kleikamp
IBM Linux Technology Center
* Re: [PATCH] JBD slab cleanups
2007-09-19 19:26 ` [PATCH] JBD slab cleanups Dave Kleikamp
@ 2007-09-19 19:28 ` Dave Kleikamp
2007-09-19 20:47 ` Mingming Cao
0 siblings, 1 reply; 56+ messages in thread
From: Dave Kleikamp @ 2007-09-19 19:28 UTC (permalink / raw)
To: cmm
Cc: Andrew Morton, Christoph Hellwig, Badari Pulavarty,
Christoph Lameter, linux-fsdevel, ext4 development, lkml
On Wed, 2007-09-19 at 14:26 -0500, Dave Kleikamp wrote:
> On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:
>
> > > Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2. In all
> > > cases except one, the caller handles memory allocation failure, so I got
> > > rid of those __GFP_NOFAIL flags.
> >
> > Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
> > in jbd/jbd2? I will send a separate patch to cleanup that.
>
> No. GFP_NOFS avoids deadlock. It prevents the allocation from making
> recursive calls back into the file system that could end up blocking on
> jbd code.
Oh, I see your patch now. You mean use GFP_NOFS instead of
GFP_KERNEL. :-) OK then.
> Shaggy
--
David Kleikamp
IBM Linux Technology Center
* Re: [PATCH] JBD slab cleanups
2007-09-19 19:15 ` Mingming Cao
2007-09-19 19:22 ` [PATCH] JBD: use GFP_NOFS in kmalloc Mingming Cao
2007-09-19 19:26 ` [PATCH] JBD slab cleanups Dave Kleikamp
@ 2007-09-19 19:48 ` Andreas Dilger
2007-09-19 22:03 ` Mingming Cao
2 siblings, 1 reply; 56+ messages in thread
From: Andreas Dilger @ 2007-09-19 19:48 UTC (permalink / raw)
To: Mingming Cao
Cc: Andrew Morton, Dave Kleikamp, Christoph Hellwig, Badari Pulavarty,
Christoph Lameter, linux-fsdevel, ext4 development, lkml,
Stephen C. Tweedie
On Sep 19, 2007 12:15 -0700, Mingming Cao wrote:
> @@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
>
> alloc_transaction:
> if (!journal->j_running_transaction) {
> - new_transaction = kmalloc(sizeof(*new_transaction),
> - GFP_NOFS|__GFP_NOFAIL);
> + new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
This should probably be a __GFP_NOFAIL if we are trying to start a new
handle in truncate, as there is no way to propagate an error to the caller.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
* Re: [PATCH] JBD slab cleanups
2007-09-19 19:28 ` Dave Kleikamp
@ 2007-09-19 20:47 ` Mingming Cao
0 siblings, 0 replies; 56+ messages in thread
From: Mingming Cao @ 2007-09-19 20:47 UTC (permalink / raw)
To: Dave Kleikamp
Cc: Andrew Morton, Christoph Hellwig, Badari Pulavarty,
Christoph Lameter, linux-fsdevel, ext4 development, lkml
On Wed, 2007-09-19 at 19:28 +0000, Dave Kleikamp wrote:
> On Wed, 2007-09-19 at 14:26 -0500, Dave Kleikamp wrote:
> > On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:
> >
> > > Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2. In all
> > > cases except one, the caller handles memory allocation failure, so I got
> > > rid of those __GFP_NOFAIL flags.
> > >
> > > Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
> > > in jbd/jbd2? I will send a separate patch to cleanup that.
> >
> > No. GFP_NOFS avoids deadlock. It prevents the allocation from making
> > recursive calls back into the file system that could end up blocking on
> > jbd code.
>
> Oh, I see your patch now. You mean use GFP_NOFS instead of
> GFP_KERNEL. :-) OK then.
>
Oops, I did mean what you say here. :-)
> > Shaggy
* Re: [PATCH] JBD: use GFP_NOFS in kmalloc
2007-09-19 19:22 ` [PATCH] JBD: use GFP_NOFS in kmalloc Mingming Cao
@ 2007-09-19 21:34 ` Andrew Morton
2007-09-19 21:55 ` Mingming Cao
2007-09-20 4:25 ` Andreas Dilger
1 sibling, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2007-09-19 21:34 UTC (permalink / raw)
To: cmm; +Cc: linux-fsdevel, ext4 development, lkml
On Wed, 19 Sep 2007 12:22:09 -0700
Mingming Cao <cmm@us.ibm.com> wrote:
> Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
> with the rest of kmalloc flag used in the JBD/JBD2 layer.
>
> Signed-off-by: Mingming Cao <cmm@us.ibm.com>
>
> ---
> fs/jbd/journal.c | 6 +++---
> fs/jbd/revoke.c | 8 ++++----
> fs/jbd2/journal.c | 6 +++---
> fs/jbd2/revoke.c | 8 ++++----
> 4 files changed, 14 insertions(+), 14 deletions(-)
>
> Index: linux-2.6.23-rc6/fs/jbd/journal.c
> ===================================================================
> --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:51:10.000000000 -0700
> +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.000000000 -0700
> @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
> journal_t *journal;
> int err;
>
> - journal = kmalloc(sizeof(*journal), GFP_KERNEL);
> + journal = kmalloc(sizeof(*journal), GFP_NOFS);
> if (!journal)
> goto fail;
> memset(journal, 0, sizeof(*journal));
> @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
> journal->j_blocksize = blocksize;
> n = journal->j_blocksize / sizeof(journal_block_tag_t);
> journal->j_wbufsize = n;
> - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> if (!journal->j_wbuf) {
> printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> __FUNCTION__);
> @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
> /* journal descriptor can store up to n blocks -bzzz */
> n = journal->j_blocksize / sizeof(journal_block_tag_t);
> journal->j_wbufsize = n;
> - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> if (!journal->j_wbuf) {
> printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> __FUNCTION__);
> Index: linux-2.6.23-rc6/fs/jbd/revoke.c
> ===================================================================
> --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.000000000 -0700
> +++ linux-2.6.23-rc6/fs/jbd/revoke.c 2007-09-19 11:52:34.000000000 -0700
> @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
> while((tmp >>= 1UL) != 0UL)
> shift++;
>
> - journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> + journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
> if (!journal->j_revoke_table[0])
> return -ENOMEM;
> journal->j_revoke = journal->j_revoke_table[0];
> @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
> journal->j_revoke->hash_shift = shift;
>
> journal->j_revoke->hash_table =
> - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> if (!journal->j_revoke->hash_table) {
> kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> journal->j_revoke = NULL;
> @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
> for (tmp = 0; tmp < hash_size; tmp++)
> INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
>
> - journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> + journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
> if (!journal->j_revoke_table[1]) {
> kfree(journal->j_revoke_table[0]->hash_table);
> kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
> journal->j_revoke->hash_shift = shift;
>
> journal->j_revoke->hash_table =
> - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> if (!journal->j_revoke->hash_table) {
> kfree(journal->j_revoke_table[0]->hash_table);
> kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
These were all OK using GFP_KERNEL.
GFP_NOFS should only be used when the caller is holding some fs locks which
might cause a deadlock if that caller reentered the fs in ->writepage (and
maybe put_inode and such). That isn't the case in any of the above code,
which is all mount time stuff (I think).
ext3/4 should be using GFP_NOFS when the caller has a transaction open, has
a page locked, is holding i_mutex, etc.
* Re: [PATCH] JBD: use GFP_NOFS in kmalloc
2007-09-19 21:34 ` Andrew Morton
@ 2007-09-19 21:55 ` Mingming Cao
0 siblings, 0 replies; 56+ messages in thread
From: Mingming Cao @ 2007-09-19 21:55 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, ext4 development, lkml
On Wed, 2007-09-19 at 14:34 -0700, Andrew Morton wrote:
> On Wed, 19 Sep 2007 12:22:09 -0700
> Mingming Cao <cmm@us.ibm.com> wrote:
>
> > Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
> > with the rest of the kmalloc flags used in the JBD/JBD2 layer.
> >
> > Signed-off-by: Mingming Cao <cmm@us.ibm.com>
> >
> > ---
> > fs/jbd/journal.c | 6 +++---
> > fs/jbd/revoke.c | 8 ++++----
> > fs/jbd2/journal.c | 6 +++---
> > fs/jbd2/revoke.c | 8 ++++----
> > 4 files changed, 14 insertions(+), 14 deletions(-)
> >
> > Index: linux-2.6.23-rc6/fs/jbd/journal.c
> > ===================================================================
> > --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:51:10.000000000 -0700
> > +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.000000000 -0700
> > @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
> > journal_t *journal;
> > int err;
> >
> > - journal = kmalloc(sizeof(*journal), GFP_KERNEL);
> > + journal = kmalloc(sizeof(*journal), GFP_NOFS);
> > if (!journal)
> > goto fail;
> > memset(journal, 0, sizeof(*journal));
> > @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
> > journal->j_blocksize = blocksize;
> > n = journal->j_blocksize / sizeof(journal_block_tag_t);
> > journal->j_wbufsize = n;
> > - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> > + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> > if (!journal->j_wbuf) {
> > printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> > __FUNCTION__);
> > @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
> > /* journal descriptor can store up to n blocks -bzzz */
> > n = journal->j_blocksize / sizeof(journal_block_tag_t);
> > journal->j_wbufsize = n;
> > - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> > + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> > if (!journal->j_wbuf) {
> > printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> > __FUNCTION__);
> > Index: linux-2.6.23-rc6/fs/jbd/revoke.c
> > ===================================================================
> > --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.000000000 -0700
> > +++ linux-2.6.23-rc6/fs/jbd/revoke.c 2007-09-19 11:52:34.000000000 -0700
> > @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
> > while((tmp >>= 1UL) != 0UL)
> > shift++;
> >
> > - journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> > + journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
> > if (!journal->j_revoke_table[0])
> > return -ENOMEM;
> > journal->j_revoke = journal->j_revoke_table[0];
> > @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
> > journal->j_revoke->hash_shift = shift;
> >
> > journal->j_revoke->hash_table =
> > - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> > + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> > if (!journal->j_revoke->hash_table) {
> > kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> > journal->j_revoke = NULL;
> > @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
> > for (tmp = 0; tmp < hash_size; tmp++)
> > INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
> >
> > - journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> > + journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
> > if (!journal->j_revoke_table[1]) {
> > kfree(journal->j_revoke_table[0]->hash_table);
> > kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> > @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
> > journal->j_revoke->hash_shift = shift;
> >
> > journal->j_revoke->hash_table =
> > - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> > + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> > if (!journal->j_revoke->hash_table) {
> > kfree(journal->j_revoke_table[0]->hash_table);
> > kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
>
> These were all OK using GFP_KERNEL.
>
> GFP_NOFS should only be used when the caller is holding some fs locks which
> might cause a deadlock if that caller reentered the fs in ->writepage (and
> maybe put_inode and such). That isn't the case in any of the above code,
> which is all mount time stuff (I think).
>
You are right, they all occur at initialization time.
> ext3/4 should be using GFP_NOFS when the caller has a transaction open, has
> a page locked, is holding i_mutex, etc.
>
Thanks for your feedback.
Mingming
* Re: [PATCH] JBD slab cleanups
2007-09-19 19:48 ` Andreas Dilger
@ 2007-09-19 22:03 ` Mingming Cao
0 siblings, 0 replies; 56+ messages in thread
From: Mingming Cao @ 2007-09-19 22:03 UTC (permalink / raw)
To: Andreas Dilger
Cc: Andrew Morton, Dave Kleikamp, Christoph Hellwig, Badari Pulavarty,
Christoph Lameter, linux-fsdevel, ext4 development, lkml,
Stephen C. Tweedie
On Wed, 2007-09-19 at 13:48 -0600, Andreas Dilger wrote:
> On Sep 19, 2007 12:15 -0700, Mingming Cao wrote:
> > @@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
> >
> > alloc_transaction:
> > if (!journal->j_running_transaction) {
> > - new_transaction = kmalloc(sizeof(*new_transaction),
> > - GFP_NOFS|__GFP_NOFAIL);
> > + new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
>
> This should probably be a __GFP_NOFAIL if we are trying to start a new
> handle in truncate, as there is no way to propagate an error to the caller.
>
Thanks, updated version.
Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2; in most
cases it is not needed.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/jbd/journal.c | 2 +-
fs/jbd2/journal.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:47:58.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 14:23:45.000000000 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
- journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+ journal = kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:48:14.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-19 14:23:45.000000000 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
- journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+ journal = kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
* Re: [PATCH] JBD: use GFP_NOFS in kmalloc
2007-09-19 19:22 ` [PATCH] JBD: use GFP_NOFS in kmalloc Mingming Cao
2007-09-19 21:34 ` Andrew Morton
@ 2007-09-20 4:25 ` Andreas Dilger
1 sibling, 0 replies; 56+ messages in thread
From: Andreas Dilger @ 2007-09-20 4:25 UTC (permalink / raw)
To: Mingming Cao; +Cc: Andrew Morton, linux-fsdevel, ext4 development, lkml
On Sep 19, 2007 12:22 -0700, Mingming Cao wrote:
> Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
> with the rest of the kmalloc flags used in the JBD/JBD2 layer.
>
> @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
> - journal = kmalloc(sizeof(*journal), GFP_KERNEL);
> + journal = kmalloc(sizeof(*journal), GFP_NOFS);
> @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
> - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
> - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
Is there a reason for this change except "it's in a filesystem, so it
should be GFP_NOFS"? We are only doing journal setup during mount so
there shouldn't be any problem using GFP_KERNEL. I don't think it will
inject any defect into the code, but I don't think it is needed either.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Thread overview: 56+ messages
[not found] <20070828190551.415127746@sgi.com>
[not found] ` <20070828190735.292638294@sgi.com>
2007-08-30 0:11 ` [31/36] Large Blocksize: Core piece Mingming Cao
2007-08-30 0:12 ` Christoph Lameter
2007-08-30 0:47 ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
2007-08-30 0:59 ` Christoph Lameter
2007-09-01 0:01 ` Mingming Cao
2007-09-01 0:12 ` [RFC 1/2] JBD: slab management support for large block(>8k) Mingming Cao
2007-09-01 18:39 ` Christoph Hellwig
2007-09-02 11:40 ` Christoph Lameter
2007-09-02 15:28 ` Christoph Hellwig
2007-09-03 7:55 ` Christoph Lameter
2007-09-03 13:40 ` Christoph Hellwig
2007-09-03 19:31 ` Christoph Lameter
2007-09-03 19:33 ` Christoph Hellwig
2007-09-14 18:53 ` [PATCH] JBD slab cleanups Mingming Cao
2007-09-14 18:58 ` Christoph Lameter
2007-09-17 19:29 ` Mingming Cao
2007-09-17 19:34 ` Christoph Hellwig
2007-09-17 22:01 ` Badari Pulavarty
2007-09-17 22:57 ` Mingming Cao
2007-09-18 9:04 ` Christoph Hellwig
2007-09-18 16:35 ` Mingming Cao
2007-09-18 18:04 ` Dave Kleikamp
2007-09-19 1:00 ` Mingming Cao
2007-09-19 2:19 ` Andrew Morton
2007-09-19 19:15 ` Mingming Cao
2007-09-19 19:22 ` [PATCH] JBD: use GFP_NOFS in kmalloc Mingming Cao
2007-09-19 21:34 ` Andrew Morton
2007-09-19 21:55 ` Mingming Cao
2007-09-20 4:25 ` Andreas Dilger
2007-09-19 19:26 ` [PATCH] JBD slab cleanups Dave Kleikamp
2007-09-19 19:28 ` Dave Kleikamp
2007-09-19 20:47 ` Mingming Cao
2007-09-19 19:48 ` Andreas Dilger
2007-09-19 22:03 ` Mingming Cao
2007-09-01 0:12 ` [RFC 2/2] JBD: blocks reservation fix for large block support Mingming Cao
2007-08-30 0:47 ` [RFC 2/4]ext2: fix rec_len overflow with 64KB block size Mingming Cao
[not found] ` <20070828190730.220393749@sgi.com>
2007-08-30 9:20 ` [11/36] Use page_cache_xxx in fs/buffer.c Dmitry Monakhov
2007-08-30 18:14 ` Christoph Lameter
2007-08-31 1:47 ` Christoph Lameter
2007-08-31 6:56 ` Jens Axboe
2007-08-31 7:03 ` Christoph Lameter
2007-08-31 7:11 ` Jens Axboe
2007-08-31 7:17 ` Christoph Lameter
2007-08-31 7:26 ` Jens Axboe
2007-08-31 7:33 ` Christoph Lameter
2007-08-31 7:43 ` Jens Axboe
2007-08-31 7:52 ` Christoph Lameter
2007-08-31 8:12 ` Jens Axboe
2007-08-31 15:22 ` Christoph Lameter
2007-08-31 16:35 ` Jörn Engel
2007-08-31 19:00 ` Jens Axboe
2007-08-31 8:36 ` Dmitry Monakhov
2007-08-31 15:28 ` Christoph Lameter
[not found] ` <20070828192034.GA13883@lst.de>
[not found] ` <Pine.LNX.4.64.0708281254070.16473@schroedinger.engr.sgi.com>
2007-09-01 1:11 ` [00/36] Large Blocksize Support V6 Christoph Lameter
2007-09-01 19:17 ` Peter Zijlstra
2007-09-02 11:44 ` Christoph Lameter