All of lore.kernel.org
 help / color / mirror / Atom feed
From: Maxim Levitsky <maximlevitsky@gmail.com>
To: clameter@sgi.com
Cc: linux-kernel@vger.kernel.org, Mel Gorman <mel@skynet.ie>,
	William Lee Irwin III <wli@holomorphy.com>,
	David Chinner <dgc@sgi.com>, Jens Axboe <jens.axboe@oracle.com>,
	Badari Pulavarty <pbadari@gmail.com>
Subject: Re: [00/17] Large Blocksize Support V3
Date: Thu, 26 Apr 2007 21:50:17 +0300	[thread overview]
Message-ID: <200704262150.18748.maximlevitsky@gmail.com> (raw)
In-Reply-To: <20070424222105.883597089@sgi.com>

On Wednesday 25 April 2007 01:21, clameter@sgi.com wrote:
> V2->V3
> - More restructuring
> - It actually works!
> - Add XFS support
> - Fix up UP support
> - Work out the direct I/O issues
> - Add CONFIG_LARGE_BLOCKSIZE. Off by default which makes the inlines revert
>   back to constants. Disabled for 32bit and HIGHMEM configurations.
>   This also allows a gradual migration to the new page cache
>   inline functions. LARGE_BLOCKSIZE capabilities can be
>   added gradually and if there is a problem then we can disable
>   a subsystem.
>
> V1->V2
> - Some ext2 support
> - Some block layer, fs layer support etc.
> - Better page cache macros
> - Use macros to clean up code.
>
> This patchset modifies the Linux kernel so that larger block sizes than
> page size can be supported. Larger block sizes are handled by using
> compound pages of an arbitrary order for the page cache instead of
> single pages with order 0.
>
> Rationales:
>
> 1. We have problems supporting devices with a higher blocksize than
>    page size. This is for example important to support CD and DVDs that
>    can only read and write 32k or 64k blocks. We currently have a shim
>    layer in there to deal with this situation which limits the speed
>    of I/O. The developers are currently looking for ways to completely
>    bypass the page cache because of this deficiency.
>
> 2. 32/64k blocksize is also used in flash devices. Same issues.
>
> 3. Future harddisks will support bigger block sizes that Linux cannot
>    support since we are limited to PAGE_SIZE. Ok the on board cache
>    may buffer this for us but what is the point of handling smaller
>    page sizes than what the drive supports?
>
> 4. Reduce fsck times. Larger block sizes mean faster file system checking.
>
> 5. Performance. If we look at IA64 vs. x86_64 then it seems that the
>    faster interrupt handling on x86_64 compensate for the speed loss due to
>    a smaller page size (4k vs 16k on IA64). Supporting larger block sizes
>    sizes on all allows a significant reduction in I/O overhead and
> increases the size of I/O that can be performed by hardware in a single
> request since the number of scatter gather entries are typically limited
> for one request. This is going to become increasingly important to support
> the ever growing memory sizes since we may have to handle excessively large
> amounts of 4k requests for data sizes that may become common soon. For
> example to write a 1 terabyte file the kernel would have to handle 256
> million 4k chunks.
>
> 6. Cross arch compatibility: It is currently not possible to mount
>    an 16k blocksize ext2 filesystem created on IA64 on an x86_64 system.
>    With this patch this becoems possible.
>
> The support here is currently only for buffered I/O. Modifications for
> three filesystems are included:
>
> A. XFS
> B. Ext2
> C. ramfs
>
> Unsupported
> - Mmapping blocks larger than page size
>
> Issues:
> - There are numerous places where the kernel can no longer assume that the
>   page cache consists of PAGE_SIZE pages that have not been fixed yet.
> - Defrag warning: The patch set can fragment memory very fast.
>   It is likely that Mel Gorman's anti-frag patches and some more
>   work by him on defragmentation may be needed if one wants to use
>   super sized pages.
>   If you run a 2.6.21 kernel with this patch and start a kernel compile
>   on a 4k volume with a concurrent copy operation to a 64k volume on
>   a system with only 1 Gig then you will go boom (ummm no ... OOM) fast.
>   How well Mel's antifrag/defrag methods address this issue still has to
>   be seen.
>
> Future:
> - Mmap support could be done in a way that makes the mmap page size
>   independent from the page cache order. It is okay to map a 4k section
>   of a larger page cache page via a pte. 4k mmap semantics can be
> completely preserved even for larger page sizes.
> - Maybe people could perform benchmarks to see how much of a difference
>   there is between 4k size I/O and 64k? Andrew surely would like to know.
> - If there is a chance for inclusion then I will diff this against mm,
>   do a complete scan over the kernel to find all page cache == PAGE_SIZE
>   assumptions and then try to get it upstream for 2.6.23.
>
> How to make this work:
>
> 1. Apply this patchset to 2.6.21-rc7
> 2. Configure LARGE_BLOCKSIZE Support
> 3. compile kernel
>
> --

Hi,
	I really like the idea of block size > page size

I just want to suggest mine ideas about how to implement it (I can't do that 
since it is too compicated for me and I don't have time now)


I visualized this like that:

For size <= 4K , page still holds a 1 or more blocks.

For size > 4K:

A page holds a fraction of block, and its ->buffer_head also holds info about 
that fraction.

But that ->bh contains a ->next pointer (or a list_head) that combines 
all fractions of block in single one.



This will minimize changes in fs code.

Today fs blindly uses bh retured by block code.
A modifed filesystem will also read from linked ->next bhs to get all parts of 
block. 

For blocksizes <= 4k that ->next will be null indicating that that bh 
contatins whole block. 

Then inplementation of block address_space opertions should not change a lot 
too.

They will be just aware that that page can be linked with other pages of same 
block and do that right thing. 

For example:

->readpage will not only read _that_ page but also will read all sibling pages
->writepage will magicly write not only that page but siblings too.


buffer_head alredy has pointer to page so it is easy to get page from buffer 
head:

page -> private -> next -> page -> private ...

andf so on (sorry, but I did a study (for myself, trying to improve 
packet-writing) on that a half year ago, so I don't remember now much about 
that)


What about that ?


Best regards,
	Maxim Levitsky

  parent reply	other threads:[~2007-04-26 18:51 UTC|newest]

Thread overview: 235+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
2007-04-24 22:21 ` [01/17] Remove open coded implementation of memclear_highpage flush clameter
2007-04-24 22:21 ` [02/17] Fix page allocation flags in grow_dev_page() clameter
2007-04-24 22:21 ` [03/17] Fix: find_or_create_page does not spread memory clameter
2007-04-24 22:21 ` [04/17] Free up page->private for compound pages clameter
2007-04-24 22:21 ` [05/17] More compound page features clameter
2007-04-24 22:21 ` [06/17] Fix up handling of Compound head pages clameter
2007-04-24 22:21 ` [07/17] vmstat.c: Support accounting for compound pages clameter
2007-04-24 22:21 ` [08/17] Define functions for page cache handling clameter
2007-04-24 23:00   ` Eric Dumazet
2007-04-25  6:27     ` Christoph Lameter
2007-04-24 22:21 ` [09/17] Convert PAGE_CACHE_xxx -> page_cache_xxx function calls clameter
2007-04-24 22:21 ` [10/17] Variable Order Page Cache: Add clearing and flushing function clameter
2007-04-26  7:02   ` Christoph Lameter
2007-04-26  8:14     ` David Chinner
2007-04-24 22:21 ` [11/17] Readahead support for the variable order page cache clameter
2007-04-24 22:21 ` [12/17] Variable Page Cache Size: Fix up reclaim counters clameter
2007-04-24 22:21 ` [13/17] set_blocksize: Allow to set a larger block size than PAGE_SIZE clameter
2007-04-24 22:21 ` [14/17] Add VM_BUG_ONs to check for correct page order clameter
2007-04-24 22:21 ` [15/17] ramfs: Variable order page cache support clameter
2007-04-24 22:21 ` [16/17] ext2: " clameter
2007-04-24 22:21 ` [17/17] xfs: " clameter
2007-04-25  0:46 ` [00/17] Large Blocksize Support V3 Jörn Engel
2007-04-25  0:47 ` H. Peter Anvin
2007-04-25  3:11 ` William Lee Irwin III
2007-04-25 11:35 ` Jens Axboe
2007-04-25 15:36   ` Christoph Lameter
2007-04-25 17:53     ` Jens Axboe
2007-04-25 18:03       ` Christoph Lameter
2007-04-25 18:05         ` Jens Axboe
2007-04-25 18:14           ` Christoph Lameter
2007-04-25 18:16             ` Jens Axboe
2007-04-25 13:28 ` Mel Gorman
2007-04-25 15:23   ` Christoph Lameter
2007-04-25 22:46 ` Badari Pulavarty
2007-04-26  1:14   ` David Chinner
2007-04-26  1:17     ` David Chinner
2007-04-26  4:51 ` Eric W. Biederman
2007-04-26  5:05   ` Christoph Lameter
2007-04-26  5:44     ` Eric W. Biederman
2007-04-26  6:37       ` Christoph Lameter
2007-04-26  9:16         ` Mel Gorman
2007-04-26  6:38       ` Nick Piggin
2007-04-26  6:46         ` Christoph Lameter
2007-04-26  6:57           ` Nick Piggin
2007-04-26  7:10             ` Christoph Lameter
2007-04-26  7:22               ` Nick Piggin
2007-04-26  7:34                 ` Christoph Lameter
2007-04-26  7:48                   ` Nick Piggin
2007-04-26  9:20                     ` David Chinner
2007-04-26 13:53                       ` Avi Kivity
2007-04-26 14:33                         ` David Chinner
2007-04-26 14:56                           ` Avi Kivity
2007-04-26 15:20                       ` Nick Piggin
2007-04-26 17:42                         ` Jens Axboe
2007-04-26 18:59                           ` Eric W. Biederman
2007-04-26 16:07                     ` Christoph Hellwig
2007-04-27 10:05                       ` Nick Piggin
2007-04-27 13:06                         ` Mel Gorman
2007-04-26 13:50                   ` William Lee Irwin III
2007-04-26 18:09                     ` Eric W. Biederman
2007-04-26 23:34                       ` William Lee Irwin III
2007-04-26  7:48                 ` Questions on printk and console_drivers gshan
2007-04-26 10:06           ` [00/17] Large Blocksize Support V3 Mel Gorman
2007-04-26 14:47             ` Nick Piggin
2007-04-26 15:58         ` Christoph Hellwig
2007-04-26 16:05           ` Jens Axboe
2007-04-26 16:16             ` Christoph Hellwig
2007-04-26 13:28       ` Alan Cox
2007-04-26 13:30         ` Jens Axboe
2007-04-29 14:12         ` Matt Mackall
2007-04-28 10:55       ` Pierre Ossman
2007-04-28 15:39         ` Eric W. Biederman
2007-04-26  5:37   ` Nick Piggin
2007-04-26  6:38     ` David Chinner
2007-04-26  6:50       ` Nick Piggin
2007-04-26  8:40         ` Mel Gorman
2007-04-26  8:55           ` Nick Piggin
2007-04-26 10:30             ` Mel Gorman
2007-04-26 10:54               ` Eric W. Biederman
2007-04-26 12:23                 ` Mel Gorman
2007-04-26 17:58                 ` Christoph Lameter
2007-04-26 18:02                   ` Jens Axboe
2007-04-26 16:11         ` Christoph Hellwig
2007-04-26 17:49           ` Eric W. Biederman
2007-04-26 18:03             ` Christoph Lameter
2007-04-26 18:03               ` Jens Axboe
2007-04-26 18:09                 ` Christoph Hellwig
2007-04-26 18:12                   ` Jens Axboe
2007-04-26 18:24                     ` Christoph Hellwig
2007-04-26 18:24                       ` Jens Axboe
2007-04-26 18:28                     ` Christoph Lameter
2007-04-26 18:29                       ` Jens Axboe
2007-04-26 18:35                         ` Christoph Lameter
2007-04-26 18:39                           ` Jens Axboe
2007-04-26 19:35                             ` Eric W. Biederman
2007-04-26 19:42                               ` Jens Axboe
2007-04-27  4:05                                 ` Eric W. Biederman
2007-04-27 10:26                                   ` Nick Piggin
2007-04-27 13:51                                     ` Eric W. Biederman
2007-04-26 20:22                             ` Mel Gorman
2007-04-27  0:21                               ` William Lee Irwin III
2007-04-27  5:16                               ` Jens Axboe
2007-04-27 10:38           ` Nick Piggin
2007-04-26 10:10       ` Eric W. Biederman
2007-04-26 13:50         ` David Chinner
2007-04-26 14:40           ` William Lee Irwin III
2007-04-26 15:38           ` Nick Piggin
2007-04-26 15:58             ` William Lee Irwin III
2007-04-27  9:46               ` Nick Piggin
2007-04-27  0:19           ` Jeremy Higdon
2007-04-26 18:07         ` Christoph Lameter
2007-04-26 18:45           ` Eric W. Biederman
2007-04-26 18:59             ` Christoph Lameter
2007-04-26 19:21               ` Eric W. Biederman
2007-04-26  6:40     ` Christoph Lameter
2007-04-26  6:53       ` Nick Piggin
2007-04-26  7:04         ` David Chinner
2007-04-26  7:07           ` Nick Piggin
2007-04-26  7:11             ` Christoph Lameter
2007-04-26  7:17               ` Nick Piggin
2007-04-26  7:28                 ` Christoph Lameter
2007-04-26  7:45                   ` Nick Piggin
2007-04-26 18:10                     ` Christoph Lameter
2007-04-27 10:08                       ` Nick Piggin
2007-04-26  7:07         ` Christoph Lameter
2007-04-26  7:15           ` Nick Piggin
2007-04-26  7:22             ` Christoph Lameter
2007-04-26  7:42               ` Nick Piggin
2007-04-26 10:48                 ` Mel Gorman
2007-04-26 12:37                 ` Andy Whitcroft
2007-04-26 14:18                   ` David Chinner
2007-04-26 15:08                   ` Nick Piggin
2007-04-26 15:19                     ` William Lee Irwin III
2007-04-26 15:28                     ` David Chinner
2007-04-26 14:53                 ` William Lee Irwin III
2007-04-26 18:16                   ` Christoph Lameter
2007-04-26 18:21                   ` Eric W. Biederman
2007-04-27  0:32                     ` William Lee Irwin III
2007-04-27 10:22                       ` Nick Piggin
2007-04-27 12:58                         ` William Lee Irwin III
2007-04-27 13:06                           ` Nick Piggin
2007-04-27 14:49                             ` William Lee Irwin III
2007-04-26 18:13                 ` Christoph Lameter
2007-04-27 10:15                   ` Nick Piggin
2007-04-26 14:49               ` William Lee Irwin III
2007-04-26 18:50 ` Maxim Levitsky [this message]
2007-04-27  2:04 ` Andrew Morton
2007-04-27  2:27   ` David Chinner
2007-04-27  2:53     ` Andrew Morton
2007-04-27  3:47       ` [00/17] Large Blocksize Support V3 (mmap conceptual discussion) Christoph Lameter
2007-04-27  4:20       ` [00/17] Large Blocksize Support V3 David Chinner
2007-04-27  5:15         ` Andrew Morton
2007-04-27  5:49           ` Christoph Lameter
2007-04-27  6:55             ` Andrew Morton
2007-04-27  7:19               ` Christoph Lameter
2007-04-27  7:26                 ` Andrew Morton
2007-04-27  8:37                   ` David Chinner
2007-04-27 12:01                   ` Christoph Lameter
2007-04-27 16:36                   ` David Chinner
2007-04-27 17:34                     ` David Chinner
2007-04-27 19:11                       ` Andrew Morton
2007-04-28  1:43                         ` Nick Piggin
2007-04-28  8:04                           ` Peter Zijlstra
2007-04-28  8:22                             ` Andrew Morton
2007-04-28  8:32                               ` Peter Zijlstra
2007-04-28  8:55                                 ` Andrew Morton
2007-04-28  9:36                                   ` Peter Zijlstra
2007-04-28 14:09                               ` William Lee Irwin III
2007-04-28 18:26                                 ` Andrew Morton
2007-04-28 19:19                                   ` William Lee Irwin III
2007-04-28 21:28                                     ` Andrew Morton
2007-04-28  3:17                         ` David Chinner
2007-04-28  3:49                           ` Christoph Lameter
2007-04-28  4:56                           ` Andrew Morton
2007-04-28  5:08                             ` Christoph Lameter
2007-04-28  5:36                               ` Andrew Morton
2007-04-28  6:24                                 ` Christoph Lameter
2007-04-28  6:52                                   ` Andrew Morton
2007-04-30  5:30                                     ` Christoph Lameter
2007-04-28  9:43                             ` Alan Cox
2007-04-28  9:58                               ` Andrew Morton
2007-04-28 10:21                                 ` Alan Cox
2007-04-28 10:25                                   ` Andrew Morton
2007-04-28 11:29                                     ` Alan Cox
2007-04-28 14:37                                       ` William Lee Irwin III
2007-04-27  7:22               ` Christoph Lameter
2007-04-27  7:29                 ` Andrew Morton
2007-04-27  7:35                   ` Christoph Lameter
2007-04-27  7:43                     ` Andrew Morton
2007-04-27 11:05               ` Paul Mackerras
2007-04-27 11:41                 ` Nick Piggin
2007-04-27 12:12                   ` Christoph Lameter
2007-04-27 12:25                     ` Nick Piggin
2007-04-27 13:39                       ` Christoph Hellwig
2007-04-28  2:27                         ` Nick Piggin
2007-04-28  2:39                           ` William Lee Irwin III
2007-04-28  2:50                             ` Nick Piggin
2007-04-28  3:16                               ` William Lee Irwin III
2007-04-28  8:16                           ` Christoph Hellwig
2007-04-27 16:48                       ` Christoph Lameter
2007-04-27 13:37                     ` Christoph Hellwig
2007-04-27 12:14                   ` Paul Mackerras
2007-04-27 12:36                     ` Nick Piggin
2007-04-27 13:42                     ` Christoph Hellwig
2007-04-27 11:58                 ` Christoph Lameter
2007-04-27 13:44               ` William Lee Irwin III
2007-04-27 19:15                 ` Andrew Morton
2007-04-28  2:21                   ` William Lee Irwin III
2007-04-27  6:09           ` David Chinner
2007-04-27  7:04             ` Andrew Morton
2007-04-27  8:03               ` David Chinner
2007-04-27  8:48                 ` Andrew Morton
2007-04-27 16:45                   ` Theodore Tso
2007-05-04 13:33                     ` Eric W. Biederman
2007-05-07  4:29                       ` David Chinner
2007-05-07  4:48                         ` Eric W. Biederman
2007-05-07  5:27                           ` David Chinner
2007-05-07  6:43                             ` Eric W. Biederman
2007-05-07  6:49                               ` William Lee Irwin III
2007-05-07  7:06                                 ` William Lee Irwin III
2007-05-08  8:49                                   ` William Lee Irwin III
2007-05-07 16:06                               ` Christoph Lameter
2007-05-07 17:29                                 ` William Lee Irwin III
2007-05-04 12:57                   ` Eric W. Biederman
2007-05-04 13:31                 ` Eric W. Biederman
2007-05-04 16:11                   ` Christoph Lameter
2007-05-07  4:58                   ` David Chinner
2007-05-07  6:56                     ` Eric W. Biederman
2007-05-07 15:17                       ` Weigert, Daniel
2007-04-27 16:55           ` Theodore Tso
2007-04-27 17:32             ` Nicholas Miell
2007-04-27 18:12               ` William Lee Irwin III
2007-04-28 16:39 ` Maxim Levitsky
2007-04-30  5:23   ` Christoph Lameter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200704262150.18748.maximlevitsky@gmail.com \
    --to=maximlevitsky@gmail.com \
    --cc=clameter@sgi.com \
    --cc=dgc@sgi.com \
    --cc=jens.axboe@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mel@skynet.ie \
    --cc=pbadari@gmail.com \
    --cc=wli@holomorphy.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.