linux-fsdevel.vger.kernel.org archive mirror
* Address space operations questions
       [not found] <8e70aacf05032616151c958eed@mail.gmail.com>
@ 2005-03-29 22:30 ` Martin Jambor
  2005-03-30 13:55   ` Nikita Danilov
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Jambor @ 2005-03-29 22:30 UTC (permalink / raw)
  To: linux-fsdevel

Hi,

I am having trouble understanding the purpose of the different entries of
struct address_space_operations in the 2.6 kernels:

1. What is bmap for and what is it supposed to do?

2. What is the difference between sync_page and write_page?

3. What exactly (fs independent) is the relation between
write_page, prepare_write and commit_write? Does prepare make sure a
page can be written (e.g. by allocating space), commit mark it dirty,
and write write it out sometime later?

Thank you very much for any insight,

Martin

P.S.: I have dedicated a lot of time to searching for documentation,
tried IRC forums and even bought a book, without getting a truly good
answer to the above questions. Oh yeah, I have read a lot of the source,
though not all of it yet :-) On the other hand, if you find this
question inappropriate for this mailing list (too basic, perhaps),
please let me know.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Address space operations questions
  2005-03-29 22:30 ` Address space operations questions Martin Jambor
@ 2005-03-30 13:55   ` Nikita Danilov
  2005-03-31 19:59     ` Bryan Henderson
  2005-04-06 23:52     ` Martin Jambor
  0 siblings, 2 replies; 13+ messages in thread
From: Nikita Danilov @ 2005-03-30 13:55 UTC (permalink / raw)
  To: Martin Jambor; +Cc: linux-fsdevel

Martin Jambor writes:
 > Hi,
 > 
 > I am having trouble understanding the purpose of the different entries of
 > struct address_space_operations in the 2.6 kernels:
 > 
 > 1. What is bmap for and what is it supposed to do?

->bmap() maps a logical block offset within an "object" to a physical
block number. It is used in a few places, notably in the implementation
of the FIBMAP ioctl.

 > 
 > 2. What is the difference between sync_page and write_page?

(It is spelt ->writepage() by the way).

->sync_page() is an awful misnomer. Usually, when a page IO operation is
requested by calling ->writepage() or ->readpage(), the file system queues
an IO request (e.g., a disk-based file system may do this by calling
submit_bio()), but the underlying device driver does not proceed with this
IO immediately, because IO scheduling is more efficient when there are
multiple requests in the queue.

Only when something really wants to wait for IO completion
(wait_on_page_{locked,writeback}() are used to wait for read and write
completion respectively) is the IO queue processed. To do this,
wait_on_page_bit() calls ->sync_page() (see block_sync_page(), the
standard implementation of ->sync_page() for disk-based file systems).

So, semantics of ->sync_page() are roughly "kick underlying storage
driver to actually perform all IO queued for this page, and, maybe, for
other pages on this device too".
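For reference, the 2.6-era block_sync_page() in fs/buffer.c is essentially
the following (quoted from memory, so details may differ between kernel
versions). Note that it does not wait for anything; it only unplugs the
request queue of the backing device:

```c
/* Sketch of block_sync_page() (2.6-era fs/buffer.c, from memory).
 * It merely kicks the backing device's queue; waiting is done by the
 * caller, wait_on_page_bit(). */
void block_sync_page(struct page *page)
{
	struct address_space *mapping;

	smp_mb();                       /* pair with page flag updates */
	mapping = page_mapping(page);   /* may be NULL under truncation */
	if (mapping)
		blk_run_backing_dev(mapping->backing_dev_info, page);
}
```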

 > 
 > 3. What exactly (fs independent) is the relation in between
 > write_page, prepare_write and commit_write? Does prepare make sure a
 > page can be written (like allocating space), commit mark it dirty a
 > write write it sometime later on?

->prepare_write() and ->commit_write() are only used by
generic_file_write() (so, one may argue that they shouldn't be placed
in struct address_space_operations at all).

generic_file_write() runs the following loop for each page overlapping
the portion of the file that the write goes into:

     a_ops->prepare_write(file, page, from, to);
     copy_from_user(...);
     a_ops->commit_write(file, page, from, to);

If a page is partially overwritten, ->prepare_write() has to read the
parts of the page that are not covered by the write. ->commit_write() is
expected to mark the page (or buffers) and the inode dirty, and to update
the inode size if the write extends the file.

As for block allocation and transaction handling, this is up to the file
system back end.
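Putting the pieces together, the per-page loop in generic_file_write()
looks roughly like this (a simplified sketch of the 2.6-era mm/filemap.c
code; helper names are approximate and partial-copy/AIO/error handling
is abbreviated):

```c
/* Simplified sketch of the generic_file_write() write loop; the real
 * code also handles partial copies from userspace, AIO, and error
 * unwinding. */
do {
	/* find or create the page cache page for this file offset */
	page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);

	/* let the fs read in uncovered parts and map/allocate blocks */
	status = a_ops->prepare_write(file, page, offset, offset + bytes);
	if (status)
		break;

	/* copy the user data into the page */
	copied = filemap_copy_from_user(page, offset, buf, bytes);

	/* mark page/inode dirty; update i_size if the file grew */
	status = a_ops->commit_write(file, page, offset, offset + bytes);

	unlock_page(page);
	page_cache_release(page);
} while (count);
```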

Usually ->commit_write() doesn't start IO by itself; it just marks pages
dirty. Write-out is done by balance_dirty_pages_ratelimited(): when the
number of dirty pages in the system exceeds some threshold, the kernel
calls ->writepages() on dirty inodes.

->writepage() is used in two places:

    - by the VM scanner, to write out dirty pages from the tail of the
    inactive list.  This is the "rare" path, because balance_dirty_pages()
    is supposed to keep the amount of dirty pages under control.

    - by mpage_writepages(): the default implementation of the
    ->writepages() method.

 > 
 > Thank you very much for any insight,
 > 
 > Martin

Hope this helps.

Nikita.


* Re: Address space operations questions
  2005-03-30 13:55   ` Nikita Danilov
@ 2005-03-31 19:59     ` Bryan Henderson
  2005-03-31 20:43       ` Zach Brown
  2005-04-06 23:52     ` Martin Jambor
  1 sibling, 1 reply; 13+ messages in thread
From: Bryan Henderson @ 2005-03-31 19:59 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Martin Jambor, linux-fsdevel

>So, semantics of ->sync_page() are roughly "kick underlying storage
>driver to actually perform all IO queued for this page, and, maybe, for
>other pages on this device too".

I prefer to think of it in a more modular sense.  To preserve modularity, 
the caller of sync_page() can't know anything about I/O scheduling.  So I 
think the semantics of ->sync_page() are "Someone is about to wait for the 
results of the previously requested write_page on this page."  It's 
completely up to the owner of the address space to figure out what would 
be appropriate to do given that information.

I agree that for the conventional filesystem and device types for which 
this interface was designed, the appropriate response would be to start 
any queued I/O.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems


* Re: Address space operations questions
  2005-03-31 19:59     ` Bryan Henderson
@ 2005-03-31 20:43       ` Zach Brown
  2005-03-31 21:40         ` Bryan Henderson
  0 siblings, 1 reply; 13+ messages in thread
From: Zach Brown @ 2005-03-31 20:43 UTC (permalink / raw)
  To: Bryan Henderson; +Cc: Nikita Danilov, linux-fsdevel

Bryan Henderson wrote:
>>So, semantics of ->sync_page() are roughly "kick underlying storage
>>driver to actually perform all IO queued for this page, and, maybe, for
>>other pages on this device too".
> 
> 
> I prefer to think of it in a more modular sense.  To preserve modularity, 
> the caller of sync_page() can't know anything about I/O scheduling.  So I 
> think the semantics of ->sync_page() are "Someone is about to wait for the 
> results of the previously requested write_page on this page."  It's 
> completely up to the owner of the address space to figure out what would 
> be appropriate to do given that information.

Though I agree with your desire for a "modular" interpretation, I'm
going to disagree with your description of the information sync_page()
provides.  Nikita's vague description is closer to the truth, because
sync_page() is vague.

If you follow the callers of sync_page() you quickly find that what it
*really* means to be called in sync_page() is that you're being told
that some process is about to block on that page.  For what reason, you
can't know from the call alone.  Waiting for read to complete and
unlock?  Waiting for writeback to clear the writeback bit?  Some
processes just happened to race to lock_page() on the same page for
reasons that have nothing to do with IO?

And the not-so-initiated might think that sync_page() is called with the
page lock held, just like the other ops.  That's obviously not the case,
as it is called when the page lock can't be acquired, but it makes
actually *doing* something with that page argument perilous.
Synchronization is left to the sync_page() callee.

For example, cachefs seems to be among the very few that don't use
block_sync_page() directly as their ->sync_page().  All its method does
is printk() and then call block_sync_page().  In that single line of
code it dereferences page->mapping without checking, and so can probably
fatally race with truncation and the nulling of page->mapping.

- z


* Re: Address space operations questions
  2005-03-31 20:43       ` Zach Brown
@ 2005-03-31 21:40         ` Bryan Henderson
  2005-03-31 21:53           ` Trond Myklebust
  0 siblings, 1 reply; 13+ messages in thread
From: Bryan Henderson @ 2005-03-31 21:40 UTC (permalink / raw)
  To: Zach Brown; +Cc: linux-fsdevel, Nikita Danilov

>what it
>*really* means to be called in sync_page() is that you're being told
>that some process is about to block on that page.  For what reason, you
>can't know from the call alone.

Ugh.  IOW it barely means anything.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems



* Re: Address space operations questions
  2005-03-31 21:40         ` Bryan Henderson
@ 2005-03-31 21:53           ` Trond Myklebust
  2005-04-01  0:06             ` Bryan Henderson
  0 siblings, 1 reply; 13+ messages in thread
From: Trond Myklebust @ 2005-03-31 21:53 UTC (permalink / raw)
  To: Bryan Henderson; +Cc: Zach Brown, Linux Filesystem Development, Nikita Danilov

On Thursday 31.03.2005 at 13:40 (-0800), Bryan Henderson wrote:
> >what it
> >*really* means to be called in sync_page() is that you're being told
> >that some process is about to block on that page.  For what reason, you
> >can't know from the call alone.
> 
> Ugh.  IOW it barely means anything.

It reflects the fact that the page lock can be held for a variety of
reasons, some of which require you to kick the filesystem and some which
don't.

I introduced the sync_page() call in 2.4.x partly in order to get rid of
all those pathetic hard-coded calls to "run_task_queue(&tq_disk)" that
used to litter the 2.4.x mm code (and still do in some places). As far
as NFS is concerned, they are a useless distraction since only the block
code uses the tq_disk queue.

The other reason was that the NFS client itself had to defer actually
putting reads on the wire until someone requested the lock: the reason
was that there was no equivalent of the "readpages()" call, so that when
we wanted to coalesce more than 1 page worth of data into a single read
call, we had to exit readpage() without actually starting I/O in the
hope that the readahead code would then schedule a readpage() on a
neighbouring page.

Cheers,
  Trond

-- 
Trond Myklebust <trond.myklebust@fys.uio.no>



* Re: Address space operations questions
  2005-03-31 21:53           ` Trond Myklebust
@ 2005-04-01  0:06             ` Bryan Henderson
  0 siblings, 0 replies; 13+ messages in thread
From: Bryan Henderson @ 2005-04-01  0:06 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux Filesystem Development, Nikita Danilov, Zach Brown

>It reflects the fact that the page lock can be held for a variety of
>reasons, some of which require you to kick the filesystem and some which
>don't.

So then what I don't understand is why you would make a call that tells
you someone is trying to take the page lock.  Why not a call that tells
you something meaningful, like "someone is trying to read this page"?  Or
"someone is waiting for this page to get clean"?

>I introduced the sync_page() call in 2.4.x partly in order to get rid of
>all those pathetic hard-coded calls to "run_task_queue(&tq_disk)"

That was pathetic all right, and sync_page() would be a clear improvement 
if it just replaced those modularity-busting I/O scheduling calls.  But 
did it?  Were there run_task_queue's every time the kernel waited for page 
status to change?  I thought they were in more eclectic places.

>the NFS client itself had to defer actually
>putting reads on the wire until someone requested the lock

But really, you mean the client had to defer putting reads on the wire 
until someone was ready to use the data.  That suggests a call to 
->sync_page in file read or page fault code rather than deep in page 
management.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems


* Re: Address space operations questions
  2005-03-30 13:55   ` Nikita Danilov
  2005-03-31 19:59     ` Bryan Henderson
@ 2005-04-06 23:52     ` Martin Jambor
  2005-04-07  8:23       ` Nikita Danilov
  2005-04-07 16:58       ` Address space operations - >bmap Bryan Henderson
  1 sibling, 2 replies; 13+ messages in thread
From: Martin Jambor @ 2005-04-06 23:52 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: linux-fsdevel

Thank you very much for your reply.

On Mar 30, 2005 3:55 PM, Nikita Danilov <nikita@clusterfs.com> wrote:
>  > 1. What is bmap for and what is it supposed to do?
> 
> ->bmap() maps logical block offset within "object" to physical block
> number. It is used in few places, notably in the implementation of
> FIBMAP ioctl.

We are about to start implementing a fs where data can move around the
device, so a physical block address is not really useful. I have
understood from other postings to this list that reiserfs and ntfs
don't implement this method, so I suppose we'll do the same. I'll just
find some nice error to return.
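If my reading of the 2.6-era fs/ioctl.c is right, you don't even need to
return an error yourself: leaving the method pointer NULL makes the
generic FIBMAP handler fail the ioctl with -EINVAL. A sketch (the
myfs_* operation names are placeholders, not real functions):

```c
/* An address_space_operations table without ->bmap(); the VFS FIBMAP
 * handler checks for a NULL ->bmap and returns -EINVAL to userspace.
 * All myfs_* names are hypothetical. */
static struct address_space_operations myfs_aops = {
	.readpage	= myfs_readpage,
	.writepage	= myfs_writepage,
	.prepare_write	= myfs_prepare_write,
	.commit_write	= myfs_commit_write,
	/* .bmap intentionally left unset */
};
```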

>  > 2. What is the difference between sync_page and writepage?
>
> ->sync_page() is an awful misnomer. Usually, when page IO operation is
> requested by calling ->writepage() or ->readpage(), file-system queues
> IO request (e.g., disk-based file system may do this by calling
> submit_bio()), but underlying device driver does not proceed with this
> IO immediately, because IO scheduling is more efficient when there are
> multiple requests in the queue.
> 
> Only when something really wants to wait for IO completion
> (wait_on_page_{locked,writeback}() are used to wait for read and write
> completion respectively) IO queue is processed. To do this
> wait_on_page_bit() calls ->sync_page() (see block_sync_page()---standard
> implementation of ->sync_page() for disk-based file systems).

OK, so if I understand it well, sync_page does not actually write the
page anywhere, it only waits until the device driver finishes all
previous requests with that page, right? Does block_sync_page do
exactly that? (I would read the source but all it does is that it
calls a callback function) BTW, does it wait also for metadata?

Or is the semantics of this method really only "wait until the device
driver releases this page", with nothing to do with data consistency as
we know it from syncing files and filesystems? Moreover, if a page is
marked dirty but not yet sent to the device to be written, does
sync_page actually do nothing? Huh, please consider adding a comment to
the definition of address_space_operations :-)

Again, thanks a lot for all your replies, I have learnt an important bit.

Martin


* Re: Address space operations questions
  2005-04-06 23:52     ` Martin Jambor
@ 2005-04-07  8:23       ` Nikita Danilov
  2005-04-17 20:21         ` Lilo requirements (Was: Re: Address space operations questions) Martin Jambor
  2005-04-07 16:58       ` Address space operations - >bmap Bryan Henderson
  1 sibling, 1 reply; 13+ messages in thread
From: Nikita Danilov @ 2005-04-07  8:23 UTC (permalink / raw)
  To: Martin Jambor; +Cc: linux-fsdevel

Martin Jambor writes:
 > Thank you very much for your reply.
 > 
 > On Mar 30, 2005 3:55 PM, Nikita Danilov <nikita@clusterfs.com> wrote:
 > >  > 1. What is bmap for and what is it supposed to do?
 > > 
 > > ->bmap() maps logical block offset within "object" to physical block
 > > number. It is used in few places, notably in the implementation of
 > > FIBMAP ioctl.
 > 
 > We are about to start implementing a fs where data can move around the
 > device and so a physical block address is not really useful. I have
 > understood from other postings to this list that reiserfs and ntfs
 > don;t implement this method so I suppose we'll do the same. I'll just
 > find some nice error to return.

Consider tools like LILO that want stable block numbers for certain
files. In reiserfs (both v3 and v4) there is an ioctl that disables
relocation for a given file. Besides, I do not think ->bmap() is useless
even when block numbers are volatile, for one thing it allows user level
to track how file is laid out (for example, to measure fragmentation).

[...]

 > 
 > OK, so if I understand it well, sync_page does not actually write the
 > page anywhere, it only waits until the device driver finishes all
 > previous requests with that page, right? Does block_sync_page do

No. ->sync_page() doesn't wait for anything. It simply tells the
underlying storage layer "start executing all queued IO requests". If
your file system uses a block device as its storage, use
block_sync_page() as your ->sync_page() method.

 > exactly that? (I would read the source but all it does is that it
 > calls a callback function) BTW, does it wait also for metadata?

No difference between data and meta-data at this level.

 > 
 > Martin

Nikita.


* Re: Address space operations - >bmap
  2005-04-06 23:52     ` Martin Jambor
  2005-04-07  8:23       ` Nikita Danilov
@ 2005-04-07 16:58       ` Bryan Henderson
  1 sibling, 0 replies; 13+ messages in thread
From: Bryan Henderson @ 2005-04-07 16:58 UTC (permalink / raw)
  To: Martin Jambor; +Cc: linux-fsdevel, Nikita Danilov

>We are about to start implementing a fs where data can move around the
>device and so a physical block address is not really useful. I have
>understood from other postings to this list that reiserfs and ntfs
>don't implement this method so I suppose we'll do the same. I'll just
>find some nice error to return.

It's appropriate only for the most classic of filesystems, really.  It was 
always a layering violation, but is handy for hackish things.

Interfaces that expose block addresses are in the same boat as all those 
fsstat fields -- block size, blocks used, blocks free, inodes used, inodes 
free.  They make sense for the original Unix File System, but get harder 
to give meaning with every new generation.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems


* Lilo requirements (Was: Re: Address space operations questions)
  2005-04-07  8:23       ` Nikita Danilov
@ 2005-04-17 20:21         ` Martin Jambor
  2005-04-17 21:33           ` Nikita Danilov
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Jambor @ 2005-04-17 20:21 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: linux-fsdevel

Thanks for your reply. I found the following thing interesting on its own:

On 4/7/05, Nikita Danilov <nikita@clusterfs.com> wrote:
> Consider tools like LILO that want stable block numbers for certain
> files. In reiserfs (both v3 and v4) there is an ioctl that disables
> relocation for a given file. Besides, I do not think ->bmap() is useless
> even when block numbers are volatile, for one thing it allows user level
> to track how file is laid out (for example, to measure fragmentation).

I tried to google for what behaviour LILO requires filesystems to
exhibit, without much success... is that information available
somewhere I didn't look? Is it simple enough to be explained here?

TIA

Martin


* Re: Lilo requirements (Was: Re: Address space operations questions)
  2005-04-17 20:21         ` Lilo requirements (Was: Re: Address space operations questions) Martin Jambor
@ 2005-04-17 21:33           ` Nikita Danilov
  2005-04-18 17:33             ` Bryan Henderson
  0 siblings, 1 reply; 13+ messages in thread
From: Nikita Danilov @ 2005-04-17 21:33 UTC (permalink / raw)
  To: Martin Jambor; +Cc: linux-fsdevel

Martin Jambor writes:
 > Thanks for your reply. I found the following thing interesting on its own:
 > 
 > On 4/7/05, Nikita Danilov <nikita@clusterfs.com> wrote:
 > > Consider tools like LILO that want stable block numbers for certain
 > > files. In reiserfs (both v3 and v4) there is an ioctl that disables
 > > relocation for a given file. Besides, I do not think ->bmap() is useless
 > > even when block numbers are volatile, for one thing it allows user level
 > > to track how file is laid out (for example, to measure fragmentation).
 > 
 > I tried to google for what behaviour LILO requires filesystems to
 > exhibit, without much success... is that information available
 > somewhere I didn't look? Is it simple enough to be explained here?

As opposed to, say, GRUB, LILO doesn't parse the file system layout at
boot time. Instead, it remembers in which blocks the kernel image is
stored. This assumes the following properties of the file system:

 - the unit of disk space allocation for the kernel image file is the
 block. That is, optimizations like UFS fragments or reiserfs tails are
 not applied, and

 - the blocks that the kernel image is stored in are real disk blocks
 (i.e., there is a way to disable "delayed allocation"), and

 - the kernel image file is not relocated, i.e., its data are not moved
 to other blocks on the fly.

Currently the only file system that doesn't satisfy these requirements
is reiserfs, and it has a special ioctl, REISERFS_IOC_UNPACK, that
forces LILO-friendly behaviour for a specified file: no tails, no
delayed allocation, and no relocation. LILO detects when the kernel
image is on reiserfs and calls that ioctl.

 > 
 > TIA
 > 
 > Martin

Nikita.


* Re: Lilo requirements (Was: Re: Address space operations questions)
  2005-04-17 21:33           ` Nikita Danilov
@ 2005-04-18 17:33             ` Bryan Henderson
  0 siblings, 0 replies; 13+ messages in thread
From: Bryan Henderson @ 2005-04-18 17:33 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Martin Jambor, linux-fsdevel

>- the unit of disk space allocation for the kernel image file is the
> block. That is, optimizations like UFS fragments or reiserfs tails are
> not applied, and
>
> - the blocks that the kernel image is stored in are real disk blocks
> (i.e., there is a way to disable "delayed allocation"), and
>
> - the kernel image file is not relocated, i.e., its data are not moved
> to other blocks on the fly.

It also has to implement the ioctl that tells you what blocks a file is in 
(that kind of implies much of the above).  Except if the LILO installer 
makes special provisions as for Reiserfs, of course.

To be really exact, it's OK for the blocks to move, as long as it
doesn't happen so subtly that the user doesn't know to rerun the LILO
installer.  E.g. you can move the blocks of the kernel file if someone
overwrites it.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems


