* Question regarding concurrent accesses through block device and fs
From: Francis Moreau @ 2009-02-19 11:07 UTC
To: Linux Kernel Mailing List, Andrew Morton
[ Resend to LKML, hoping to get a wider audience ;) and to Andrew Morton since
he wrote that part of the code, I think ]
Hello,
I have a question regarding the page cache/buffer heads behaviour when
some blocks are accessed through a regular file and through the block
dev hosting this file.
First it looks like when accessing some blocks through a block device,
the same mechanisms are used as when reading a file through a file
system: the page cache is used.
That means that a block could be mapped by several buffers at the same
time.
I don't see any issue with this (if we agree that the behaviour is undefined
in that case), but looking at __block_prepare_write(), it seems that we don't
want this to happen since it does:
    [...]
    if (buffer_new(bh)) {
        unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
        [...]
    }
where unmap_underlying_metadata() unmaps the blockdev buffer
which maps block b_blocknr.
This code seems to catch only the case where the buffer is new (I don't
see why only this case is handled).
Also this call seems unneeded if __block_prepare_write() is called
when writing through the block dev since we already know that the buffer
doesn't exist (we are here to create it).
I have already read the comment on unmap_underlying_metadata(),
but I failed to understand it...
Could anybody tell me what the actual policy is?
thanks
--
Francis
* Re: Question regarding concurrent accesses through block device and fs
From: Nick Piggin @ 2009-02-19 13:44 UTC
To: Francis Moreau; +Cc: Linux Kernel Mailing List, Andrew Morton
On Thursday 19 February 2009 22:07:42 Francis Moreau wrote:
> [ Resend to LKML, hoping to get a wider audience ;) and to Andrew Morton
> since he wrote that part of the code, I think ]
>
> Hello,
>
> I have a question regarding the page cache/buffer heads behaviour when
> some blocks are accessed through a regular file and through the block
> dev hosting this file.
>
> First it looks like when accessing some blocks through a block device,
> the same mechanisms are used as when reading a file through a file
> system: the page cache is used.
Yes. The page cache of the block device is also sometimes called the buffer
cache, for historical reasons.
> That means that a block could be mapped by several buffers at the same
> time.
>
> I don't see any issue with this (if we agree that the behaviour is undefined
> in that case), but looking at __block_prepare_write(), it seems that we
> don't want this to happen since it does:
>
>     [...]
>     if (buffer_new(bh)) {
>         unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
>         [...]
>     }
>
> where unmap_underlying_metadata() unmaps the blockdev buffer
> which maps block b_blocknr.
>
> This code seems to catch only the case where the buffer is new (I don't
> see why only this case is handled).
>
> Also this call seems unneeded if __block_prepare_write() is called
> when writing through the block dev since we already know that the buffer
> doesn't exist (we are here to create it).
>
> I have already read the comment on unmap_underlying_metadata(),
> but I failed to understand it...
>
> Could anybody tell me what the actual policy is?
This is done only for newly allocated on-disk blocks (which is what
buffer_new means, not new in-memory buffers). And it is only there to
synchronize buffercache access by the filesystem for its metadata, rather
than trying to make /dev/bdev access coherent with file access.
Basically what can happen is that a filesystem will have perhaps allocated
a block for an array of indirect pointers. The filesystem manages this
via the buffercache and writes a few pointers into it. Then suppose the file
is truncated and that block becomes unused so it can be freed by the
filesystem block allocator. And the filesystem may also call bforget to
prevent the now useless buffer from being written out in future.
Now suppose a new block is required for *file* data, and the filesystem happens
to reallocate that block. So now we may still have that old buffercache and
buffer head around, but we also have this new pagecache and buffer head for
the file that points to the same block (buffer_new will be set on this new
buffer head, btw, to reflect that it is a newly allocated block).
All fine so far.
Now there is a potential problem because the old buffer can *still be under
writeback* dating back from when it was still good metadata and before
bforget was called. That's a problem because the new buffer is expecting
to be the owner and master of the block and its data.
That is what the second paragraph in the comment refers to. I don't actually
quite know what the problem is that is described in the first paragraph.
Andrew, do you know?
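A toy user-space model may make the hazard easier to see. It is not kernel
code: toy_disk, toy_buf and file_data_survives() are names invented here, and
"writeback" is reduced to copying a buffer into an array. The only point it
illustrates is that once two cached buffers alias the same disk block,
whichever write completes last decides what ends up on disk.

```c
#include <assert.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16
#define TOY_NR_BLOCKS  4

/* Hypothetical stand-in for the disk: one small array per block. */
struct toy_disk {
	char blocks[TOY_NR_BLOCKS][TOY_BLOCK_SIZE];
};

/* Hypothetical stand-in for a cached buffer that maps one disk block. */
struct toy_buf {
	int  blocknr;
	char data[TOY_BLOCK_SIZE];
};

/* "Complete" a write: the buffer's contents land in its disk block. */
void toy_write_complete(struct toy_disk *d, const struct toy_buf *b)
{
	assert(b->blocknr >= 0 && b->blocknr < TOY_NR_BLOCKS);
	memcpy(d->blocks[b->blocknr], b->data, TOY_BLOCK_SIZE);
}

/*
 * Two buffers alias block 2: the old metadata buffer (buffercache) and
 * the new file-data buffer (pagecache). Returns 1 if the disk ends up
 * holding the file data, 0 if the stale metadata was written last.
 */
int file_data_survives(int stale_write_completes_last)
{
	struct toy_disk disk;
	struct toy_buf old_meta = { 2, "old-indirect" };
	struct toy_buf new_data = { 2, "new-file-data" };

	memset(&disk, 0, sizeof(disk));
	if (stale_write_completes_last) {
		toy_write_complete(&disk, &new_data);
		toy_write_complete(&disk, &old_meta);  /* clobbers the file data */
	} else {
		toy_write_complete(&disk, &old_meta);
		toy_write_complete(&disk, &new_data);
	}
	return strcmp(disk.blocks[2], new_data.data) == 0;
}
```

In this model file_data_survives(1) is the bad ordering: the stale metadata
write lands after the file data and the block no longer holds what the file
expects.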
* Re: Question regarding concurrent accesses through block device and fs
From: Francis Moreau @ 2009-02-20 14:10 UTC
To: Nick Piggin; +Cc: Linux Kernel Mailing List, Andrew Morton
Hello,
Thanks a lot for your explanations !
On Thu, Feb 19, 2009 at 2:44 PM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> On Thursday 19 February 2009 22:07:42 Francis Moreau wrote:
>>
>> First it looks like when accessing some blocks through a block device,
>> the same mechanisms are used as when reading a file through a file
>> system: the page cache is used.
>
> Yes. The page cache of the block device is also sometimes called the buffer
> cache, for historical reasons.
I didn't know this term before.
> This is done only for newly allocated on-disk blocks (which is what
> buffer_new means, not new in-memory buffers). And it is only there to
> synchronize buffercache access by the filesystem for its metadata, rather
> than trying to make /dev/bdev access coherent with file access.
Well, I'm (still) confused by two things:
- the comment on unmap_underlying_metadata() doesn't sound like we're
dealing with metadata only:
" ... we don't want any output from any buffer-cache aliases starting ... "
Note the word *any*. But I must admit that I don't understand the whole
comment.
- looking at unmap_underlying_metadata(), there's no code to deal with
metadata buffers. It gets the buffer and unmaps it, whatever the type of
data it contains.
But at least the name of this function is now clearer.
>
> Basically what can happen is that a filesystem will have perhaps allocated
> a block for an array of indirect pointers. The filesystem manages this
> via the buffercache and writes a few pointers into it. Then suppose the file
> is truncated and that block becomes unused so it can be freed by the
> filesystem block allocator. And the filesystem may also call bforget to
> prevent the now useless buffer from being written out in future.
>
ok, so now the buffercache is discarded and its content is either discarded
or is being written back.
> Now suppose a new block is required for *file* data, and the filesystem happens
> to reallocate that block. So now we may still have that old buffercache and
> buffer head around, but we also have this new pagecache and buffer head for
> the file that points to the same block (buffer_new will be set on this new
> buffer head, btw, to reflect that it is a newly allocated block).
>
ok
> All fine so far.
>
> Now there is a potential problem because the old buffer can *still be under
> writeback* dating back from when it was still good metadata and before
> bforget was called. That's a problem because the new buffer is expecting
> to be the owner and master of the block and its data.
Now I don't see the problem.
Even if the old metadata is under writeback, the new buffer can still
be used: since it's new, there's no point in doing I/O to read its content. If
we need to write it to disk, then the I/O will overwrite the old metadata;
there's no risk that the old metadata will overwrite the new data.
What am I missing?
Thanks
--
Francis
* Re: Question regarding concurrent accesses through block device and fs
From: Nick Piggin @ 2009-02-23 3:58 UTC
To: Francis Moreau; +Cc: Linux Kernel Mailing List, Andrew Morton
On Saturday 21 February 2009 01:10:24 Francis Moreau wrote:
> On Thu, Feb 19, 2009 at 2:44 PM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > This is done only for newly allocated on-disk blocks (which is what
> > buffer_new means, not new in-memory buffers). And it is only there to
> > synchronize buffercache access by the filesystem for its metadata, rather
> > than trying to make /dev/bdev access coherent with file access.
>
> Well, I'm (still) confused by two things:
>
> - the comment on unmap_underlying_metadata() doesn't sound like we're
> dealing with metadata only:
>
> " ... we don't want any output from any buffer-cache aliases starting
> ... "
>
> Note the word *any*. But I must admit that I don't understand the whole
> comment.
Well, the buffer cache is (almost) always metadata from the filesystem's point
of view. The comment *could* be talking about access to /dev/bdev, but in that
case the code is doing the wrong thing WRT coherency anyway (clearing the dirty
bit), so I don't see how it could be talking about that.
> - looking at unmap_underlying_metadata(), there's no code to deal with
> metadata buffers. It gets the buffer and unmaps it, whatever the type of
> data it contains.
That's why I say it only really works for buffer cache used by the same
filesystem that is now known to be unused.
> But at least the name of this function is now clearer.
>
> > Basically what can happen is that a filesystem will have perhaps
> > allocated a block for an array of indirect pointers. The filesystem
> > manages this via the buffercache and writes a few pointers into it. Then
> > suppose the file is truncated and that block becomes unused so it can be
> > freed by the filesystem block allocator. And the filesystem may also call
> > bforget to prevent the now useless buffer from being written out in
> > future.
>
> ok, so now the buffercache is discarded and its content is either
> discarded or is being written back.
>
> > Now suppose a new block is required for *file* data, and the filesystem
> > happens to reallocate that block. So now we may still have that old
> > buffercache and buffer head around, but we also have this new pagecache
> > and buffer head for the file that points to the same block (buffer_new
> > will be set on this new buffer head, btw, to reflect that it is a newly
> > allocated block).
>
> ok
>
> > All fine so far.
> >
> > Now there is a potential problem because the old buffer can *still be
> > under writeback* dating back from when it was still good metadata and
> > before bforget was called. That's a problem because the new buffer is
> > expecting to be the owner and master of the block and its data.
>
> Now I don't see the problem.
>
> Even if the old metadata is under writeback, the new buffer can
> still be used: since it's new, there's no point in doing I/O to read its
> content. If we need to write it to disk, then the I/O will overwrite the old
> metadata; there's no risk that the old metadata will overwrite the new data.
>
> What am I missing?
That we might complete the write of the new buffer before the
old buffer is finished writing out?
Or, I suppose it also covers filesystems that do not always
discard old buffers with bforget, so they don't have the dirty
bit cleared (I'm not 100% sure whether this is considered a
filesystem bug or not, but at least unmap_underlying_metadata
protects against it).
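Both cases can be sketched in user space, under the assumption that the unmap
step amounts to "clear the dirty bit on the old alias, then wait for any write
of it that is already in flight". Everything below (toy_bh, toy_unmap_alias(),
the in_flight flag) is invented for illustration and heavily simplified; it is
not the buffer.c code.

```c
#include <assert.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16

/* Hypothetical, simplified buffer head: one cached copy of a disk block. */
struct toy_bh {
	int  dirty;      /* contents still need to be written out */
	int  in_flight;  /* a write of this buffer has been submitted */
	char data[TOY_BLOCK_SIZE];
};

/* Submitting a write consumes the dirty bit and puts the I/O in flight. */
void toy_submit_write(struct toy_bh *bh)
{
	if (bh->dirty) {
		bh->dirty = 0;
		bh->in_flight = 1;
	}
}

/* The "device" finishes an in-flight write: data reaches the block. */
void toy_complete_write(struct toy_bh *bh, char *disk_block)
{
	if (bh->in_flight) {
		memcpy(disk_block, bh->data, TOY_BLOCK_SIZE);
		bh->in_flight = 0;
	}
}

/*
 * Toy version of the unmap step: cancel pending writeout of the old
 * alias and wait until nothing from it can still reach the disk.
 */
void toy_unmap_alias(struct toy_bh *old, char *disk_block)
{
	old->dirty = 0;                       /* never start a new write */
	toy_complete_write(old, disk_block);  /* "wait" for one in flight */
	assert(!old->dirty && !old->in_flight);
}

/* Returns 1 if the block holds the new file data at the end. */
int toy_reuse_block(int old_was_in_flight, int do_unmap)
{
	char disk_block[TOY_BLOCK_SIZE] = "";
	struct toy_bh old = { 1, 0, "stale-metadata" };
	struct toy_bh new_bh = { 1, 0, "new-file-data" };

	if (old_was_in_flight)
		toy_submit_write(&old);
	if (do_unmap)
		toy_unmap_alias(&old, disk_block);

	/* The new buffer is written and its write completes first... */
	toy_submit_write(&new_bh);
	toy_complete_write(&new_bh, disk_block);
	/* ...then whatever is left of the old buffer reaches the disk. */
	toy_submit_write(&old);
	toy_complete_write(&old, disk_block);

	return strcmp(disk_block, new_bh.data) == 0;
}
```

Without the unmap step the stale buffer wins in both variants (already in
flight, or still dirty because it was never forgotten); with it, the new file
data is what remains in the block.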
* Re: Question regarding concurrent accesses through block device and fs
From: Francis Moreau @ 2009-03-01 14:42 UTC
To: Nick Piggin; +Cc: Linux Kernel Mailing List, Andrew Morton
[ Sorry for taking so long to answer, but I was away, I'm slow, and there is
a lot of complex code to dig through! ]
Nick Piggin <nickpiggin@yahoo.com.au> writes:
> On Saturday 21 February 2009 01:10:24 Francis Moreau wrote:
[...]
>> - looking at unmap_underlying_metadata(), there's no code to deal with
>> metadata buffers. It gets the buffer and unmaps it, whatever the type of
>> data it contains.
>
> That's why I say it only really works for buffer cache used by the same
> filesystem that is now known to be unused.
>
hum, I still don't know what you mean by this, sorry to be slow.
[...]
>> What am I missing ?
>
> That we might complete the write of the new buffer before the
> old buffer is finished writing out?
Ah yes, actually I realize that I don't know where and when the inode
blocks are actually written to the disk!
It seems that write_inode(), called after data is committed to the
disk, only marks the inode buffers as dirty but performs no I/O (at
least it looks so for ext2 when its 'do_sync' parameter is 0, which is
the case when this method is called by write_inode()).
Could you enlighten me one more time?
Thanks
--
Francis
* Re: Question regarding concurrent accesses through block device and fs
From: Nick Piggin @ 2009-03-01 15:32 UTC
To: Francis Moreau; +Cc: Linux Kernel Mailing List, Andrew Morton
On Monday 02 March 2009 01:42:55 Francis Moreau wrote:
> [ Sorry for taking so long to answer, but I was away, I'm slow, and there is
> a lot of complex code to dig through! ]
>
> Nick Piggin <nickpiggin@yahoo.com.au> writes:
> > On Saturday 21 February 2009 01:10:24 Francis Moreau wrote:
>
> [...]
>
> >> - looking at unmap_underlying_metadata(), there's no code to deal with
> >> metadata buffers. It gets the buffer and unmaps it, whatever the type
> >> of data it contains.
> >
> > That's why I say it only really works for buffer cache used by the same
> > filesystem that is now known to be unused.
>
> hum, I still don't know what you mean by this, sorry to be slow.
OK, the "buffercache", the cache of block device contents, is normally
thought of as metadata when it is being used by the filesystem (eg.
usually via bread() etc), or data when it is being read/written from
userspace via /dev/<blockdevice>.
In the former case, the buffer.c/filesystem code together know when a
metadata buffer is unused (because the filesystem has deallocated it),
so unmap_underlying_metadata will work there.
And it is insane to have a mounted filesystem and have userspace working
on the same block device, so unmap_underlying_metadata doesn't have to
care about that case. (IIRC some filesystem tools can do this, but there
are obviously a lot of tricks to it)
> >> What am I missing ?
> >
> > That we might complete the write of the new buffer before the
> > old buffer is finished writing out?
>
> Ah yes, actually I realize that I don't know where and when the inode
> blocks are actually written to the disk!
>
> It seems that write_inode(), called after data is committed to the
> disk, only marks the inode buffers as dirty but performs no I/O (at
> least it looks so for ext2 when its 'do_sync' parameter is 0, which is
> the case when this method is called by write_inode()).
>
> Could you enlighten me one more time?
Depends on the filesystem. Many do just use the buffercache as a
writeback cache for their metadata, and are happy to just let the
dirty page flushers write it out when it suits them (or when there
are explicit sync instructions given).
Most of the time, these filesystems don't really know or care when
exactly their metadata is under writeback.
* Re: Question regarding concurrent accesses through block device and fs
From: Francis Moreau @ 2009-03-01 21:07 UTC
To: Nick Piggin; +Cc: Linux Kernel Mailing List, Andrew Morton
Nick Piggin <nickpiggin@yahoo.com.au> writes:
> On Monday 02 March 2009 01:42:55 Francis Moreau wrote:
[...]
> OK, the "buffercache", the cache of block device contents, is normally
> thought of as metadata when it is being used by the filesystem (eg.
> usually via bread() etc), or data when it is being read/written from
> userspace via /dev/<blockdevice>.
>
> In the former case, the buffer.c/filesystem code together know when a
> metadata buffer is unused (because the filesystem has deallocated it),
> so unmap_underlying_metadata will work there.
>
> And it is insane to have a mounted filesystem and have userspace working
> on the same block device, so unmap_underlying_metadata doesn't have to
> care about that case. (IIRC some filesystem tools can do this, but there
> are obviously a lot of tricks to it)
Thanks for clarifying this.
[...]
> Depends on the filesystem. Many do just use the buffercache as a
> writeback cache for their metadata, and are happy to just let the
> dirty page flushers write it out when it suits them
I guess you're talking about the pdflush threads here.
This is the case where I can't find when the metadata are actually
written back to the disk by the flushers. I looked at
writeback_inodes() but I failed to find this out.
Could you point out the place in the code where this happens?
> (or when there are explicit sync instructions given).
yes I see where this happens in these cases.
> Most of the time, these filesystems don't really know or care when
> exactly their metadata is under writeback.
This sounds very weird to me but I need to learn how things work
before doing any serious comments.
thanks
--
Francis
* Re: Question regarding concurrent accesses through block device and fs
From: Nick Piggin @ 2009-03-02 7:11 UTC
To: Francis Moreau; +Cc: Linux Kernel Mailing List, Andrew Morton
On Monday 02 March 2009 08:07:30 Francis Moreau wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> writes:
> > Depends on the filesystem. Many do just use the buffercache as a
> > writeback cache for their metadata, and are happy to just let the
> > dirty page flushers write it out when it suits them
>
> I guess you're talking about the pdflush threads here.
Yeah.
> This is the case where I can't find when the metadata are actually
> written back to the disk by the flushers. I looked at
> writeback_inodes() but I failed to find this out.
>
> Could you point out the place in the code where this happens?
I guess it picks them up via their block device inodes.
> > (or when there are explicit sync instructions given).
>
> yes I see where this happens in these cases.
>
> > Most of the time, these filesystems don't really know or care when
> > exactly their metadata is under writeback.
>
> This sounds very weird to me but I need to learn how things work
> before doing any serious comments.
Why would they? They just operate on their metadata, and the buffer
cache is basically a transparent writeback cache to them. In the
same way, an application doesn't really know or care when exactly
its data is under writeback. unmap_underlying_metadata is the
important exception because Linux pagecache otherwise doesn't have
a good way to keep pagecache of different mappings coherent. So if
a block switches from buffercache to file mapping, it needs to be
made coherent.
When switching back the other way, the truncate code actually makes
sure of this, that there won't be blocks under writeout after
being deallocated.
Things do get more complicated with journalling file systems.
* Re: Question regarding concurrent accesses through block device and fs
From: Francis Moreau @ 2009-03-02 13:30 UTC
To: Nick Piggin; +Cc: Linux Kernel Mailing List, Andrew Morton
Nick Piggin <nickpiggin@yahoo.com.au> writes:
> On Monday 02 March 2009 08:07:30 Francis Moreau wrote:
>> This is the case where I can't find when the metadata are actually
>> written back to the disk by the flushers. I looked at
>> writeback_inodes() but I failed to find this out.
>>
>> Could you point out the place in the code where this happens?
>
> I guess it picks them up via their block device inodes.
Probably but I don't find the actual place.
I looked at the place where pages are normally written back to disk (i.e.
in background_writeout()) but I can see only the writeback of data, not
metadata...
>> This sounds very weird to me but I need to learn how things work
>> before doing any serious comments.
>
> Why would they? They just operate on their metadata, and the buffer
> cache is basically a transparent writeback cache to them.
Well, the fact that metadata are written back to disk at an unknown point
in time means that we don't know in which order metadata and data
are written. So it means that data can be written before or after
metadata or they can be mixed up.
And this sounds just weird to me. But as I said I'm just a noob so I
need to think and study more on this area and I really have to see where
the actual writes of metadata happen in the code.
> In the same way, an application doesn't really know or care when
> exactly its data is under writeback.
Except when dealing with metadata of the fs, we can corrupt the whole
thing, I think.
> unmap_underlying_metadata is the important exception because Linux
> pagecache otherwise doesn't have a good way to keep pagecache of
> different mappings coherent. So if a block switches from buffercache
> to file mapping, it needs to be made coherent.
>
> When switching back the other way, the truncate code actually makes
> sure of this, that there won't be blocks under writeout after
> being deallocated.
>
> Things do get more complicated with journalling file systems.
>
I think I'll just forget about them; things are currently complicated
enough without making them more obscure ;)
thanks
--
Francis
* Re: Question regarding concurrent accesses through block device and fs
From: Nick Piggin @ 2009-03-03 3:52 UTC
To: Francis Moreau; +Cc: Linux Kernel Mailing List, Andrew Morton
On Tuesday 03 March 2009 00:30:18 Francis Moreau wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> writes:
> > On Monday 02 March 2009 08:07:30 Francis Moreau wrote:
> >> This is the case where I can't find when the metadata are actually
> >> written back to the disk by the flushers. I looked at
> >> writeback_inodes() but I failed to find this out.
> >>
> >> Could you point out the place in the code where this happens?
> >
> > I guess it picks them up via their block device inodes.
>
> Probably but I don't find the actual place.
It was an educated guess ;) I'm quite sure it does.
> I looked at the place where pages are normally written back to disk (i.e.
> in background_writeout()) but I can see only the writeback of data, not
> metadata...
What are you expecting writeback of metadata to look like? To the
core kernel it looks the same as writeback of data.
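A rough user-space picture of why the flusher needs no special case: in this
toy model the filesystem's metadata buffers simply hang off a block-device
"inode", which is walked by the same loop as regular file inodes. The types and
names (toy_inode, toy_writepages(), toy_flush_all()) are made up for
illustration and do not correspond to the real writeback code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PAGES 8

/* Hypothetical inode: either a regular file or the block device itself. */
struct toy_inode {
	int is_bdev;                    /* 1 for the block-device inode */
	int dirty_page[TOY_MAX_PAGES];  /* 1 if that cached page is dirty */
};

/* Write out every dirty page of one inode; returns how many were written. */
int toy_writepages(struct toy_inode *inode)
{
	int written = 0;

	for (int i = 0; i < TOY_MAX_PAGES; i++) {
		if (inode->dirty_page[i]) {
			inode->dirty_page[i] = 0;  /* pretend the I/O happened */
			written++;
		}
	}
	return written;
}

/*
 * Toy background flusher: walks all inodes and treats them identically.
 * It never looks at is_bdev, so metadata cached under the block-device
 * inode is written back by the same loop as file data.
 */
int toy_flush_all(struct toy_inode *inodes, size_t n)
{
	int written = 0;

	assert(inodes != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		written += toy_writepages(&inodes[i]);
	return written;
}
```

Passing an array that contains one regular-file inode and one block-device
inode to toy_flush_all() cleans the dirty pages of both, with no branch that
distinguishes data from metadata.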
> >> This sounds very weird to me but I need to learn how things work
> >> before doing any serious comments.
> >
> > Why would they? They just operate on their metadata, and the buffer
> > cache is basically a transparent writeback cache to them.
>
> Well, the fact that metadata are written back to disk at an unknown point
> in time means that we don't know in which order metadata and data
> are written. So it means that data can be written before or after
> metadata or they can be mixed up.
But the cache layer on top of that ensures it *appears* not to be mixed
up. A problem arises when the system crashes in the middle of this, and
we lose that information and see a mixed up filesystem. Hence journalling
filesystems.
* Re: Question regarding concurrent accesses through block device and fs
From: Francis Moreau @ 2009-03-12 8:05 UTC
To: Nick Piggin; +Cc: Linux Kernel Mailing List, Andrew Morton
Hello Nick,
Sorry for the long delay before my answer, but I don't have enough
time to dig into the kernel source.
On Tue, Mar 3, 2009 at 4:52 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> On Tuesday 03 March 2009 00:30:18 Francis Moreau wrote:
>> Nick Piggin <nickpiggin@yahoo.com.au> writes:
>> > On Monday 02 March 2009 08:07:30 Francis Moreau wrote:
>> >> This is the case where I can't find when the metadata are actually
>> >> written back to the disk by the flushers. I looked at
>> >> writeback_inodes() but I failed to find this out.
>> >>
>> >> Could you point out the place in the code where this happens?
>> >
>> > I guess it picks them up via their block device inodes.
>>
>> Probably but I don't find the actual place.
>
> It was an educated guess ;) I'm quite sure it does.
>
OK, I think I've got the idea now. I thought the block device's main purpose
was to handle block nodes such as /dev/sdx, but it isn't.
>
>> I looked at the place where pages are normally written back to disk (i.e.
>> in background_writeout()) but I can see only the writeback of data, not
>> metadata...
>
> What are you expecting writeback of metadata to look like? To the
> core kernel it looks the same as writeback of data.
>
I don't know. I was just thinking that since metadata is special, in that it
holds critical filesystem information, the kernel would treat it specially.
> But the cache layer on top of that ensures it *appears* not to be mixed
> up. A problem arises when the system crashes in the middle of this, and
> we lose that information and see a mixed up filesystem. Hence journalling
> filesystems.
OK, I guess I've won another tour of the kernel code ;) to understand how the
cache layer does that.
thanks a lot.
--
Francis
* Re: Question regarding concurrent accesses through block device and fs
From: Nick Piggin @ 2009-03-12 8:22 UTC
To: Francis Moreau; +Cc: Linux Kernel Mailing List, Andrew Morton
On Thursday 12 March 2009 19:05:39 Francis Moreau wrote:
> > It was an educated guess ;) I'm quite sure it does.
>
> OK, I think I've got the idea now. I thought the block device's main purpose
> was to handle block nodes such as /dev/sdx, but it isn't.
Well, /dev/sdX access is important, at least to create and fsck the
filesystem ;) But for most Linux users, I think the majority of buffercache
access will be filesystem metadata access.
> >> I looked at the place where pages are normally written back to disk (i.e.
> >> in background_writeout()) but I can see only the writeback of data, not
> >> metadata...
> >
> > What are you expecting writeback of metadata to look like? To the
> > core kernel it looks the same as writeback of data.
>
> I don't know. I was just thinking that since metadata is special, in that it
> holds critical filesystem information, the kernel would treat it
> specially.
It is, but you have to look in the filesystems themselves to see that.
There are some exceptions to that -- eg. sync_mapping_buffers in
buffer.c where it writes out dirty metadata buffers that the filesystem
has attached to a file. But that's fsync driven rather than background
writeout.
> > But the cache layer on top of that ensures it *appears* not to be mixed
> > up. A problem arises when the system crashes in the middle of this, and
> > we lose that information and see a mixed up filesystem. Hence journalling
> > filesystems.
>
> OK, I guess I've won another tour of the kernel code ;) to understand how the
> cache layer does that.
Ignore details like crashes, direct IO and coherency between data mappings
and buffercache where things get a bit hairy, and it's just a writeback
cache. The last thing you write to some location will be what you get back
if you read from that location -- regardless of whether it is dirty or clean
or not present when you ask for it (and has to be read from disk).
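That read-your-last-write property can be sketched with a tiny user-space
write-back cache. This is only a model: toy_cache and its functions are
invented names, blocks are single ints, and there is no locking, no I/O and no
eviction policy.

```c
#include <assert.h>
#include <string.h>

#define TOY_NR_BLOCKS 4

/* Hypothetical write-back cache in front of a tiny "disk" of ints. */
struct toy_cache {
	int disk[TOY_NR_BLOCKS];     /* backing store */
	int cached[TOY_NR_BLOCKS];   /* cached copy, valid only if present */
	int present[TOY_NR_BLOCKS];  /* 1 if the block is in the cache */
	int dirty[TOY_NR_BLOCKS];    /* 1 if the cached copy is newer than disk */
};

void toy_cache_init(struct toy_cache *c)
{
	memset(c, 0, sizeof(*c));
}

/* Writes only touch the cache; the disk is updated later. */
void toy_cache_write(struct toy_cache *c, int blk, int val)
{
	assert(blk >= 0 && blk < TOY_NR_BLOCKS);
	c->cached[blk] = val;
	c->present[blk] = 1;
	c->dirty[blk] = 1;
}

/* Reads are served from the cache, filling it from disk on a miss. */
int toy_cache_read(struct toy_cache *c, int blk)
{
	assert(blk >= 0 && blk < TOY_NR_BLOCKS);
	if (!c->present[blk]) {
		c->cached[blk] = c->disk[blk];
		c->present[blk] = 1;
	}
	return c->cached[blk];
}

/* Writeback: push a dirty block to disk; it stays cached but clean. */
void toy_cache_writeback(struct toy_cache *c, int blk)
{
	assert(blk >= 0 && blk < TOY_NR_BLOCKS);
	if (c->present[blk] && c->dirty[blk]) {
		c->disk[blk] = c->cached[blk];
		c->dirty[blk] = 0;
	}
}

/* Eviction: drop a block from the cache, writing it back first if dirty. */
void toy_cache_evict(struct toy_cache *c, int blk)
{
	toy_cache_writeback(c, blk);
	c->present[blk] = 0;
}
```

After toy_cache_write(&c, 1, 42), a toy_cache_read(&c, 1) returns 42 whether
the block is still dirty, has been cleaned by toy_cache_writeback(), or has
been dropped by toy_cache_evict() and must be refilled from the toy disk.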
* Re: Question regarding concurrent accesses through block device and fs
From: Francis Moreau @ 2009-03-12 9:00 UTC
To: Nick Piggin; +Cc: Linux Kernel Mailing List, Andrew Morton
On Thu, Mar 12, 2009 at 9:22 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Ignore details like crashes, direct IO and coherency between data mappings
> and buffercache where things get a bit hairy, and it's just a writeback
> cache. The last thing you write to some location will be what you get back
> if you read from that location -- regardless of whether it is dirty or clean
> or not present when you ask for it (and has to be read from disk).
>
Well yes, but I was wondering how, in the special case where the kernel crashes
or the power goes down, the kernel minimizes the risk of filesystem
inconsistency. Hence my questions about metadata handling.
Thanks
--
Francis
* Re: Question regarding concurrent accesses through block device and fs
From: Nick Piggin @ 2009-03-12 9:12 UTC
To: Francis Moreau; +Cc: Linux Kernel Mailing List, Andrew Morton
On Thursday 12 March 2009 20:00:38 Francis Moreau wrote:
> On Thu, Mar 12, 2009 at 9:22 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > Ignore details like crashes, direct IO and coherency between data
> > mappings and buffercache where things get a bit hairy, and it's just a
> > writeback cache. The last thing you write to some location will be what
> > you get back if you read from that location -- regardless of whether it
> > is dirty or clean or not present when you ask for it (and has to be read
> > from disk).
>
> Well yes, but I was wondering how, in the special case where the kernel crashes
> or the power goes down, the kernel minimizes the risk of filesystem
> inconsistency. Hence my questions about metadata handling.
Well, journalling filesystems; other filesystems can do synchronous
metadata updates (write-through), which can help too. There is really
nothing that the generic pagecache/buffercache code does to try to
handle this because it is far too filesystem-specific.
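The difference between a write-back and a synchronous (write-through) update
across a crash can be sketched in a few lines of user-space C. The toy_store
type and its functions are hypothetical; a "crash" here just means the cached
copy is thrown away and only the disk copy remains.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-block store with a cache in front of it. */
struct toy_store {
	int disk;    /* what survives a crash */
	int cached;  /* what readers see while the system is up */
	int dirty;   /* 1 if cached is newer than disk */
};

/* Write-back update: only the cache changes, the disk copy lags behind. */
void toy_update_writeback(struct toy_store *s, int val)
{
	assert(s != NULL);
	s->cached = val;
	s->dirty = 1;
}

/* Write-through (synchronous) update: the disk is updated before returning. */
void toy_update_sync(struct toy_store *s, int val)
{
	assert(s != NULL);
	s->cached = val;
	s->disk = val;
	s->dirty = 0;
}

/* Simulated crash: cached state is lost, only the disk copy remains. */
int toy_crash_and_read(struct toy_store *s)
{
	assert(s != NULL);
	s->cached = s->disk;
	s->dirty = 0;
	return s->disk;
}
```

While the system is up both kinds of update look identical to readers; after
toy_crash_and_read() only the value written with toy_update_sync() is still
there. Ordering several such updates consistently is the filesystem-specific
part that this sketch does not attempt.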