(unknown)

linux-ext4.vger.kernel.org archive mirror

* (unknown)
@ 2010-05-14 21:39 Jiaying Zhang
  2010-05-14 22:07 ` your mail tytso
  0 siblings, 1 reply; 8+ messages in thread
From: Jiaying Zhang @ 2010-05-14 21:39 UTC (permalink / raw)
  To: curtw, fmayhar, mrubin, tytso
  Cc: "[PATCH]", fix, the, extent, validity, checking, in,
	e2fsck, linux-ext4

This patch changes e2fsck to use the same checking on the validity of an extent
as the kernel ext4 is using.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>

diff --git a/e2fsck/pass1.c b/e2fsck/pass1.c
index 3c6f91c..c5dc01a 100644
--- a/e2fsck/pass1.c
+++ b/e2fsck/pass1.c
@@ -1690,8 +1690,8 @@ static void scan_extent_node(e2fsck_t ctx, struct problem_context *pctx,
 		is_dir = LINUX_S_ISDIR(pctx->inode->i_mode);
 
 		problem = 0;
-		if (extent.e_pblk < ctx->fs->super->s_first_data_block ||
-		    extent.e_pblk >= ext2fs_blocks_count(ctx->fs->super))
+		if (extent.e_pblk <= ctx->fs->super->s_first_data_block ||
+		    extent.e_pblk > ext2fs_blocks_count(ctx->fs->super))
 			problem = PR_1_EXTENT_BAD_START_BLK;
 		else if (extent.e_lblk < start_block)
 			problem = PR_1_OUT_OF_ORDER_EXTENTS;


* Re: your mail
  2010-05-14 21:39 (unknown) Jiaying Zhang
@ 2010-05-14 22:07 ` tytso
  0 siblings, 0 replies; 8+ messages in thread
From: tytso @ 2010-05-14 22:07 UTC (permalink / raw)
  To: Jiaying Zhang; +Cc: curtw, fmayhar, mrubin, linux-ext4

On Fri, May 14, 2010 at 02:39:45PM -0700, Jiaying Zhang wrote:
> This patch changes e2fsck to use the same checking on the validity
> of an extent as the kernel ext4 is using.

Actually, the better fix is to explicitly test for extent.e_pblk == 0.
The kernel test works because at the moment nothing creates file
systems where s_first_data_block is anything other than 0 or 1.  But
technically speaking, having an extent which begins at
s_first_data_block isn't actually _wrong_.  It might overlap with fs
metadata, but pass1b will handle that.  The reason why it doesn't in
the case of 0 is because 0 is a special case and also means "there's
no block present" when returned by ext2fs_block_iterate.

Arguably the kernel should be changed to something similar, but in
practice it won't make a difference.  E2fsck can do a
slightly better job recovering in the case of 1k block filesystems, so
this patch is slightly better.
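
To make the difference between the three checks concrete, here is a minimal
sketch in C. It assumes a toy superblock struct (struct toy_super and the
bad_start_* helpers are hypothetical names, not e2fsprogs or kernel APIs) and
copies only the comparison operators from the two diffs in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two superblock fields the checks read. */
struct toy_super {
	uint64_t first_data_block;	/* models s_first_data_block */
	uint64_t blocks_count;		/* models the total block count */
};

/* Check as it stood before either patch: true means "bad start block". */
bool bad_start_old(const struct toy_super *sb, uint64_t pblk)
{
	assert(sb->blocks_count > sb->first_data_block);
	return pblk < sb->first_data_block || pblk >= sb->blocks_count;
}

/* Bounds as written in the first (kernel-style) patch in this thread. */
bool bad_start_kernel_style(const struct toy_super *sb, uint64_t pblk)
{
	assert(sb->blocks_count > sb->first_data_block);
	return pblk <= sb->first_data_block || pblk > sb->blocks_count;
}

/* Bounds as written in the committed fix: block 0 is always rejected. */
bool bad_start_fixed(const struct toy_super *sb, uint64_t pblk)
{
	assert(sb->blocks_count > sb->first_data_block);
	return pblk == 0 ||
	       pblk < sb->first_data_block ||
	       pblk >= sb->blocks_count;
}
```

With first_data_block = 0, only the old check accepts physical block 0. With
first_data_block = 1, the kernel-style check rejects block 1 while the
committed check accepts it and leaves any metadata overlap to pass1b.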

						- Ted

commit e6238d3708d328851bfdff7580d1b8504c7cf2e4
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Fri May 14 18:03:14 2010 -0400

    e2fsck: Explicitly reject extents that begin at physical block 0 as illegal

    In the case where s_first_data_block is 1, we need to explicitly reject
    an extent whose starting physical block is zero.

    Thanks to Jiaying Zhang <jiayingz@google.com> for finding this bug.

    Addresses-Google-Bug: #2573806

    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/e2fsck/pass1.c b/e2fsck/pass1.c
index 5e2ecc7..c35937f 100644
--- a/e2fsck/pass1.c
+++ b/e2fsck/pass1.c
@@ -1694,7 +1694,8 @@ static void scan_extent_node(e2fsck_t ctx, struct problem_context *pctx,
 		is_dir = LINUX_S_ISDIR(pctx->inode->i_mode);

 		problem = 0;
-		if (extent.e_pblk < ctx->fs->super->s_first_data_block ||
+		if (extent.e_pblk == 0 ||
+		    extent.e_pblk < ctx->fs->super->s_first_data_block ||
 		    extent.e_pblk >= ctx->fs->super->s_blocks_count)
 			problem = PR_1_EXTENT_BAD_START_BLK;
 		else if (extent.e_lblk < start_block)


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
@ 2011-05-03 11:01 Surbhi Palande
  2011-05-03 13:08 ` (unknown), Surbhi Palande
  0 siblings, 1 reply; 8+ messages in thread
From: Surbhi Palande @ 2011-05-03 11:01 UTC (permalink / raw)
  To: Toshiyuki Okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel, sandeen

On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
> Hi,
>
> (2011/04/16 2:13), Jan Kara wrote:
>> Hello,
>>
>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>> For ext3 or ext4 without delayed allocation we block inside writepage()
>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should
>>>> probably
>>>> get modified to block while minor-faulting the page on frozen fs
>>>> because
>>>> when blocks are already allocated we may skip starting a transaction
>>>> and so
>>>> we could possibly modify the filesystem.
>>> OK. I think ->page_mkwrite() should also block writing the
>>> minor-faulting pages.
>>>
>>> (minor-pagefault)
>>> -> do_wp_page()
>>> -> page_mkwrite(= ext4_mkwrite())
>>> => BLOCK!
>>>
>>> (major-pagefault)
>>> -> do_linear_fault()
>>> -> page_mkwrite(= ext4_mkwrite())
>>> => BLOCK!
>>>
>>>>
>>>>>>> Mizuma-san's reproducer also writes the data which maps to the
>>>>>>> file (mmap).
>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>> I understand the normal write operation (not mmap) can be blocked
>>>>>>> while
>>>>>>> fsfreezing. So, I guess we don't always block all the write
>>>>>>> operation
>>>>>>> while fsfreezing.
>>>>>> Technically speaking, we block all the transaction starts which
>>>>>> means we
>>>>>> end up blocking all the writes from going to disk. But that does
>>>>>> not mean
>>>>>> we block all the writes from going to in-memory cache - as you
>>>>>> properly
>>>>>> note the mmap case is one of such exceptions.
>>>>> Hm, I also think we can allow the writes to in-memory cache but we
>>>>> can't allow
>>>>> the writes to disk while fsfreezing. I am considering that mmap
>>>>> path can
>>>>> write to disk while fsfreezing because this deadlock problem
>>>>> happens after
>>>>> fsfreeze operation is done...
>>>> I'm sorry I don't understand now - are you speaking about the case
>>>> above
>>>> when writepage() does not wait for filesystem being frozen or something
>>>> else?
>>> Sorry, I didn't understand around the page fault path.
>>> So, I had read the kernel source code around it, then I maybe
>>> understand...
>>>
>>> I worry whether we can update the file data in mmap case while
>>> fsfreezing.
>>> Of course, I understand that we can write to in-memory cache, and it
>>> is not a
>>> problem. However, if we can write to disk while fsfreezing, it is a
>>> problem.
>>> So, I summarize the cases whether we can write to disk or not.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> Cases (Whether we can write the data mmapped to the file on the disk
>>> while fsfreezing)
>>>
>>> [1] One of the pages which has been mmapped is not bound. And
>>> the page is not allocated yet. (major fault?)
>>>
>>> (1) user dirties a page
>>> (2) a page fault occurs (do_page_fault)
>>> (3) __do_fault is called.
>>> (4) ext4_page_mkwrite is called
>>> (5) ext4_write_begin is called
>>> (6) ext4_journal_start_sb => We can STOP!
>>>
>>> [2] One of the pages which has been mmapped is not bound. But
>>> the page is already allocated, and the buffer_heads of the page
>>> are not mapped (BH_Mapped). (minor fault?)
>>>
>>> (1) user dirties a page
>>> (2) a page fault occurs (do_page_fault)
>>> (3) do_wp_page is called.
>>> (4) ext4_page_mkwrite is called
>>> (5) ext4_write_begin is called
>>> (6) ext4_journal_start_sb => We can STOP!

What happens in the following case:

Task 1: Mmapped writes
t1)ext4_page_mkwrite()
   t2) ext4_write_begin() (FS is thawed so we proceed)
   t3) ext4_write_end() (journal is stopped now)
-----Pre-empted-----


Task 2: Freeze Task
t4) freezes the super block...
...(continues)....
tn) the page cache is clean and the F.S is frozen. Freeze has completed 
execution.

Task 1: Mmapped writes
tn+1) ext4_page_mkwrite() returns 0.
tn+2) __do_fault() gets control, code gets executed.
tn+3) __do_fault() marks the page dirty if the intent is to write to a
file-based page which faulted.

So you end up dirtying the page cache when the F.S is frozen? No?
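
The interleaving above can be replayed as a small single-threaded model in C.
This is only a sketch under stated assumptions: struct toy_fs and the toy_*
functions are hypothetical, each function stands for one group of steps
(t1..t3, t4..tn, tn+3), and no real kernel locking or page-cache code is
modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one filesystem with a single page-cache page. */
struct toy_fs {
	bool frozen;		/* set once the freeze has completed */
	bool page_dirty;	/* dirty flag of the page in the page cache */
};

/* t1..t3: the fs callback only checks the frozen state, then returns. */
bool toy_mkwrite_check_only(const struct toy_fs *fs)
{
	return !fs->frozen;	/* true: the fault may proceed */
}

/* t4..tn: the freeze cleans the page cache and marks the fs frozen. */
void toy_freeze(struct toy_fs *fs)
{
	assert(!fs->frozen);	/* the model does not nest freezes */
	fs->page_dirty = false;
	fs->frozen = true;
}

/* tn+3: the generic fault path dirties the page after the callback returned. */
void toy_fault_finish(struct toy_fs *fs)
{
	fs->page_dirty = true;
}

/* Replays Task 1 / Task 2 / Task 1; true means "dirty page on a frozen fs". */
bool toy_replay_race(void)
{
	struct toy_fs fs = { false, false };

	if (!toy_mkwrite_check_only(&fs))	/* Task 1, fs still thawed */
		return false;
	toy_freeze(&fs);			/* Task 2 runs to completion */
	toy_fault_finish(&fs);			/* Task 1 resumes */
	return fs.frozen && fs.page_dirty;
}
```

In this model toy_replay_race() returns true: a check that is not held across
the dirtying step leaves a window in which the freeze can complete.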


Warm Regards,
Surbhi.

>>>
>>> [3] One of the pages which has been mmapped is not bound. But
>>> the page is already allocated, and the buffer_heads of the page
>>> are mapped (BH_Mapped). (minor fault?)
>>>
>>> (1) user dirties a page
>>> (2) a page fault occurs (do_page_fault)
>>> (3) do_wp_page is called.
>>> (4) ext4_page_mkwrite is called
>>> * Cannot block the dirty page from being written because all bhs are mapped.
>>> (5) user munmaps the page (munmap)
>>> (6) zap_pte_range dirties the page (struct page) whose PTE is dirty.
>>> (7) writeback thread writes the page (struct page) to disk
>>> => We cannot STOP!
>>>
>>> [4] One of the pages which has been mmapped is bound. And
>>> the page is already allocated.
>>>
>>> (1) user dirties a page
>>> ( ) no page fault occurs
>>> (2) user munmaps the page (munmap)
>>> (3) zap_pte_range dirties the page (struct page) whose PTE is dirty.
>>> (4) writeback thread writes the page (struct page) to disk
>>> => We cannot STOP!
>>> --------------------------------------------------------------------------
>>>
>>>
>>> So, we can block the cases [1], [2].
>>> But I think we cannot block the cases [3], [4] now.
>>> If fixing the page_mkwrite, we can also block the case [3].
>>> But the case [4] is not blocked because no page fault occurs
>>> when we dirty the mmapped page.
>>>
>>> Therefore, to repair this problem, we need to fix the cases [3], [4].
>>> I think we must modify the writeback thread to fix the case [4].
>> The trick here is that when we write a page to disk, we write-protect
>> the page (you seem to call this "the page is bound", I'm not sure
>> why).
> Hm, I want to understand how to write-protect the page under fsfreezing.
> But, anyway, I understand we don't need to consider the case [4].
>
>> So we are guaranteed to receive a minor fault (case [3]) if user tries to
>> modify a page after we finish writeback while freezing the filesystem.
>> So in principle all we need to do is just wait in ext4_page_mkwrite().
> OK. I understand.
> Are there any concrete ideas to fix this?
> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent
> it?
>
> Thanks,
> Toshiyuki Okajima
>



* (unknown), 
  2011-05-03 11:01 [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock Surbhi Palande
@ 2011-05-03 13:08 ` Surbhi Palande
  2011-05-03 13:46   ` your mail Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Surbhi Palande @ 2011-05-03 13:08 UTC (permalink / raw)
  To: jack
  Cc: toshi.okajima, tytso, m.mizuma, adilger.kernel, linux-ext4,
	linux-fsdevel, sandeen

On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
Toshiyuki pointed out.

zap_pte_range()
  mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)  

So, I think that it is here that we should do the checking for a ext4 F.S
frozen state and also prevent a parallel ext4 F.S freeze from happening.

Attaching a patch for initial review. Please do let me know your thoughts! 

Thanks a lot!

Warm Regards,
Surbhi.


* Re: your mail
  2011-05-03 13:08 ` (unknown), Surbhi Palande
@ 2011-05-03 13:46   ` Jan Kara
  2011-05-03 13:56     ` Surbhi Palande
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2011-05-03 13:46 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: jack, toshi.okajima, tytso, m.mizuma, adilger.kernel, linux-ext4,
	linux-fsdevel, sandeen

On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
> On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
> Toshiyuki pointed out.
> 
> zap_pte_range()
>   mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)  
> 
> So, I think that it is here that we should do the checking for a ext4 F.S
> frozen state and also prevent a parallel ext4 F.S freeze from happening.
> 
> Attaching a patch for initial review. Please do let me know your thoughts! 
  This is definitely the wrong place. ->set_page_dirty() callbacks are
called with various locks held and the page need not be locked (thus
dereferencing page->mapping is oopsable). Moreover this particular callback
is called only in data=journal mode.

Believe me, the right place is page_mkwrite() - you have to catch the
read-only => read-write page transition. Once the page is mapped
read-write, you've already lost the race.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: your mail
  2011-05-03 13:46   ` your mail Jan Kara
@ 2011-05-03 13:56     ` Surbhi Palande
  2011-05-03 15:26       ` Surbhi Palande
  2011-05-03 15:36       ` Jan Kara
  0 siblings, 2 replies; 8+ messages in thread
From: Surbhi Palande @ 2011-05-03 13:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: toshi.okajima, tytso, m.mizuma, adilger.kernel, linux-ext4,
	linux-fsdevel, sandeen

On 05/03/2011 04:46 PM, Jan Kara wrote:
> On Tue 03-05-11 16:08:36, Surbhi Palande wrote:

Sorry for missing the subject line :(
>> On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
>> Toshiyuki pointed out.
>>
>> zap_pte_range()
>>    mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>>
>> So, I think that it is here that we should do the checking for a ext4 F.S
>> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>>
>> Attaching a patch for initial review. Please do let me know your thoughts!
>    This is definitely the wrong place. ->set_page_dirty() callbacks are
> called with various locks held and the page need not be locked (thus
> dereferencing page->mapping is oopsable). Moreover this particular callback
> is called only in data=journal mode.
Ok! Thanks for that!

>
> Believe me, the right place is page_mkwrite() - you have to catch the
> read-only =>  read-write page transition. Once the page is mapped
> read-write, you've already lost the race.

My only point is:
1) something should prevent the freeze from happening. We can't merely
check the vfs_check_frozen()?

And this should be done where the page is marked dirty. Also, I thought
that the page is marked read-write only in the page table in the
__do_page_fault()? i.e. the zap_pte_range() marks them dirty in the page
cache? Is this understanding right?

IMHO, whatever code dirties the page in the page cache should call a F.S 
specific function and let it _prevent_ a fsfreeze while the page is 
getting dirtied, so that a freeze called after this point flushes this page!

Warm Regards,
Surbhi.

>
> 								Honza



* Re: your mail
  2011-05-03 13:56     ` Surbhi Palande
@ 2011-05-03 15:26       ` Surbhi Palande
  2011-05-03 15:36       ` Jan Kara
  1 sibling, 0 replies; 8+ messages in thread
From: Surbhi Palande @ 2011-05-03 15:26 UTC (permalink / raw)
  To: surbhi.palande
  Cc: Jan Kara, toshi.okajima, tytso, m.mizuma, adilger.kernel,
	linux-ext4, linux-fsdevel, sandeen

On 05/03/2011 04:56 PM, Surbhi Palande wrote:
> On 05/03/2011 04:46 PM, Jan Kara wrote:
>> On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
>
> Sorry for missing the subject line :(
>>> On munmap() zap_pte_range() is called which dirties the PTE dirty
>>> pages as
>>> Toshiyuki pointed out.
>>>
>>> zap_pte_range()
>>> mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>>>
>>> So, I think that it is here that we should do the checking for a ext4
>>> F.S
>>> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>>>
>>> Attaching a patch for initial review. Please do let me know your
>>> thoughts!
>> This is definitely the wrong place. ->set_page_dirty() callbacks are
>> called with various locks held and the page need not be locked (thus
>> dereferencing page->mapping is oopsable). Moreover this particular
>> callback
>> is called only in data=journal mode.
> Ok! Thanks for that!
>
>>
>> Believe me, the right place is page_mkwrite() - you have to catch the
>> read-only => read-write page transition. Once the page is mapped
>> read-write, you've already lost the race.
Also, we then need to prevent a munmap()/zap_pte_range() call from 
dirtying a mmapped file page when the F.S is frozen?

Warm Regards,
Surbhi.

>
> My only point is:
> 1) something should prevent the freeze from happening. We can't merely
> check the vfs_check_frozen()?
>
> And this should be done where the page is marked dirty. Also, I thought
> that the page is marked read-write only in the page table in the
> __do_page_fault()? i.e the zap_pte_range() marks them dirty in the page
> cache? Is this understanding right?
>
> IMHO, whatever code dirties the page in the page cache should call a F.S
> specific function and let it _prevent_ a fsfreeze while the page is
> getting dirtied, so that a freeze called after this point flushes this
> page!
>
> Warm Regards,
> Surbhi.
>
>>
>> Honza
>



* Re: your mail
  2011-05-03 13:56     ` Surbhi Palande
  2011-05-03 15:26       ` Surbhi Palande
@ 2011-05-03 15:36       ` Jan Kara
  2011-05-03 15:43         ` Surbhi Palande
  1 sibling, 1 reply; 8+ messages in thread
From: Jan Kara @ 2011-05-03 15:36 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: Jan Kara, toshi.okajima, tytso, m.mizuma, adilger.kernel,
	linux-ext4, linux-fsdevel, sandeen

On Tue 03-05-11 16:56:57, Surbhi Palande wrote:
> On 05/03/2011 04:46 PM, Jan Kara wrote:
> >On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
> 
> Sorry for missing the subject line :(
> >>On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
> >>Toshiyuki pointed out.
> >>
> >>zap_pte_range()
> >>   mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
> >>
> >>So, I think that it is here that we should do the checking for a ext4 F.S
> >>frozen state and also prevent a parallel ext4 F.S freeze from happening.
> >>
> >>Attaching a patch for initial review. Please do let me know your thoughts!
> >   This is definitely the wrong place. ->set_page_dirty() callbacks are
> >called with various locks held and the page need not be locked (thus
> >dereferencing page->mapping is oopsable). Moreover this particular callback
> >is called only in data=journal mode.
> Ok! Thanks for that!
> 
> >
> >Believe me, the right place is page_mkwrite() - you have to catch the
> >read-only =>  read-write page transition. Once the page is mapped
> >read-write, you've already lost the race.
> 
> My only point is:
> 1) something should prevent the freeze from happening. We can't
> merely check the vfs_check_frozen()?
  Yes, I agree - see my other email with patches.

> And this should be done where the page is marked dirty. Also, I
> thought that the page is marked read-write only in the page table in
> the __do_page_fault()? i.e the zap_pte_range() marks them dirty in
> the page cache? Is this understanding right?
  The page can become dirty either because it was written via standard
write - write_begin is responsible for reliable check here - or it was
written via mmap - here we rely on page_mkwrite to do a reliable check -
it is analogous to write_begin callback. There should be no other way
to dirty a page.

With dirty bits it is a bit complicated. We have two of them in fact. One
in page table entry maintained by mmu and one in page structure maintained
by kernel. Some functions (such as zap_pte_range()) copy the dirty bits
from page table into struct page. This is a lazy process so page can in
principle have new data without a dirty bit set in struct page because we
have not yet copied the dirty bit from page table. Only at moments where it
is important (like when we want to unmap the page, or throw away the page,
or so), we make sure struct page and page table bits are in sync.

Another subtle thing you may not be aware of is that when we clear page
dirty bit, we also writeprotect the page. So we are guaranteed to get a
page fault when the page is written to again.
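
The two dirty bits and the write-protect step can be sketched as a toy state
machine in C. All names here (struct toy_page, toy_user_write() and so on) are
hypothetical, and as a deliberate simplification the toy fault handler does not
itself mark struct page dirty, so that the lazy copy from the page table stays
visible.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one mmapped page; not a real kernel structure. */
struct toy_page {
	bool pte_writable;	/* mapping currently allows stores */
	bool pte_dirty;		/* dirty bit in the page table entry (MMU) */
	bool page_dirty;	/* dirty flag in struct page (kernel) */
	int  mkwrite_calls;	/* write faults that reached the filesystem */
};

/* A store through the mapping. */
void toy_user_write(struct toy_page *p)
{
	if (!p->pte_writable) {
		p->mkwrite_calls++;	/* read-only => read-write transition */
		p->pte_writable = true;
	}
	p->pte_dirty = true;		/* set by hardware, no kernel involvement */
}

/* Lazy copy of the PTE dirty bit into struct page, as done on unmap. */
void toy_sync_dirty(struct toy_page *p)
{
	if (p->pte_dirty) {
		p->page_dirty = true;
		p->pte_dirty = false;
	}
}

/* Clearing the page dirty bit for writeback also write-protects the page. */
void toy_clean_for_writeback(struct toy_page *p)
{
	toy_sync_dirty(p);
	assert(p->page_dirty);		/* the toy only writes back dirty pages */
	p->page_dirty = false;
	p->pte_writable = false;
}

/* Write twice, write back, write again; returns the number of write faults. */
int toy_scenario_fault_count(void)
{
	struct toy_page p = { false, false, false, 0 };

	toy_user_write(&p);		/* first store faults */
	toy_user_write(&p);		/* second store does not */
	assert(p.pte_dirty && !p.page_dirty);	/* new data, flag not copied yet */
	toy_clean_for_writeback(&p);
	toy_user_write(&p);		/* faults again after write-protect */
	return p.mkwrite_calls;
}
```

In this model the scenario sees two write faults: one for the first store and
one for the first store after writeback, which is the point where the
filesystem gets a chance to wait for a freeze.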

> IMHO, whatever code dirties the page in the page cache should call a
> F.S specific function and let it _prevent_ a fsfreeze while the page
> is getting dirtied, so that a freeze called after this point flushes
> this page!
  Agreed, that's what code in write_begin() and page_mkwrite() should
achieve.
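
One way to picture "preventing a freeze while a page is being dirtied" is a
count of in-flight writers that the freeze has to wait for. The C sketch below
is an assumption made for illustration, not the patches referred to in this
mail: struct toy_sb and the toy_* helpers are hypothetical, the model is
single-threaded, and "would block" is represented by returning false instead
of sleeping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-superblock freeze state. */
struct toy_sb {
	bool frozen;
	int  writers;	/* callers currently between enter and exit */
};

/* Entry of a write_begin()/page_mkwrite()-like path; false = would block. */
bool toy_write_enter(struct toy_sb *sb)
{
	if (sb->frozen)
		return false;
	sb->writers++;
	return true;
}

/* End of the dirtying path: the page is dirty before the count drops. */
void toy_write_exit(struct toy_sb *sb)
{
	assert(sb->writers > 0);
	sb->writers--;
}

/* Freeze attempt; false = would wait for in-flight writers to finish. */
bool toy_try_freeze(struct toy_sb *sb)
{
	if (sb->writers > 0)
		return false;
	sb->frozen = true;
	return true;
}

/* A writer is in flight when the freeze starts; true if the protocol held. */
bool toy_scenario_freeze_waits(void)
{
	struct toy_sb sb = { false, 0 };
	bool entered = toy_write_enter(&sb);	/* writer starts on thawed fs */
	bool early = toy_try_freeze(&sb);	/* must not complete yet */

	toy_write_exit(&sb);			/* writer finished dirtying */

	bool late = toy_try_freeze(&sb);	/* now the freeze can complete */
	bool after = toy_write_enter(&sb);	/* new writers are held off */

	return entered && !early && late && !after;
}
```

A real implementation would need atomic counters, memory barriers and wait
queues; the sketch only shows the ordering that makes a later freeze see, and
therefore flush, every page dirtied before it.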
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: your mail
  2011-05-03 15:36       ` Jan Kara
@ 2011-05-03 15:43         ` Surbhi Palande
  2011-05-04 19:24           ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Surbhi Palande @ 2011-05-03 15:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: toshi.okajima, tytso, m.mizuma, adilger.kernel, linux-ext4,
	linux-fsdevel, sandeen

On 05/03/2011 06:36 PM, Jan Kara wrote:
> On Tue 03-05-11 16:56:57, Surbhi Palande wrote:
>> On 05/03/2011 04:46 PM, Jan Kara wrote:
>>> On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
>>
>> Sorry for missing the subject line :(
>>>> On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
>>>> Toshiyuki pointed out.
>>>>
>>>> zap_pte_range()
>>>>    mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>>>>
>>>> So, I think that it is here that we should do the checking for a ext4 F.S
>>>> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>>>>
>>>> Attaching a patch for initial review. Please do let me know your thoughts!
>>>    This is definitely the wrong place. ->set_page_dirty() callbacks are
>>> called with various locks held and the page need not be locked (thus
>>> dereferencing page->mapping is oopsable). Moreover this particular callback
>>> is called only in data=journal mode.
>> Ok! Thanks for that!
>>
>>>
>>> Believe me, the right place is page_mkwrite() - you have to catch the
>>> read-only =>   read-write page transition. Once the page is mapped
>>> read-write, you've already lost the race.
>>
>> My only point is:
>> 1) something should prevent the freeze from happening. We can't
>> merely check the vfs_check_frozen()?
>    Yes, I agree - see my other email with patches.
>
>> And this should be done where the page is marked dirty. Also, I
>> thought that the page is marked read-write only in the page table in
>> the __do_page_fault()? i.e the zap_pte_range() marks them dirty in
>> the page cache? Is this understanding right?
>    The page can become dirty either because it was written via standard
> write - write_begin is responsible for reliable check here - or it was
> written via mmap - here we rely on page_mkwrite to do a reliable check -
> it is analogous to write_begin callback. There should be no other way
> to dirty a page.
>
> With dirty bits it is a bit complicated. We have two of them in fact. One
> in page table entry maintained by mmu and one in page structure maintained
> by kernel. Some functions (such as zap_pte_range()) copy the dirty bits
> from page table into struct page. This is a lazy process so page can in
> principle have new data without a dirty bit set in struct page because we
> have not yet copied the dirty bit from page table. Only at moments where it
> is important (like when we want to unmap the page, or throw away the page,
> or so), we make sure struct page and page table bits are in sync.
>
> Another subtle thing you may not be aware of is that when we clear page
> dirty bit, we also writeprotect the page. So we are guaranteed to get a
> page fault when the page is written to again.
>
>> IMHO, whatever code dirties the page in the page cache should call a
>> F.S specific function and let it _prevent_ a fsfreeze while the page
>> is getting dirtied, so that a freeze called after this point flushes
>> this page!
>    Agreed, that's what code in write_begin() and page_mkwrite() should
> achieve.
> 								Honza
Thanks a lot for the wonderful explanation :)

How about the revert, i.e. calling jbd2_journal_unlock_updates() from
ext4_unfreeze() instead of ext4_freeze()? Do you agree to that?


Warm Regards,
Surbhi.



* Re: your mail
  2011-05-03 15:43         ` Surbhi Palande
@ 2011-05-04 19:24           ` Jan Kara
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2011-05-04 19:24 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: Jan Kara, toshi.okajima, tytso, m.mizuma, adilger.kernel,
	linux-ext4, linux-fsdevel, sandeen

On Tue 03-05-11 18:43:48, Surbhi Palande wrote:
> On 05/03/2011 06:36 PM, Jan Kara wrote:
> >On Tue 03-05-11 16:56:57, Surbhi Palande wrote:
> >>On 05/03/2011 04:46 PM, Jan Kara wrote:
> >>>On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
> >>
> >>Sorry for missing the subject line :(
> >>>>On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
> >>>>Toshiyuki pointed out.
> >>>>
> >>>>zap_pte_range()
> >>>>   mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
> >>>>
> >>>>So, I think that it is here that we should do the checking for a ext4 F.S
> >>>>frozen state and also prevent a parallel ext4 F.S freeze from happening.
> >>>>
> >>>>Attaching a patch for initial review. Please do let me know your thoughts!
> >>>   This is definitely the wrong place. ->set_page_dirty() callbacks are
> >>>called with various locks held and the page need not be locked (thus
> >>>dereferencing page->mapping is oopsable). Moreover this particular callback
> >>>is called only in data=journal mode.
> >>Ok! Thanks for that!
> >>
> >>>
> >>>Believe me, the right place is page_mkwrite() - you have to catch the
> >>>read-only =>   read-write page transition. Once the page is mapped
> >>>read-write, you've already lost the race.
> >>
> >>My only point is:
> >>1) something should prevent the freeze from happening. We can't
> >>merely check the vfs_check_frozen()?
> >   Yes, I agree - see my other email with patches.
> >
> >>And this should be done where the page is marked dirty. Also, I
> >>thought that the page is marked read-write only in the page table in
> >>the __do_page_fault()? i.e the zap_pte_range() marks them dirty in
> >>the page cache? Is this understanding right?
> >   The page can become dirty either because it was written via standard
> >write - write_begin is responsible for reliable check here - or it was
> >written via mmap - here we rely on page_mkwrite to do a reliable check -
> >it is analogous to write_begin callback. There should be no other way
> >to dirty a page.
> >
> >With dirty bits it is a bit complicated. We have two of them in fact. One
> >in page table entry maintained by mmu and one in page structure maintained
> >by kernel. Some functions (such as zap_pte_range()) copy the dirty bits
> >from page table into struct page. This is a lazy process so page can in
> >principle have new data without a dirty bit set in struct page because we
> >have not yet copied the dirty bit from page table. Only at moments where it
> >is important (like when we want to unmap the page, or throw away the page,
> >or so), we make sure struct page and page table bits are in sync.
> >
> >Another subtle thing you may not be aware of is that when we clear page
> >dirty bit, we also writeprotect the page. So we are guaranteed to get a
> >page fault when the page is written to again.
> >
> >>IMHO, whatever code dirties the page in the page cache should call a
> >>F.S specific function and let it _prevent_ a fsfreeze while the page
> >>is getting dirtied, so that a freeze called after this point flushes
> >>this page!
> >   Agreed, that's what code in write_begin() and page_mkwrite() should
> >achieve.
> >								Honza
> Thanks a lot for the wonderful explanation :)
> 
> How about the revert : i.e calling  jbd2_journal_unlock_updates()
> from ext4_unfreeze() instead of the ext4_freeze()? Do you agree to
> that?
  Sorry, I don't agree with revert. We could talk about changing
jbd2_journal_unlock_updates() to not return with mutex held (and handle
synchronization of locked journal operations differently) as an alternative
to doing "freeze" reference counting. But returning with mutex held to user
space is a no-go. It will cause problems in lockdep, violate kernel locking
rules, and is generally bad programming ;).

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
