* Re: your mail
2010-06-16 16:33 (unknown), Jan Kara
@ 2010-06-16 22:15 ` Dave Chinner
2010-06-22 2:59 ` Wu Fengguang
1 sibling, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2010-06-16 22:15 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Andrew Morton, npiggin
On Wed, Jun 16, 2010 at 06:33:49PM +0200, Jan Kara wrote:
> Hello,
>
> here is the fourth version of the writeback livelock avoidance patches
> for data integrity writes. To quickly summarize the idea: we tag dirty
> pages at the beginning of write_cache_pages with a new TOWRITE tag and
> then write only tagged pages to avoid parallel writers to livelock us.
> See changelogs of the patches for more details.
> I have tested the patches with fsx and a test program I wrote which
> checks that if we crash after fsync, the data is indeed on disk.
> If there are no more concerns, can these patches get merged?
Has it been run through xfstests? I'd suggest doing that at least
with XFS as there are several significant sync sanity tests for XFS
in the suite...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: your mail
2010-06-16 16:33 (unknown), Jan Kara
2010-06-16 22:15 ` your mail Dave Chinner
@ 2010-06-22 2:59 ` Wu Fengguang
2010-06-22 13:54 ` Jan Kara
1 sibling, 1 reply; 13+ messages in thread
From: Wu Fengguang @ 2010-06-22 2:59 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Andrew Morton, npiggin
> - use tagging also for WB_SYNC_NONE writeback - there's problem with an
> interaction with wbc->nr_to_write. If we tag all dirty pages, we can
> spend too much time tagging when we write only a few pages in the end
> because of nr_to_write. If we tag only say nr_to_write pages, we may
> not have enough pages tagged because some pages are written out by
> someone else and so we would have to restart and tagging would become
This could be addressed by ignoring nr_to_write for the WB_SYNC_NONE
writeback triggered by sync(). write_cache_pages() already ignored
nr_to_write for WB_SYNC_ALL.
> essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
> writeback if we can get rid of nr_to_write. But that's a story for
> a different patch set.
Besides introducing overheads, it will be a policy change in which the
system loses control to somehow "throttle" writeback of huge files.
So it may be safer to enlarge nr_to_write instead of canceling it totally.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: your mail
2010-06-22 2:59 ` Wu Fengguang
@ 2010-06-22 13:54 ` Jan Kara
2010-06-22 14:12 ` Wu Fengguang
0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2010-06-22 13:54 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Jan Kara, linux-fsdevel, linux-mm, Andrew Morton, npiggin
On Tue 22-06-10 10:59:41, Wu Fengguang wrote:
> > - use tagging also for WB_SYNC_NONE writeback - there's problem with an
> > interaction with wbc->nr_to_write. If we tag all dirty pages, we can
> > spend too much time tagging when we write only a few pages in the end
> > because of nr_to_write. If we tag only say nr_to_write pages, we may
> > not have enough pages tagged because some pages are written out by
> > someone else and so we would have to restart and tagging would become
>
> This could be addressed by ignoring nr_to_write for the WB_SYNC_NONE
> writeback triggered by sync(). write_cache_pages() already ignored
> nr_to_write for WB_SYNC_ALL.
We could do that but frankly, I'm not very fond of adding more special
cases to writeback code than strictly necessary...
> > essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
> > writeback if we can get rid of nr_to_write. But that's a story for
> > a different patch set.
>
> Besides introducing overheads, it will be a policy change in which the
> system loses control to somehow "throttle" writeback of huge files.
Yes, but if we guarantee we cannot livelock on a single file, do we care?
Memory management does not care because it's getting rid of dirty pages
which is what it wants. User might care but actually writing out files in
the order they were dirtied (i.e., the order user written them) is quite
natural so it's not a "surprising" behavior. And I don't think we can
assume that data in those small files are more valuable than data in the
large file and thus should be written earlier...
With the overhead you are right that tagging is more expensive than
checking nr_to_write limit...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: your mail
2010-06-22 13:54 ` Jan Kara
@ 2010-06-22 14:12 ` Wu Fengguang
0 siblings, 0 replies; 13+ messages in thread
From: Wu Fengguang @ 2010-06-22 14:12 UTC (permalink / raw)
To: Jan Kara
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton,
npiggin@suse.de
On Tue, Jun 22, 2010 at 09:54:58PM +0800, Jan Kara wrote:
> On Tue 22-06-10 10:59:41, Wu Fengguang wrote:
> > > - use tagging also for WB_SYNC_NONE writeback - there's problem with an
> > > interaction with wbc->nr_to_write. If we tag all dirty pages, we can
> > > spend too much time tagging when we write only a few pages in the end
> > > because of nr_to_write. If we tag only say nr_to_write pages, we may
> > > not have enough pages tagged because some pages are written out by
> > > someone else and so we would have to restart and tagging would become
> >
> > This could be addressed by ignoring nr_to_write for the WB_SYNC_NONE
> > writeback triggered by sync(). write_cache_pages() already ignored
> > nr_to_write for WB_SYNC_ALL.
> We could do that but frankly, I'm not very fond of adding more special
> cases to writeback code than strictly necessary...
So do me. However for this case we only need to broaden the special case test:
if (nr_to_write > 0) {
nr_to_write--;
if (nr_to_write == 0 &&
- wbc->sync_mode == WB_SYNC_NONE) {
+ !wbc->for_sync) {
> > > essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
> > > writeback if we can get rid of nr_to_write. But that's a story for
> > > a different patch set.
> >
> > Besides introducing overheads, it will be a policy change in which the
> > system loses control to somehow "throttle" writeback of huge files.
> Yes, but if we guarantee we cannot livelock on a single file, do we care?
> Memory management does not care because it's getting rid of dirty pages
> which is what it wants. User might care but actually writing out files in
> the order they were dirtied (i.e., the order user written them) is quite
> natural so it's not a "surprising" behavior. And I don't think we can
> assume that data in those small files are more valuable than data in the
> large file and thus should be written earlier...
It could be a surprising behavior when, there is a 4GB dirty file and
100 small dirty files. The user may expect the 100 small dirty files
be synced to disk after 30s. However without nr_to_write, they could
be delayed by the 4GB file for 40 more seconds. Now if something goes
wrong in between and the small dirty files happen to include some
.c/.tex/.txt/... files.
Small files are more likely your precious documents that are typed in
word-by-word and perhaps the most important data you want to protect.
Naturally you'll want them enjoy more priority than large files.
> With the overhead you are right that tagging is more expensive than
> checking nr_to_write limit...
Thanks,
Fengguang
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: your mail
2011-05-03 13:08 ` (unknown), Surbhi Palande
@ 2011-05-03 13:46 ` Jan Kara
2011-05-03 13:56 ` Surbhi Palande
0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2011-05-03 13:46 UTC (permalink / raw)
To: Surbhi Palande
Cc: jack, toshi.okajima, tytso, m.mizuma, adilger.kernel, linux-ext4,
linux-fsdevel, sandeen
On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
> On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
> Toshiyuki pointed out.
>
> zap_pte_range()
> mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>
> So, I think that it is here that we should do the checking for a ext4 F.S
> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>
> Attaching a patch for initial review. Please do let me know your thoughts!
This is definitely the wrong place. ->set_page_dirty() callbacks are
called with various locks held and the page need not be locked (thus
dereferencing page->mapping is oopsable). Moreover this particular callback
is called only in data=journal mode.
Believe me, the right place is page_mkwrite() - you have to catch the
read-only => read-write page transition. Once the page is mapped
read-write, you've already lost the race.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: your mail
2011-05-03 13:46 ` your mail Jan Kara
@ 2011-05-03 13:56 ` Surbhi Palande
2011-05-03 15:26 ` Surbhi Palande
2011-05-03 15:36 ` Jan Kara
0 siblings, 2 replies; 13+ messages in thread
From: Surbhi Palande @ 2011-05-03 13:56 UTC (permalink / raw)
To: Jan Kara
Cc: toshi.okajima, tytso, m.mizuma, adilger.kernel, linux-ext4,
linux-fsdevel, sandeen
On 05/03/2011 04:46 PM, Jan Kara wrote:
> On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
Sorry for missing the subject line :(
>> On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
>> Toshiyuki pointed out.
>>
>> zap_pte_range()
>> mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>>
>> So, I think that it is here that we should do the checking for a ext4 F.S
>> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>>
>> Attaching a patch for initial review. Please do let me know your thoughts!
> This is definitely the wrong place. ->set_page_dirty() callbacks are
> called with various locks held and the page need not be locked (thus
> dereferencing page->mapping is oopsable). Moreover this particular callback
> is called only in data=journal mode.
Ok! Thanks for that!
>
> Believe me, the right place is page_mkwrite() - you have to catch the
> read-only => read-write page transition. Once the page is mapped
> read-write, you've already lost the race.
My only point is:
1) something should prevent the freeze from happening. We cant merely
check the vfs_check_frozen()?
And this should be done where the page is marked dirty.Also, I thought
that the page is marked read-write only in the page table in the
__do_page_fault()? i.e the zap_pte_range() marks them dirty in the page
cache? Is this understanding right?
IMHO, whatever code dirties the page in the page cache should call a F.S
specific function and let it _prevent_ a fsfreeze while the page is
getting dirtied, so that a freeze called after this point flushes this page!
Warm Regards,
Surbhi.
>
> Honza
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: your mail
2011-05-03 13:56 ` Surbhi Palande
@ 2011-05-03 15:26 ` Surbhi Palande
2011-05-03 15:36 ` Jan Kara
1 sibling, 0 replies; 13+ messages in thread
From: Surbhi Palande @ 2011-05-03 15:26 UTC (permalink / raw)
To: surbhi.palande
Cc: Jan Kara, toshi.okajima, tytso, m.mizuma, adilger.kernel,
linux-ext4, linux-fsdevel, sandeen
On 05/03/2011 04:56 PM, Surbhi Palande wrote:
> On 05/03/2011 04:46 PM, Jan Kara wrote:
>> On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
>
> Sorry for missing the subject line :(
>>> On munmap() zap_pte_range() is called which dirties the PTE dirty
>>> pages as
>>> Toshiyuki pointed out.
>>>
>>> zap_pte_range()
>>> mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>>>
>>> So, I think that it is here that we should do the checking for a ext4
>>> F.S
>>> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>>>
>>> Attaching a patch for initial review. Please do let me know your
>>> thoughts!
>> This is definitely the wrong place. ->set_page_dirty() callbacks are
>> called with various locks held and the page need not be locked (thus
>> dereferencing page->mapping is oopsable). Moreover this particular
>> callback
>> is called only in data=journal mode.
> Ok! Thanks for that!
>
>>
>> Believe me, the right place is page_mkwrite() - you have to catch the
>> read-only => read-write page transition. Once the page is mapped
>> read-write, you've already lost the race.
Also, we then need to prevent a munmap()/zap_pte_range() call from
dirtying a mmapped file page when the F.S is frozen?
Warm Regards,
Surbhi.
>
> My only point is:
> 1) something should prevent the freeze from happening. We cant merely
> check the vfs_check_frozen()?
>
> And this should be done where the page is marked dirty.Also, I thought
> that the page is marked read-write only in the page table in the
> __do_page_fault()? i.e the zap_pte_range() marks them dirty in the page
> cache? Is this understanding right?
>
> IMHO, whatever code dirties the page in the page cache should call a F.S
> specific function and let it _prevent_ a fsfreeze while the page is
> getting dirtied, so that a freeze called after this point flushes this
> page!
>
> Warm Regards,
> Surbhi.
>
>
>
>
>
>
>
>
>
>
>>
>> Honza
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: your mail
2011-05-03 13:56 ` Surbhi Palande
2011-05-03 15:26 ` Surbhi Palande
@ 2011-05-03 15:36 ` Jan Kara
2011-05-03 15:43 ` Surbhi Palande
1 sibling, 1 reply; 13+ messages in thread
From: Jan Kara @ 2011-05-03 15:36 UTC (permalink / raw)
To: Surbhi Palande
Cc: Jan Kara, toshi.okajima, tytso, m.mizuma, adilger.kernel,
linux-ext4, linux-fsdevel, sandeen
On Tue 03-05-11 16:56:57, Surbhi Palande wrote:
> On 05/03/2011 04:46 PM, Jan Kara wrote:
> >On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
>
> Sorry for missing the subject line :(
> >>On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
> >>Toshiyuki pointed out.
> >>
> >>zap_pte_range()
> >> mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
> >>
> >>So, I think that it is here that we should do the checking for a ext4 F.S
> >>frozen state and also prevent a parallel ext4 F.S freeze from happening.
> >>
> >>Attaching a patch for initial review. Please do let me know your thoughts!
> > This is definitely the wrong place. ->set_page_dirty() callbacks are
> >called with various locks held and the page need not be locked (thus
> >dereferencing page->mapping is oopsable). Moreover this particular callback
> >is called only in data=journal mode.
> Ok! Thanks for that!
>
> >
> >Believe me, the right place is page_mkwrite() - you have to catch the
> >read-only => read-write page transition. Once the page is mapped
> >read-write, you've already lost the race.
>
> My only point is:
> 1) something should prevent the freeze from happening. We cant
> merely check the vfs_check_frozen()?
Yes, I agree - see my other email with patches.
> And this should be done where the page is marked dirty.Also, I
> thought that the page is marked read-write only in the page table in
> the __do_page_fault()? i.e the zap_pte_range() marks them dirty in
> the page cache? Is this understanding right?
The page can become dirty either because it was written via standard
write - write_begin is responsible for reliable check here - or it was
written via mmap - here we rely on page_mkwrite to do a reliable check -
it is analogous to write_begin callback. There should be no other way
to dirty a page.
With dirty bits it is a bit complicated. We have two of them in fact. One
in page table entry maintained by mmu and one in page structure maintained
by kernel. Some functions (such as zap_pte_range()) copy the dirty bits
from page table into struct page. This is a lazy process so page can in
principle have new data without a dirty bit set in struct page because we
have not yet copied the dirty bit from page table. Only at moments where it
is important (like when we want to unmap the page, or throw away the page,
or so), we make sure struct page and page table bits are in sync.
Another subtle thing you need not be aware of it that when we clear page
dirty bit, we also writeprotect the page. So we are guaranteed to get a
page fault when the page is written to again.
> IMHO, whatever code dirties the page in the page cache should call a
> F.S specific function and let it _prevent_ a fsfreeze while the page
> is getting dirtied, so that a freeze called after this point flushes
> this page!
Agreed, that's what code in write_begin() and page_mkwrite() should
achieve.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: your mail
2011-05-03 15:36 ` Jan Kara
@ 2011-05-03 15:43 ` Surbhi Palande
2011-05-04 19:24 ` Jan Kara
0 siblings, 1 reply; 13+ messages in thread
From: Surbhi Palande @ 2011-05-03 15:43 UTC (permalink / raw)
To: Jan Kara
Cc: toshi.okajima, tytso, m.mizuma, adilger.kernel, linux-ext4,
linux-fsdevel, sandeen
On 05/03/2011 06:36 PM, Jan Kara wrote:
> On Tue 03-05-11 16:56:57, Surbhi Palande wrote:
>> On 05/03/2011 04:46 PM, Jan Kara wrote:
>>> On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
>>
>> Sorry for missing the subject line :(
>>>> On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
>>>> Toshiyuki pointed out.
>>>>
>>>> zap_pte_range()
>>>> mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>>>>
>>>> So, I think that it is here that we should do the checking for a ext4 F.S
>>>> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>>>>
>>>> Attaching a patch for initial review. Please do let me know your thoughts!
>>> This is definitely the wrong place. ->set_page_dirty() callbacks are
>>> called with various locks held and the page need not be locked (thus
>>> dereferencing page->mapping is oopsable). Moreover this particular callback
>>> is called only in data=journal mode.
>> Ok! Thanks for that!
>>
>>>
>>> Believe me, the right place is page_mkwrite() - you have to catch the
>>> read-only => read-write page transition. Once the page is mapped
>>> read-write, you've already lost the race.
>>
>> My only point is:
>> 1) something should prevent the freeze from happening. We cant
>> merely check the vfs_check_frozen()?
> Yes, I agree - see my other email with patches.
>
>> And this should be done where the page is marked dirty.Also, I
>> thought that the page is marked read-write only in the page table in
>> the __do_page_fault()? i.e the zap_pte_range() marks them dirty in
>> the page cache? Is this understanding right?
> The page can become dirty either because it was written via standard
> write - write_begin is responsible for reliable check here - or it was
> written via mmap - here we rely on page_mkwrite to do a reliable check -
> it is analogous to write_begin callback. There should be no other way
> to dirty a page.
>
> With dirty bits it is a bit complicated. We have two of them in fact. One
> in page table entry maintained by mmu and one in page structure maintained
> by kernel. Some functions (such as zap_pte_range()) copy the dirty bits
> from page table into struct page. This is a lazy process so page can in
> principle have new data without a dirty bit set in struct page because we
> have not yet copied the dirty bit from page table. Only at moments where it
> is important (like when we want to unmap the page, or throw away the page,
> or so), we make sure struct page and page table bits are in sync.
>
> Another subtle thing you need not be aware of it that when we clear page
> dirty bit, we also writeprotect the page. So we are guaranteed to get a
> page fault when the page is written to again.
>
>> IMHO, whatever code dirties the page in the page cache should call a
>> F.S specific function and let it _prevent_ a fsfreeze while the page
>> is getting dirtied, so that a freeze called after this point flushes
>> this page!
> Agreed, that's what code in write_begin() and page_mkwrite() should
> achieve.
> Honza
Thanks a lot for the wonderful explanation :)
How about the revert : i.e calling jbd2_journal_unlock_updates() from
ext4_unfreeze() instead of the ext4_freeze()? Do you agree to that?
Warm Regards,
Surbhi.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: your mail
2011-05-03 15:43 ` Surbhi Palande
@ 2011-05-04 19:24 ` Jan Kara
0 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2011-05-04 19:24 UTC (permalink / raw)
To: Surbhi Palande
Cc: Jan Kara, toshi.okajima, tytso, m.mizuma, adilger.kernel,
linux-ext4, linux-fsdevel, sandeen
On Tue 03-05-11 18:43:48, Surbhi Palande wrote:
> On 05/03/2011 06:36 PM, Jan Kara wrote:
> >On Tue 03-05-11 16:56:57, Surbhi Palande wrote:
> >>On 05/03/2011 04:46 PM, Jan Kara wrote:
> >>>On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
> >>
> >>Sorry for missing the subject line :(
> >>>>On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
> >>>>Toshiyuki pointed out.
> >>>>
> >>>>zap_pte_range()
> >>>> mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
> >>>>
> >>>>So, I think that it is here that we should do the checking for a ext4 F.S
> >>>>frozen state and also prevent a parallel ext4 F.S freeze from happening.
> >>>>
> >>>>Attaching a patch for initial review. Please do let me know your thoughts!
> >>> This is definitely the wrong place. ->set_page_dirty() callbacks are
> >>>called with various locks held and the page need not be locked (thus
> >>>dereferencing page->mapping is oopsable). Moreover this particular callback
> >>>is called only in data=journal mode.
> >>Ok! Thanks for that!
> >>
> >>>
> >>>Believe me, the right place is page_mkwrite() - you have to catch the
> >>>read-only => read-write page transition. Once the page is mapped
> >>>read-write, you've already lost the race.
> >>
> >>My only point is:
> >>1) something should prevent the freeze from happening. We cant
> >>merely check the vfs_check_frozen()?
> > Yes, I agree - see my other email with patches.
> >
> >>And this should be done where the page is marked dirty.Also, I
> >>thought that the page is marked read-write only in the page table in
> >>the __do_page_fault()? i.e the zap_pte_range() marks them dirty in
> >>the page cache? Is this understanding right?
> > The page can become dirty either because it was written via standard
> >write - write_begin is responsible for reliable check here - or it was
> >written via mmap - here we rely on page_mkwrite to do a reliable check -
> >it is analogous to write_begin callback. There should be no other way
> >to dirty a page.
> >
> >With dirty bits it is a bit complicated. We have two of them in fact. One
> >in page table entry maintained by mmu and one in page structure maintained
> >by kernel. Some functions (such as zap_pte_range()) copy the dirty bits
> >from page table into struct page. This is a lazy process so page can in
> >principle have new data without a dirty bit set in struct page because we
> >have not yet copied the dirty bit from page table. Only at moments where it
> >is important (like when we want to unmap the page, or throw away the page,
> >or so), we make sure struct page and page table bits are in sync.
> >
> >Another subtle thing you need not be aware of it that when we clear page
> >dirty bit, we also writeprotect the page. So we are guaranteed to get a
> >page fault when the page is written to again.
> >
> >>IMHO, whatever code dirties the page in the page cache should call a
> >>F.S specific function and let it _prevent_ a fsfreeze while the page
> >>is getting dirtied, so that a freeze called after this point flushes
> >>this page!
> > Agreed, that's what code in write_begin() and page_mkwrite() should
> >achieve.
> > Honza
> Thanks a lot for the wonderful explanation :)
>
> How about the revert : i.e calling jbd2_journal_unlock_updates()
> from ext4_unfreeze() instead of the ext4_freeze()? Do you agree to
> that?
Sorry, I don't agree with revert. We could talk about changing
jbd2_journal_unlock_updates() to not return with mutex held (and handle
synchronization of locked journal operations differently) as an alternative
to doing "freeze" reference counting. But returning with mutex held to user
space is no-go. It will cause problems in lockdep, violates kernel locking
rules, and generally is a bad programming ;).
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: your mail
2021-08-16 2:46 Kari Argillander
@ 2021-08-16 12:27 ` Christoph Hellwig
0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2021-08-16 12:27 UTC (permalink / raw)
To: Kari Argillander
Cc: Konstantin Komarov, Christoph Hellwig, ntfs3, linux-kernel,
linux-fsdevel, Pali Rohár, Matthew Wilcox
On Mon, Aug 16, 2021 at 05:46:59AM +0300, Kari Argillander wrote:
> I would like really like to get fsparam_flag_no also for no_acs_rules
> but then we have to make new name for it. Other possibility is to
> modify mount api so it mount option can be no/no_. I think that would
> maybe be good change.
I don't think adding another no_ alias is a good idea. I'd suggest
to just rename the existing flag before the ntfs3 driver ever hits
mainline.
^ permalink raw reply [flat|nested] 13+ messages in thread
* (no subject)
@ 2021-08-21 8:59 Kari Argillander
2021-08-22 13:13 ` your mail CGEL
0 siblings, 1 reply; 13+ messages in thread
From: Kari Argillander @ 2021-08-21 8:59 UTC (permalink / raw)
To: cgel.zte
Cc: viro, christian.brauner, jamorris, gladkov.alexey, yang.yang29,
tj, paul.gortmaker, linux-fsdevel, linux-kernel, Zeal Robot
Bcc:
Subject: Re: [PATCH] proc: prevent mount proc on same mountpoint in one pid
namespace
Reply-To:
In-Reply-To: <20210821083105.30336-1-yang.yang29@zte.com.cn>
On Sat, Aug 21, 2021 at 01:31:05AM -0700, cgel.zte@gmail.com wrote:
> From: Yang Yang <yang.yang29@zte.com.cn>
>
> Patch "proc: allow to mount many instances of proc in one pid namespace"
> aims to mount many instances of proc on different mountpoint, see
> tools/testing/selftests/proc/proc-multiple-procfs.c.
>
> But there is a side-effects, user can mount many instances of proc on
> the same mountpoint in one pid namespace, which is not allowed before.
> This duplicate mount makes no sense but wastes memory and CPU, and user
> may be confused why kernel allows it.
>
> The logic of this patch is: when try to mount proc on /mnt, check if
> there is a proc instance mount on /mnt in the same pid namespace. If
> answer is yes, return -EBUSY.
>
> Since this check can't be done in proc_get_tree(), which call
> get_tree_nodev() and will create new super_block unconditionally.
> And other nodev fs may faces the same case, so add a new hook in
> fs_context_operations.
>
> Reported-by: Zeal Robot <zealci@zte.com.cn>
> Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
> ---
> fs/namespace.c | 9 +++++++++
> fs/proc/root.c | 15 +++++++++++++++
> include/linux/fs_context.h | 1 +
> 3 files changed, 25 insertions(+)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index f79d9471cb76..84da649a70c5 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2878,6 +2878,7 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
> static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
> int mnt_flags, const char *name, void *data)
> {
> + int (*check_mntpoint)(struct fs_context *fc, struct path *path);
> struct file_system_type *type;
> struct fs_context *fc;
> const char *subtype = NULL;
> @@ -2906,6 +2907,13 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
> if (IS_ERR(fc))
> return PTR_ERR(fc);
>
> + /* check if there is a same super_block mount on path*/
> + check_mntpoint = fc->ops->check_mntpoint;
> + if (check_mntpoint)
> + err = check_mntpoint(fc, path);
> + if (err < 0)
> + goto err_fc;
> +
> if (subtype)
> err = vfs_parse_fs_string(fc, "subtype",
> subtype, strlen(subtype));
> @@ -2920,6 +2928,7 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
> if (!err)
> err = do_new_mount_fc(fc, path, mnt_flags);
>
> +err_fc:
> put_fs_context(fc);
> return err;
> }
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index c7e3b1350ef8..0971d6b0bec2 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -237,11 +237,26 @@ static void proc_fs_context_free(struct fs_context *fc)
> kfree(ctx);
> }
>
> +static int proc_check_mntpoint(struct fs_context *fc, struct path *path)
> +{
> + struct super_block *mnt_sb = path->mnt->mnt_sb;
> + struct proc_fs_info *fs_info;
> +
> + if (strcmp(mnt_sb->s_type->name, "proc") == 0) {
> + fs_info = mnt_sb->s_fs_info;
> + if (fs_info->pid_ns == task_active_pid_ns(current) &&
> + path->mnt->mnt_root == path->dentry)
> + return -EBUSY;
> + }
> + return 0;
> +}
> +
> static const struct fs_context_operations proc_fs_context_ops = {
> .free = proc_fs_context_free,
> .parse_param = proc_parse_param,
> .get_tree = proc_get_tree,
> .reconfigure = proc_reconfigure,
> + .check_mntpoint = proc_check_mntpoint,
> };
>
> static int proc_init_fs_context(struct fs_context *fc)
> diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
> index 6b54982fc5f3..090a05fb2d7d 100644
> --- a/include/linux/fs_context.h
> +++ b/include/linux/fs_context.h
> @@ -119,6 +119,7 @@ struct fs_context_operations {
> int (*parse_monolithic)(struct fs_context *fc, void *data);
> int (*get_tree)(struct fs_context *fc);
> int (*reconfigure)(struct fs_context *fc);
> + int (*check_mntpoint)(struct fs_context *fc, struct path *path);
Don't you think this should be it's own patch. It is after all internal
api change. This also needs documentation. It would be confusing if
someone convert to new mount api and there is one line which just
address some proc stuff but even commit message does not address does
every fs needs to add this.
Documentation is very good shape right now and we are in face that
everyone is migrating to use new mount api so everyting should be well
documented.
> };
>
> /*
> --
> 2.25.1
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: your mail
2021-08-21 8:59 Kari Argillander
@ 2021-08-22 13:13 ` CGEL
0 siblings, 0 replies; 13+ messages in thread
From: CGEL @ 2021-08-22 13:13 UTC (permalink / raw)
To: Kari Argillander
Cc: viro, christian.brauner, jamorris, gladkov.alexey, yang.yang29,
tj, paul.gortmaker, linux-fsdevel, linux-kernel, Zeal Robot
O
Sat, Aug 21, 2021 at 11:59:39AM +0300, Kari Argillander wrote:
> Bcc:
> Subject: Re: [PATCH] proc: prevent mount proc on same mountpoint in one pid
> namespace
> Reply-To:
> In-Reply-To: <20210821083105.30336-1-yang.yang29@zte.com.cn>
>
> On Sat, Aug 21, 2021 at 01:31:05AM -0700, cgel.zte@gmail.com wrote:
> > From: Yang Yang <yang.yang29@zte.com.cn>
> >
> > Patch "proc: allow to mount many instances of proc in one pid namespace"
> > aims to mount many instances of proc on different mountpoint, see
> > tools/testing/selftests/proc/proc-multiple-procfs.c.
> >
> > But there is a side-effects, user can mount many instances of proc on
> > the same mountpoint in one pid namespace, which is not allowed before.
> > This duplicate mount makes no sense but wastes memory and CPU, and user
> > may be confused why kernel allows it.
> >
> > The logic of this patch is: when try to mount proc on /mnt, check if
> > there is a proc instance mount on /mnt in the same pid namespace. If
> > answer is yes, return -EBUSY.
> >
> > Since this check can't be done in proc_get_tree(), which call
> > get_tree_nodev() and will create new super_block unconditionally.
> > And other nodev fs may faces the same case, so add a new hook in
> > fs_context_operations.
> >
> > Reported-by: Zeal Robot <zealci@zte.com.cn>
> > Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
> > ---
> > fs/namespace.c | 9 +++++++++
> > fs/proc/root.c | 15 +++++++++++++++
> > include/linux/fs_context.h | 1 +
> > 3 files changed, 25 insertions(+)
> >
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index f79d9471cb76..84da649a70c5 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -2878,6 +2878,7 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
> > static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
> > int mnt_flags, const char *name, void *data)
> > {
> > + int (*check_mntpoint)(struct fs_context *fc, struct path *path);
> > struct file_system_type *type;
> > struct fs_context *fc;
> > const char *subtype = NULL;
> > @@ -2906,6 +2907,13 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
> > if (IS_ERR(fc))
> > return PTR_ERR(fc);
> >
> > + /* check if there is a same super_block mount on path*/
> > + check_mntpoint = fc->ops->check_mntpoint;
> > + if (check_mntpoint)
> > + err = check_mntpoint(fc, path);
> > + if (err < 0)
> > + goto err_fc;
> > +
> > if (subtype)
> > err = vfs_parse_fs_string(fc, "subtype",
> > subtype, strlen(subtype));
> > @@ -2920,6 +2928,7 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
> > if (!err)
> > err = do_new_mount_fc(fc, path, mnt_flags);
> >
> > +err_fc:
> > put_fs_context(fc);
> > return err;
> > }
> > diff --git a/fs/proc/root.c b/fs/proc/root.c
> > index c7e3b1350ef8..0971d6b0bec2 100644
> > --- a/fs/proc/root.c
> > +++ b/fs/proc/root.c
> > @@ -237,11 +237,26 @@ static void proc_fs_context_free(struct fs_context *fc)
> > kfree(ctx);
> > }
> >
> > +static int proc_check_mntpoint(struct fs_context *fc, struct path *path)
> > +{
> > + struct super_block *mnt_sb = path->mnt->mnt_sb;
> > + struct proc_fs_info *fs_info;
> > +
> > + if (strcmp(mnt_sb->s_type->name, "proc") == 0) {
> > + fs_info = mnt_sb->s_fs_info;
> > + if (fs_info->pid_ns == task_active_pid_ns(current) &&
> > + path->mnt->mnt_root == path->dentry)
> > + return -EBUSY;
> > + }
> > + return 0;
> > +}
> > +
> > static const struct fs_context_operations proc_fs_context_ops = {
> > .free = proc_fs_context_free,
> > .parse_param = proc_parse_param,
> > .get_tree = proc_get_tree,
> > .reconfigure = proc_reconfigure,
> > + .check_mntpoint = proc_check_mntpoint,
> > };
> >
> > static int proc_init_fs_context(struct fs_context *fc)
> > diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
> > index 6b54982fc5f3..090a05fb2d7d 100644
> > --- a/include/linux/fs_context.h
> > +++ b/include/linux/fs_context.h
> > @@ -119,6 +119,7 @@ struct fs_context_operations {
> > int (*parse_monolithic)(struct fs_context *fc, void *data);
> > int (*get_tree)(struct fs_context *fc);
> > int (*reconfigure)(struct fs_context *fc);
> > + int (*check_mntpoint)(struct fs_context *fc, struct path *path);
>
> Don't you think this should be it's own patch. It is after all internal
> api change. This also needs documentation. It would be confusing if
> someone convert to new mount api and there is one line which just
> address some proc stuff but even commit message does not address does
> every fs needs to add this.
>
> Documentation is very good shape right now and we are in face that
> everyone is migrating to use new mount api so everyting should be well
> documented.
> i
Thanks for your reply!
I will take commit message more carefully next time.
Sinece I am not quit sure about this patch, so I didn't write
Documentation for patch v1. AIViro had made it clear, so this
patch is abondoned.
> > };
> >
> > /*
> > --
> > 2.25.1
> >
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2021-08-22 13:14 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-08-21 8:59 Kari Argillander
2021-08-22 13:13 ` your mail CGEL
-- strict thread matches above, loose matches on Subject: below --
2021-08-16 2:46 Kari Argillander
2021-08-16 12:27 ` your mail Christoph Hellwig
2011-05-03 11:01 [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock Surbhi Palande
2011-05-03 13:08 ` (unknown), Surbhi Palande
2011-05-03 13:46 ` your mail Jan Kara
2011-05-03 13:56 ` Surbhi Palande
2011-05-03 15:26 ` Surbhi Palande
2011-05-03 15:36 ` Jan Kara
2011-05-03 15:43 ` Surbhi Palande
2011-05-04 19:24 ` Jan Kara
2010-06-16 16:33 (unknown), Jan Kara
2010-06-16 22:15 ` your mail Dave Chinner
2010-06-22 2:59 ` Wu Fengguang
2010-06-22 13:54 ` Jan Kara
2010-06-22 14:12 ` Wu Fengguang
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).