From: Nick Piggin <nickpiggin@yahoo.com.au>
To: William Lee Irwin III <wli@holomorphy.com>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
Linux Memory Management <linux-mm@kvack.org>,
Hugh Dickins <hugh@veritas.com>,
Badari Pulavarty <pbadari@us.ibm.com>
Subject: Re: [patch][rfc] 5/5: core remove PageReserved
Date: Fri, 24 Jun 2005 09:21:31 +1000 [thread overview]
Message-ID: <42BB43FB.1060609@yahoo.com.au> (raw)
In-Reply-To: <20050623220812.GD3334@holomorphy.com>
William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>
>>>As exciting as this is, !!(vma->vm_flags & VM_RESERVED) could trivially
>>>go into struct zap_details without excess args or diff.
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>Actually, it isn't trivial. I thought of trying that.
>
>
> You're probably thinking of the zap_page_range() vs. unmap_page_range()
> topmost-level entrypoints as being the only places to set the flag in
> details. Unconditionally clobbering the field in details in
> zap_pud_range() should resolve any issues associated with that.
>
> This is obviously a style and/or diff-minimization issue.
>
I'm thinking of exit_mmap and unmap_vmas, that don't pass in a
details field at all, and all the tests for details being marked
unlikely. I ended up thinking it was less ugly this way.
>
> William Lee Irwin III wrote:
>
>>>The refcounting and PG_reserved activity in memmap_init_zone() is
>>>superfluous. bootmem.c does all the necessary accounting internally.
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>Well not superfluous yet. It is kept around to support all the arch
>>code that still uses it (not much, mainly reserved memory reporting).
>
>
> That activity is outside memmap_init_zone(). Specifically, the page
> structures are not used for anything whatsoever before bootmem puts
> them into the buddy allocators. It is not particularly interesting
> or difficult to defer even the initialization to the precise point
> in time bootmem cares to perform buddy bitmap insertion. This goes
> for all fields of struct page. What's notable about PG_reserved and
> refcounts here is that memmap_init_zone() goes about flipping bits one
> way where free_all_bootmem_core() undoes all its work.
>
This patch doesn't care how it works, that would be for a later patch.
>
> William Lee Irwin III wrote:
>
>>>There is no error returned here to be handled by the caller.
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>That's OK, the pte has been cleared. Nothing else we can do.
>
>
> This is called in the interior of a loop, which may be beneficial to
> terminate if this intended semantic is to be enforced. Furthermore, no
> error is propagated to the caller, which is not the desired effect in
> the stated error reporting scheme. So the code is inconsistent with
> explicitly stated intention.
>
No, the error reporting scheme says it doesn't handle any error,
that is all. What we have here in terms of behaviour is exactly
what used to happen, that is - do something saneish on error.
Changing behaviour would be outside the scope of this patch, but
be my guest.
>
> William Lee Irwin III wrote:
>
>>>This has no effect but to artificially constrain the interface. There
>>>is no a priori reason to avoid the use of install_page() to deposit
>>>mappings to normal pages in VM_RESERVED vmas. It's only the reverse
>>>being "banned" here. Similar comments also apply to zap_pte().
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>No, install_page is playing with the page (eg. page_add_file_rmap)
>>which is explicity banned even before my PageReserved removal. It
>>is unclear that this ever safely worked for normal pages, and will
>>hit NULL page dereferences if trying to do it with iomem.
>
>
> You are going on about the fact that install_page() can't be used on
> memory outside mem_map[] as it requires a page structure, and can't be
> used on reserved pages because page_add_file_rmap() will BUG. This case
> is not being discussed.
>
And that it isn't allowed to touch struct page of physical pages
in a VM_RESERVED region.
> The issue at stake is inserting normal pages into a VM_RESERVED vma.
> These will arise as e.g. kernel-allocated buffers managed by normal
> reference counting. remap_pfn_range() can't do it; it refuses to
> operate on "valid memory". install_page() now won't do it; it refuses
> to touch a VM_RESERVED vma. So this creates a giant semantic hole,
> and potentially breaks working code (i.e. if you were going to do
> this you would need not only a replacement but also a sweep to adjust
> all the drivers doing it or prove their nonexistence).
>
I think you'll find that remap_pfn_range will be happy to operate
on valid memory, and that any driver trying to use install_page
on VM_RESERVED probably needs fixing anyway.
>
> William Lee Irwin III wrote:
>
>>>An answer should be devised for this. My numerous SCSI CD-ROM devices
>>>(I have 5 across several different machines of several different arches)
>>>are rather unlikely to be happy with /* FIXME: XXX ... as an answer.
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>The worst it will do is dirty a VM_RESERVED page. So it is going
>>to work unless you're thinking about doing something crazy like
>>mmap /dev/mem and send your CDROM some pages from there. But yeah,
>>I have to find someone who knows what they're doing to look at this.
>
>
> Then you should replace FIXME: XXX with this explanation. By and large
> the presence of "FIXME: XXX" is a sign there is a gigantic hole in the
> code. It should definitely not be done with new code, but rather,
> exclusively confined to documenting discoveries of preexisting brokenness.
> After all, if a patch is introducing broken code, why would we merge it?
> Best to adjust the commentary and avoid that question altogether.
>
We wouldn't merge it. Hence this is an rfc and I explicitly said
it is not for merging.
> There are actually larger questions about this than the reserved page
> handling. If e.g. pagecache pages need to be dirtied the raw bitflag
> toggling is probably not how it should be done.
>
Yep.
>
> William Lee Irwin III wrote:
>
>>>snd_malloc_pages() marks all pages it allocates PG_reserved. This
>>>merits some commentary, and likely the removal of the superfluous
>>>PG_reserved usage.
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>Sure, but not in this patch. The aim here is just to eliminate special
>>casing of refcounting. Other PG_reserved usage can stay around for the
>>moment (and is actually good for catching errors).
>
>
> Unfortunately for this scheme, it's very much a case of putting the
> cart before the horse. PG_reserved is toggled at random in this driver
> after this change, to no useful effect (debugging or otherwise). And
> this really goes for the whole affair. Diddling the core first is just
> going to create bugs. Converting the users first is the way these
> things need to be done. When complete, nothing needs the core flags
> twiddling anymore and you just nuke the flag twiddling from the core.
>
I'm sorry, I don't see how 'diddling' the core will create bugs.
This is a fine way to do it, and "converting" users first (whatever
that means) is not possible because VM_RESERVED handling in core
code is not up to the task of replacing PageReserved without this
patch.
>
> William Lee Irwin III wrote:
>
>>>This is user-triggerable where the driver honors mmap() protection
>>>flags blindly.
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>If the user is allowed write access to VM_RESERVED memory, then I
>>suspect there is a lot worse they can do than flood the log.
>>But the check isn't going to stay around forever.
>
>
> This doesn't really do a whole lot of good for the unwitting user
> who invokes a privileged application relying on such kernel behavior.
> User-visible changes need to be taken on with somewhat more care
> (okay, vastly more, along with long-term backward compatibility).
>
It does a great lot of good, because they can tell us about it
we'll fix it. Search the kernel sources and you'll find other
examples that look almost exactly like this one.
>
> William Lee Irwin III wrote:
>
>>>This behavioral change needs to be commented on. There are some additional
>>>difficulties when memory holes are unintentionally covered by mem_map[];
>>>It is beneficial otherwise. It's likely to triplefault on such holes.
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>It seems the author of this code themselves didn't really understand
>>what was going on here, so I'm buggered if I can be bothered :)
>>Remember though, PageReserved can stay around for as long as we need,
>>so this hunk can be trivially reverted if it is an immediate problem.
>
>
> This doesn't really fly. A fair number of drivers are poorly-understood
> and numerous gyrations have to be gone through to avoid breaking their
> possible assumptions until at long last clarifications are made (that
> is, if they're ever made). swsusp is not demotable to below their
> status on a whim.
>
Yep, that's why I'm going to ask some swsusp developers to have
a look at it.
I wouldn't pretend to be able to fix every bug everywhere in the
kernel myself.
Thanks,
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
WARNING: multiple messages have this Message-ID (diff)
From: Nick Piggin <nickpiggin@yahoo.com.au>
To: William Lee Irwin III <wli@holomorphy.com>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
Linux Memory Management <linux-mm@kvack.org>,
Hugh Dickins <hugh@veritas.com>,
Badari Pulavarty <pbadari@us.ibm.com>
Subject: Re: [patch][rfc] 5/5: core remove PageReserved
Date: Fri, 24 Jun 2005 09:21:31 +1000 [thread overview]
Message-ID: <42BB43FB.1060609@yahoo.com.au> (raw)
In-Reply-To: <20050623220812.GD3334@holomorphy.com>
William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>
>>>As exciting as this is, !!(vma->vm_flags & VM_RESERVED) could trivially
>>>go into struct zap_details without excess args or diff.
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>Actually, it isn't trivial. I thought of trying that.
>
>
> You're probably thinking of the zap_page_range() vs. unmap_page_range()
> topmost-level entrypoints as being the only places to set the flag in
> details. Unconditionally clobbering the field in details in
> zap_pud_range() should resolve any issues associated with that.
>
> This is obviously a style and/or diff-minimization issue.
>
I'm thinking of exit_mmap and unmap_vmas, that don't pass in a
details field at all, and all the tests for details being marked
unlikely. I ended up thinking it was less ugly this way.
>
> William Lee Irwin III wrote:
>
>>>The refcounting and PG_reserved activity in memmap_init_zone() is
>>>superfluous. bootmem.c does all the necessary accounting internally.
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>Well not superfluous yet. It is kept around to support all the arch
>>code that still uses it (not much, mainly reserved memory reporting).
>
>
> That activity is outside memmap_init_zone(). Specifically, the page
> structures are not used for anything whatsoever before bootmem puts
> them into the buddy allocators. It is not particularly interesting
> or difficult to defer even the initialization to the precise point
> in time bootmem cares to perform buddy bitmap insertion. This goes
> for all fields of struct page. What's notable about PG_reserved and
> refcounts here is that memmap_init_zone() goes about flipping bits one
> way where free_all_bootmem_core() undoes all its work.
>
This patch doesn't care how it works, that would be for a later patch.
>
> William Lee Irwin III wrote:
>
>>>There is no error returned here to be handled by the caller.
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>That's OK, the pte has been cleared. Nothing else we can do.
>
>
> This is called in the interior of a loop, which may be beneficial to
> terminate if this intended semantic is to be enforced. Furthermore, no
> error is propagated to the caller, which is not the desired effect in
> the stated error reporting scheme. So the code is inconsistent with
> explicitly stated intention.
>
No, the error reporting scheme says it doesn't handle any error,
that is all. What we have here in terms of behaviour is exactly
what used to happen, that is - do something saneish on error.
Changing behaviour would be outside the scope of this patch, but
be my guest.
>
> William Lee Irwin III wrote:
>
>>>This has no effect but to artificially constrain the interface. There
>>>is no a priori reason to avoid the use of install_page() to deposit
>>>mappings to normal pages in VM_RESERVED vmas. It's only the reverse
>>>being "banned" here. Similar comments also apply to zap_pte().
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>No, install_page is playing with the page (eg. page_add_file_rmap)
>>which is explicity banned even before my PageReserved removal. It
>>is unclear that this ever safely worked for normal pages, and will
>>hit NULL page dereferences if trying to do it with iomem.
>
>
> You are going on about the fact that install_page() can't be used on
> memory outside mem_map[] as it requires a page structure, and can't be
> used on reserved pages because page_add_file_rmap() will BUG. This case
> is not being discussed.
>
And that it isn't allowed to touch struct page of physical pages
in a VM_RESERVED region.
> The issue at stake is inserting normal pages into a VM_RESERVED vma.
> These will arise as e.g. kernel-allocated buffers managed by normal
> reference counting. remap_pfn_range() can't do it; it refuses to
> operate on "valid memory". install_page() now won't do it; it refuses
> to touch a VM_RESERVED vma. So this creates a giant semantic hole,
> and potentially breaks working code (i.e. if you were going to do
> this you would need not only a replacement but also a sweep to adjust
> all the drivers doing it or prove their nonexistence).
>
I think you'll find that remap_pfn_range will be happy to operate
on valid memory, and that any driver trying to use install_page
on VM_RESERVED probably needs fixing anyway.
>
> William Lee Irwin III wrote:
>
>>>An answer should be devised for this. My numerous SCSI CD-ROM devices
>>>(I have 5 across several different machines of several different arches)
>>>are rather unlikely to be happy with /* FIXME: XXX ... as an answer.
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>The worst it will do is dirty a VM_RESERVED page. So it is going
>>to work unless you're thinking about doing something crazy like
>>mmap /dev/mem and send your CDROM some pages from there. But yeah,
>>I have to find someone who knows what they're doing to look at this.
>
>
> Then you should replace FIXME: XXX with this explanation. By and large
> the presence of "FIXME: XXX" is a sign there is a gigantic hole in the
> code. It should definitely not be done with new code, but rather,
> exclusively confined to documenting discoveries of preexisting brokenness.
> After all, if a patch is introducing broken code, why would we merge it?
> Best to adjust the commentary and avoid that question altogether.
>
We wouldn't merge it. Hence this is an rfc and I explicitly said
it is not for merging.
> There are actually larger questions about this than the reserved page
> handling. If e.g. pagecache pages need to be dirtied the raw bitflag
> toggling is probably not how it should be done.
>
Yep.
>
> William Lee Irwin III wrote:
>
>>>snd_malloc_pages() marks all pages it allocates PG_reserved. This
>>>merits some commentary, and likely the removal of the superfluous
>>>PG_reserved usage.
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>Sure, but not in this patch. The aim here is just to eliminate special
>>casing of refcounting. Other PG_reserved usage can stay around for the
>>moment (and is actually good for catching errors).
>
>
> Unfortunately for this scheme, it's very much a case of putting the
> cart before the horse. PG_reserved is toggled at random in this driver
> after this change, to no useful effect (debugging or otherwise). And
> this really goes for the whole affair. Diddling the core first is just
> going to create bugs. Converting the users first is the way these
> things need to be done. When complete, nothing needs the core flags
> twiddling anymore and you just nuke the flag twiddling from the core.
>
I'm sorry, I don't see how 'diddling' the core will create bugs.
This is a fine way to do it, and "converting" users first (whatever
that means) is not possible because VM_RESERVED handling in core
code is not up to the task of replacing PageReserved without this
patch.
>
> William Lee Irwin III wrote:
>
>>>This is user-triggerable where the driver honors mmap() protection
>>>flags blindly.
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>If the user is allowed write access to VM_RESERVED memory, then I
>>suspect there is a lot worse they can do than flood the log.
>>But the check isn't going to stay around forever.
>
>
> This doesn't really do a whole lot of good for the unwitting user
> who invokes a privileged application relying on such kernel behavior.
> User-visible changes need to be taken on with somewhat more care
> (okay, vastly more, along with long-term backward compatibility).
>
It does a great lot of good, because they can tell us about it
we'll fix it. Search the kernel sources and you'll find other
examples that look almost exactly like this one.
>
> William Lee Irwin III wrote:
>
>>>This behavioral change needs to be commented on. There are some additional
>>>difficulties when memory holes are unintentionally covered by mem_map[];
>>>It is beneficial otherwise. It's likely to triplefault on such holes.
>
>
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
>
>>It seems the author of this code themselves didn't really understand
>>what was going on here, so I'm buggered if I can be bothered :)
>>Remember though, PageReserved can stay around for as long as we need,
>>so this hunk can be trivially reverted if it is an immediate problem.
>
>
> This doesn't really fly. A fair number of drivers are poorly-understood
> and numerous gyrations have to be gone through to avoid breaking their
> possible assumptions until at long last clarifications are made (that
> is, if they're ever made). swsusp is not demotable to below their
> status on a whim.
>
Yep, that's why I'm going to ask some swsusp developers to have
a look at it.
I wouldn't pretend to be able to fix every bug everywhere in the
kernel myself.
Thanks,
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
next prev parent reply other threads:[~2005-06-23 23:22 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2005-06-23 7:05 [patch][rfc] 0/5: remove PageReserved Nick Piggin
2005-06-23 7:05 ` Nick Piggin
2005-06-23 7:06 ` [patch][rfc] 1/5: comment for mm/rmap.c Nick Piggin
2005-06-23 7:06 ` [patch][rfc] 2/5: micro optimisation " Nick Piggin
2005-06-23 7:07 ` [patch][rfc] 3/5: remove atomic bitop when freeing page Nick Piggin
2005-06-23 7:07 ` [patch][rfc] 4/5: remap ZERO_PAGE mappings Nick Piggin
2005-06-23 7:08 ` [patch][rfc] 5/5: core remove PageReserved Nick Piggin
2005-06-23 9:51 ` William Lee Irwin III
2005-06-23 9:51 ` William Lee Irwin III
2005-06-23 10:32 ` Nick Piggin
2005-06-23 10:32 ` Nick Piggin
2005-06-23 22:08 ` William Lee Irwin III
2005-06-23 22:08 ` William Lee Irwin III
2005-06-23 23:21 ` Nick Piggin [this message]
2005-06-23 23:21 ` Nick Piggin
2005-06-24 0:59 ` William Lee Irwin III
2005-06-24 0:59 ` William Lee Irwin III
2005-06-24 1:17 ` Nick Piggin
2005-06-24 1:17 ` Nick Piggin
2005-06-24 1:47 ` Nick Piggin
2005-06-24 1:47 ` Nick Piggin
2005-06-24 1:25 ` Nick Piggin
2005-06-24 1:25 ` Nick Piggin
2005-06-24 4:50 ` Andrew Morton
2005-06-24 4:50 ` Andrew Morton
2005-06-24 8:24 ` William Lee Irwin III
2005-06-24 8:24 ` William Lee Irwin III
2005-06-26 8:41 ` Nick Piggin
2005-06-26 8:41 ` Nick Piggin
2005-06-23 7:26 ` [patch][rfc] 2/5: micro optimisation for mm/rmap.c William Lee Irwin III
2005-06-23 7:26 ` William Lee Irwin III
2005-06-23 7:33 ` Nick Piggin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=42BB43FB.1060609@yahoo.com.au \
--to=nickpiggin@yahoo.com.au \
--cc=hugh@veritas.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=pbadari@us.ibm.com \
--cc=wli@holomorphy.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.