Re: test13-pre5

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: test13-pre5
  2000-12-28 20:25 test13-pre5 Linus Torvalds
@ 2000-12-28 18:39 ` Marcelo Tosatti
  2000-12-28 20:59   ` test13-pre5 Linus Torvalds
  2000-12-28 21:27 ` test13-pre5 (via82cxxx_audio.c) Jonathan Hudson
  2000-12-29 18:17 ` test13-pre5 Tom Rini
  2 siblings, 1 reply; 62+ messages in thread
From: Marcelo Tosatti @ 2000-12-28 18:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List


On Thu, 28 Dec 2000, Linus Torvalds wrote:

> This still doesn't tell "sync()" about dirty pages (ie the "innd loses the
> active file after a reboot" bug), but now the places that mark pages dirty
> are under control. Next step..

Do you really want to split the per-address-space pages list in dirty and
clean lists for 2.4 ?

Or do you think walking the current per-address-space page list searching
for dirty pages and syncing them is ok?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 20:59   ` test13-pre5 Linus Torvalds
@ 2000-12-28 19:22     ` Marcelo Tosatti
  2000-12-28 21:21       ` [wildly off-topic] test13-pre5 Rik van Riel
  2000-12-28 21:39     ` test13-pre5 Daniel Phillips
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 62+ messages in thread
From: Marcelo Tosatti @ 2000-12-28 19:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List


On Thu, 28 Dec 2000, Linus Torvalds wrote:

> If somebody (you? hint, hint) wants to do this,

Ok, I'll do it because I love Tove. 

> I'd be very happy - I can do it myself, but because it's my birthday
> I'm supposed to drag myself off the computer soon and be social, or
> Tove will be grumpy.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [wildly off-topic] Re: test13-pre5
  2000-12-28 21:21       ` [wildly off-topic] test13-pre5 Rik van Riel
@ 2000-12-28 19:31         ` Marcelo Tosatti
  0 siblings, 0 replies; 62+ messages in thread
From: Marcelo Tosatti @ 2000-12-28 19:31 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linus Torvalds, Kernel Mailing List


On Thu, 28 Dec 2000, Rik van Riel wrote:

> On Thu, 28 Dec 2000, Marcelo Tosatti wrote:
> > On Thu, 28 Dec 2000, Linus Torvalds wrote:
> > 
> > > If somebody (you? hint, hint) wants to do this,
> > 
> > Ok, I'll do it because I love Tove. 
> 
> Marcelo, you should buy some glasses ;)
> 
> Tove != Tux
> 
> It's ok and probably safe to love Tux, the nice cuddly
> penguin everybody loves.
> 
> However, loving the (6-time ??) Finnish female karate
> champion, who happens to be married to Linus is probably
> quite a bit less safe ...

Marcelo runs like hell. 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* test13-pre5
@ 2000-12-28 20:25 Linus Torvalds
  2000-12-28 18:39 ` test13-pre5 Marcelo Tosatti
                   ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Linus Torvalds @ 2000-12-28 20:25 UTC (permalink / raw)
  To: Kernel Mailing List


The main notables are the network fixes (uninitialized skb->dev could and
did cause oopses in ip_defrag) and the mm fixes (dirty pages without
mappings etc, causing problems in page_launder).

The mm cleanups also include removing "swapout()" as a VM operation, as
nobody can sanely do anything more than just marking the page dirty anyway
(the real work is done by writepage() these days), and doing that
explicitly simplifies VM scanning considerably.

This still doesn't tell "sync()" about dirty pages (ie the "innd loses the
active file after a reboot" bug), but now the places that mark pages dirty
are under control. Next step..

		Linus

-----

 - pre5:
   - NIIBE Yutaka: SuperH update
   - Geert Uytterhoeven: m68k update
   - David Miller: TCP RTO calc fix, UDP multicast fix etc
   - Duncan Laurie: ServerWorks PIRQ routing definition.
   - mm PageDirty cleanups, added sanity checks, and don't lose the bit. 

 - pre4:
   - Christoph Rohland: shmfs cleanup
   - Nicolas Pitre: don't forget loop.c flags
   - Geert Uytterhoeven: new-style m68k Makefiles
   - Neil Brown: knfsd cleanups, raid5 re-org
   - Andrea Arkangeli: update to LVM-0.9
   - LC Chang: sis900 driver doc update
   - David Miller: netfilter oops fix
   - Andrew Grover: acpi update

 - pre3:
   - Christian Jullien: smc9194: proper dev_kfree_skb_irq
   - Cort Dougan: new-style PowerPC Makefiles
   - Andrew Morton, Petr Vandrovec: fix run_task_queue
   - Christoph Rohland: shmfs for shared memory handling

 - pre2:
   - Kai Germaschewski: ISDN update (including Makefiles)
   - Jens Axboe: cdrom updates
   - Petr Vandrovec; Matrox G450 support
   - Bill Nottingham: fix FAT32 filesystems on 64-bit platforms
   - David Miller: sparc (and other) Makefile fixup
   - Andrea Arkangeli: alpha SMP TLB context fix (and cleanups)
   - Niels Kristian Bech Jensen: checkconfig, USB warnings
   - Andrew Grover: large ACPI update

 - pre1:
   - me: drop support for old-style Makefiles entirely. Big.
   - me: check b_end_io at the IO submission path
   - me: fix "ptep_mkdirty()" (so that swapoff() works correctly)
   - fix fault case in copy_from_user() with a constant size, where
     ((size & 3) == 3)


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 18:39 ` test13-pre5 Marcelo Tosatti
@ 2000-12-28 20:59   ` Linus Torvalds
  2000-12-28 19:22     ` test13-pre5 Marcelo Tosatti
                       ` (4 more replies)
  0 siblings, 5 replies; 62+ messages in thread
From: Linus Torvalds @ 2000-12-28 20:59 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Kernel Mailing List


On Thu, 28 Dec 2000, Marcelo Tosatti wrote:
> 
> On Thu, 28 Dec 2000, Linus Torvalds wrote:
> 
> > This still doesn't tell "sync()" about dirty pages (ie the "innd loses the
> > active file after a reboot" bug), but now the places that mark pages dirty
> > are under control. Next step..
> 
> Do you really want to split the per-address-space pages list in dirty and
> clean lists for 2.4 ?
> 
> Or do you think walking the current per-address-space page list searching
> for dirty pages and syncing them is ok?

There are a few issues:

 - fdatasync/fsync is often quite critical for databases. It's _possibly_
   ok to just walk all the pages for an inode, but I'm fairly certain that
   this is an area where if we don't have a separate dirty queue we _will_
   need to add one later.

 - global dirty list for global syn(). We don't have one, and I don't
   think we want one. We could add a few lists, and split up the active
   list into "active" and "active_dirty", for example, but I don't like
   the implications that would probably have for the LRU ordering.

 - we absolutely do _not_ want to make "struct page" bigger. We can't
   afford to just throw away another 8 bytes per page on adding a new list
   structure, I feel. Even if this would be the simplest solution.

So right now I think the right idea is to

 - split up "address_space->pages" into "->clean_pages" and
   "->dirty_pages".  This is fairly easily done, it requires small changes
   like making "truncate_inode_pages()" instead be
   "truncate_list_pages()", and making "truncate_inode_pages()" call that
   for both the dirty and the clean lists. That's about 10 lines of diff
   (I already tried this), and that's about the biggest example of
   something like that. Most other areas don't much care about the inode
   page lists.

 - make "SetPageDirty()" do something like

	if (!test_and_set(PG_dirty, &page->flags)) {
		spin_lock(&page_cache_lock);
		list_del(page->list);
		list_add(page->list, page->mapping->dirty_pages);
		spin_unlock(&page_cache_lock);
	}

   This will require making sure that every place that does a
   SetPageDirty() will be ok with this (ie double-check that they all have
   a mapping: right now the free_pte() code in mm/memory.c doesn't care,
   because it knew that it coul dmark even anonymous pages dirty and
   they'd just get ignored.

 - make a filemap_fdatasync() that walks the dirty pages and does a
   writepage() on them all and moves them to the clean list.

 - make fsync() and fdatasync() call the above function before they even
   call the low-level per-FS code.

 - make sync_inodes() use that same filemap_fdatasync() function so that
   the sync() case is handled.

All done. It looks something like 5-10 places, most of which are about 10
lines of diff each, if even that.

The only real worry would be that the locking isn't rigth, but getting the
pagemap lock should be the safe thing, and from a lock contention
standpoint it should be ok (if we move a lot of pages back and forth, lock
contention is going to be the least of our worries, because it implies
that we'd be doing a LOT of IO to actually write the pages out).

If somebody (you? hint, hint) wants to do this, I'd be very happy - I can
do it myself, but because it's my birthday I'm supposed to drag myself off
the computer soon and be social, or Tove will be grumpy.

And you don't want Tove grumpy.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [wildly off-topic] Re: test13-pre5
  2000-12-28 19:22     ` test13-pre5 Marcelo Tosatti
@ 2000-12-28 21:21       ` Rik van Riel
  2000-12-28 19:31         ` Marcelo Tosatti
  0 siblings, 1 reply; 62+ messages in thread
From: Rik van Riel @ 2000-12-28 21:21 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, Kernel Mailing List

On Thu, 28 Dec 2000, Marcelo Tosatti wrote:
> On Thu, 28 Dec 2000, Linus Torvalds wrote:
> 
> > If somebody (you? hint, hint) wants to do this,
> 
> Ok, I'll do it because I love Tove. 

Marcelo, you should buy some glasses ;)

Tove != Tux

It's ok and probably safe to love Tux, the nice cuddly
penguin everybody loves.

However, loving the (6-time ??) Finnish female karate
champion, who happens to be married to Linus is probably
quite a bit less safe ...

cheers,

Rik
--
Hollywood goes for world dumbination,
	Trailer at 11.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5 (via82cxxx_audio.c)
  2000-12-28 20:25 test13-pre5 Linus Torvalds
  2000-12-28 18:39 ` test13-pre5 Marcelo Tosatti
@ 2000-12-28 21:27 ` Jonathan Hudson
  2000-12-29 18:17 ` test13-pre5 Tom Rini
  2 siblings, 0 replies; 62+ messages in thread
From: Jonathan Hudson @ 2000-12-28 21:27 UTC (permalink / raw)
  To: linux-kernel


In article <Pine.LNX.4.10.10012281220470.17769-100000@penguin.transmeta.com>,
	Linus Torvalds <torvalds@transmeta.com> writes:
LT> 
LT> The mm cleanups also include removing "swapout()" as a VM operation, as

swapout was not removed from drivers/sound/via82cxxx_audio.c; the
following does so (compiles and produces sound, someone who
understands this please check).


--- drivers/sound/via82cxxx_audio.c.orig	Thu Dec 28 21:02:03 2000
+++ drivers/sound/via82cxxx_audio.c	Thu Dec 28 21:12:58 2000
@@ -1727,20 +1727,8 @@
 }
 
 
-#ifndef VM_RESERVE
-static int via_mm_swapout (struct page *page, struct file *filp)
-{
-	return 0;
-}
-#endif /* VM_RESERVE */
-
-
 struct vm_operations_struct via_mm_ops = {
 	nopage:		via_mm_nopage,
-
-#ifndef VM_RESERVE
-	swapout:	via_mm_swapout,
-#endif
 };
 
 
 
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 20:59   ` test13-pre5 Linus Torvalds
  2000-12-28 19:22     ` test13-pre5 Marcelo Tosatti
@ 2000-12-28 21:39     ` Daniel Phillips
  2000-12-28 22:14     ` test13-pre5 Marcelo Tosatti
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 62+ messages in thread
From: Daniel Phillips @ 2000-12-28 21:39 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

Linus Torvalds wrote:
>  - global dirty list for global syn(). We don't have one, and I don't
>    think we want one. We could add a few lists, and split up the active
>    list into "active" and "active_dirty", for example, but I don't like
>    the implications that would probably have for the LRU ordering.

This has been the subject of a lot of flam^H^H^H^H discussion on
#kernelnewbies about this and it's still an open question.  The only way
to see if a separate active_dirty hurts or helps is to try it.  Later.
:-)

I don't see how a separate active_dirty list can hurt LRU ordering.  We
could still take the pages off the two lists in the same order we did
with one list if we wanted to, or at least, statistically the same in
turns of number, age, time since entering the list, etc.  That better
not cause radically different or undesireable behaviour or something is
really broken.

By breaking active into two lists we'd get a very interesting tuning
parameter to play with: the relative rate at which pages are moved from
active to inactive.  Beyond that, the active_dirty list could be pressed
into service quite easily as a page-oriented version of kflushd, and
would obviously be useful as a global sync list.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 20:59   ` test13-pre5 Linus Torvalds
  2000-12-28 19:22     ` test13-pre5 Marcelo Tosatti
  2000-12-28 21:39     ` test13-pre5 Daniel Phillips
@ 2000-12-28 22:14     ` Marcelo Tosatti
  2000-12-29  1:10       ` test13-pre5 Linus Torvalds
  2000-12-29 12:06       ` test13-pre5 Daniel Phillips
  2000-12-28 22:17     ` test13-pre5 Andi Kleen
  2000-12-29 16:49     ` [PATCH] filemap_fdatasync & related changes Marcelo Tosatti
  4 siblings, 2 replies; 62+ messages in thread
From: Marcelo Tosatti @ 2000-12-28 22:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List



On Thu, 28 Dec 2000, Linus Torvalds wrote:

>  - make "SetPageDirty()" do something like
> 
> 	if (!test_and_set(PG_dirty, &page->flags)) {
> 		spin_lock(&page_cache_lock);
> 		list_del(page->list);
> 		list_add(page->list, page->mapping->dirty_pages);
> 		spin_unlock(&page_cache_lock);
> 	}
> 
>    This will require making sure that every place that does a
>    SetPageDirty() will be ok with this (ie double-check that they all have
>    a mapping: right now the free_pte() code in mm/memory.c doesn't care,
>    because it knew that it coul dmark even anonymous pages dirty and
>    they'd just get ignored.
>  - make a filemap_fdatasync() that walks the dirty pages and does a
>    writepage() on them all and moves them to the clean list.

We also want to move the page to the per-address-space clean list in
ClearPageDirty I suppose.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 20:59   ` test13-pre5 Linus Torvalds
                       ` (2 preceding siblings ...)
  2000-12-28 22:14     ` test13-pre5 Marcelo Tosatti
@ 2000-12-28 22:17     ` Andi Kleen
  2000-12-28 22:33       ` test13-pre5 David S. Miller
  2000-12-28 23:15       ` test13-pre5 Linus Torvalds
  2000-12-29 16:49     ` [PATCH] filemap_fdatasync & related changes Marcelo Tosatti
  4 siblings, 2 replies; 62+ messages in thread
From: Andi Kleen @ 2000-12-28 22:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marcelo Tosatti, Kernel Mailing List

On Thu, Dec 28, 2000 at 12:59:22PM -0800, Linus Torvalds wrote:
>  - we absolutely do _not_ want to make "struct page" bigger. We can't
>    afford to just throw away another 8 bytes per page on adding a new list
>    structure, I feel. Even if this would be the simplest solution.

BTW..

The current 2.4 struct page could be already shortened a lot, saving a lot
of cache.
(first number for 32bit, second for 64bit) 

- Do not compile virtual in when the kernel does not support highmem
(saves 4/8 bytes) 
- Instead of having a zone pointer mask use a 8 or 16 byte index into a 
zone table. On a modern CPU it is much cheaper to do the and/shifts than
to do even a single cache miss during page aging. On a lot of systems 
that zone index could be hardcoded to 0 anyways, giving better code.
- Instead of using 4/8 bytes for the age use only 16bit (FreeBSD which
has the same swapping algorithm even only uses 8bit) 
- Remove the waitqueue debugging (obvious @)
- flags can be __u32 on 64bit hosts, sharing 64bit with something that
is tolerant to async updates (e.g. the zone table index or the index) 
- index could be probably u32 instead of unsigned long, saving 4 bytes
on i386

Would you consider patches for any of these points? 


-Andi


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 22:17     ` test13-pre5 Andi Kleen
@ 2000-12-28 22:33       ` David S. Miller
  2000-12-28 22:58         ` test13-pre5 Andi Kleen
  2000-12-29 15:46         ` test13-pre5 Mark Hemment
  2000-12-28 23:15       ` test13-pre5 Linus Torvalds
  1 sibling, 2 replies; 62+ messages in thread
From: David S. Miller @ 2000-12-28 22:33 UTC (permalink / raw)
  To: ak; +Cc: torvalds, marcelo, linux-kernel

   Date: 	Thu, 28 Dec 2000 23:17:22 +0100
   From: Andi Kleen <ak@suse.de>

   Would you consider patches for any of these points? 

To me it seems just as important to make sure struct page is
a power of 2 in size, with the waitq debugging turned off this
is true for both 32-bit and 64-bit hosts last time I checked.

Later,
David S. Miller
davem@redhat.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 22:58         ` test13-pre5 Andi Kleen
@ 2000-12-28 22:54           ` David S. Miller
  2000-12-28 23:17             ` test13-pre5 Andi Kleen
  2000-12-28 23:03           ` test13-pre5 Rik van Riel
  1 sibling, 1 reply; 62+ messages in thread
From: David S. Miller @ 2000-12-28 22:54 UTC (permalink / raw)
  To: ak; +Cc: ak, torvalds, marcelo, linux-kernel

   Date: Thu, 28 Dec 2000 23:58:36 +0100
   From: Andi Kleen <ak@suse.de>

   Why exactly a power of two ? To get rid of ->index ? 

To make things like "page - mem_map" et al. use shifts instead of
expensive multiplies...

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 22:33       ` test13-pre5 David S. Miller
@ 2000-12-28 22:58         ` Andi Kleen
  2000-12-28 22:54           ` test13-pre5 David S. Miller
  2000-12-28 23:03           ` test13-pre5 Rik van Riel
  2000-12-29 15:46         ` test13-pre5 Mark Hemment
  1 sibling, 2 replies; 62+ messages in thread
From: Andi Kleen @ 2000-12-28 22:58 UTC (permalink / raw)
  To: David S. Miller; +Cc: ak, torvalds, marcelo, linux-kernel

On Thu, Dec 28, 2000 at 02:33:07PM -0800, David S. Miller wrote:
>    Date: 	Thu, 28 Dec 2000 23:17:22 +0100
>    From: Andi Kleen <ak@suse.de>
> 
>    Would you consider patches for any of these points? 
> 
> To me it seems just as important to make sure struct page is
> a power of 2 in size, with the waitq debugging turned off this
> is true for both 32-bit and 64-bit hosts last time I checked.

Why exactly a power of two ? To get rid of ->index ? 


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 22:58         ` test13-pre5 Andi Kleen
  2000-12-28 22:54           ` test13-pre5 David S. Miller
@ 2000-12-28 23:03           ` Rik van Riel
  1 sibling, 0 replies; 62+ messages in thread
From: Rik van Riel @ 2000-12-28 23:03 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David S. Miller, torvalds, marcelo, linux-kernel

On Thu, 28 Dec 2000, Andi Kleen wrote:
> On Thu, Dec 28, 2000 at 02:33:07PM -0800, David S. Miller wrote:
> >    Date: 	Thu, 28 Dec 2000 23:17:22 +0100
> >    From: Andi Kleen <ak@suse.de>
> > 
> >    Would you consider patches for any of these points? 
> > 
> > To me it seems just as important to make sure struct page is
> > a power of 2 in size, with the waitq debugging turned off this
> > is true for both 32-bit and 64-bit hosts last time I checked.
> 
> Why exactly a power of two ? To get rid of ->index ? 

Most likely to minimise the number of cache misses needed
to access a complete page_struct.

Then again, I guess 48 bytes would _also_ guarantee that
we never need more than 2 cache misses to access every
part of the page_struct.

And the memory wasted in the page_struct may well be a
bigger factor than the cache misses on lots of systems...

(time for another CONFIG option? ;))

regards,

Rik
--
Hollywood goes for world dumbination,
	Trailer at 11.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 23:17             ` test13-pre5 Andi Kleen
@ 2000-12-28 23:14               ` David S. Miller
  2000-12-28 23:39                 ` test13-pre5 Andi Kleen
  2000-12-28 23:25               ` test13-pre5 Rik van Riel
  2000-12-28 23:36               ` test13-pre5 Linus Torvalds
  2 siblings, 1 reply; 62+ messages in thread
From: David S. Miller @ 2000-12-28 23:14 UTC (permalink / raw)
  To: ak; +Cc: ak, torvalds, marcelo, linux-kernel

   Date: Fri, 29 Dec 2000 00:17:21 +0100
   From: Andi Kleen <ak@suse.de>

   On Thu, Dec 28, 2000 at 02:54:52PM -0800, David S. Miller wrote:
   > To make things like "page - mem_map" et al. use shifts instead of
   > expensive multiplies...

   I thought that is what ->index is for ? 

It is for the page cache identity Andi... you know, page_hash(mapping, index)...

And the add/sub/shift expansion of a multiply/divide by constant even
in its' most optimal form is often not trivial, it is something on the
order of 7 instructions with waitq debugging enabled last time I
checked.

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 22:17     ` test13-pre5 Andi Kleen
  2000-12-28 22:33       ` test13-pre5 David S. Miller
@ 2000-12-28 23:15       ` Linus Torvalds
  2000-12-28 23:25         ` test13-pre5 Andi Kleen
  2000-12-30 13:24         ` test13-pre5 Geert Uytterhoeven
  1 sibling, 2 replies; 62+ messages in thread
From: Linus Torvalds @ 2000-12-28 23:15 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Marcelo Tosatti, Kernel Mailing List

On Thu, 28 Dec 2000, Andi Kleen wrote:
> 
> BTW..
> 
> The current 2.4 struct page could be already shortened a lot, saving a lot
> of cache.

Not that much, but some.

> (first number for 32bit, second for 64bit) 
> 
> - Do not compile virtual in when the kernel does not support highmem
> (saves 4/8 bytes) 

Even on UP, "virtual" often helps. The conversion from "struct page" to
the linear address is quite common, and if "struct page" isn't a
power-of-two it gets slow.

> - Instead of having a zone pointer mask use a 8 or 16 byte index into a 
> zone table. On a modern CPU it is much cheaper to do the and/shifts than
> to do even a single cache miss during page aging. On a lot of systems 
> that zone index could be hardcoded to 0 anyways, giving better code.
> - Instead of using 4/8 bytes for the age use only 16bit (FreeBSD which
> has the same swapping algorithm even only uses 8bit) 

This would be good, but can be hard.

FreeBSD doesn't try to be portable any more, but Linux does, and there are
architectures where 8- and 16-bit accesses aren't atomic but have to be
done with read-modify-write cycles.

And even for fields like "age", where we don't care whether the age itself
is 100% accurate, we _do_ care that the fields close-by don't get strange
effects from updating "age". We used to have exactly this problem on alpha
back in the 2.1.x timeframe.

This is why a lot of fields are 32-bit, even though we wouldn't need more
than 8 or 16 bits of them.

> - Remove the waitqueue debugging (obvious @)

Not obvious enough. There are magic things that could be done, like hiding
the wait-queue lock bit in the waitqueue lists themselves etc. That could
be done with some per-architecture magic etc.

> - flags can be __u32 on 64bit hosts, sharing 64bit with something that
> is tolerant to async updates (e.g. the zone table index or the index) 
> - index could be probably u32 instead of unsigned long, saving 4 bytes
> on i386

It already _is_ 32-bit on x86. 

Only the LSF patches made it 64-bit. That never made it into the standard
kernel.

Sure, we could make it "u32" and thus force it to be 32-bit even on 64-bit
architectures, but some day somebody will want to have more than 46 bits
of file mappings, and which 46 bits is _huge_ on a 32-bit machine, on a
64-bit one in 5 years it will not be entirely unreasonable. 

Anyway, I don't want to increase "struct page" in size, but I also don't
think it's worth micro-optimizing some of these things if the code gets
harder to maintain (like the partial-word stuff would be).

The biggest win by far would come from increasing the page-size, something
we can do even in software. Having a "kernel page size" of 8kB even on x86
would basically cut the overhead in half. As that would also improve some
other things (like having better throughput due to bigger contiguous
chunks), that's something I'd like to see some day.

(And user space wouldn't ever have to know - we could map in "half pages"
aka "hardware pages" without mappign the whole page).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 22:54           ` test13-pre5 David S. Miller
@ 2000-12-28 23:17             ` Andi Kleen
  2000-12-28 23:14               ` test13-pre5 David S. Miller
                                 ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Andi Kleen @ 2000-12-28 23:17 UTC (permalink / raw)
  To: David S. Miller; +Cc: ak, torvalds, marcelo, linux-kernel

On Thu, Dec 28, 2000 at 02:54:52PM -0800, David S. Miller wrote:
>    Date: Thu, 28 Dec 2000 23:58:36 +0100
>    From: Andi Kleen <ak@suse.de>
> 
>    Why exactly a power of two ? To get rid of ->index ? 
> 
> To make things like "page - mem_map" et al. use shifts instead of
> expensive multiplies...

I thought that is what ->index is for ? 

Also gcc seems to be already quite clever at dividing through small
integers, e.g. using mul and shift and sub, so it may not be even worth to reach
for a real power-of-two. 

I suspect doing the arithmetics is at least faster than eating the 
cache misses because of ->index. 

-Andikkk


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 23:15       ` test13-pre5 Linus Torvalds
@ 2000-12-28 23:25         ` Andi Kleen
  2000-12-28 23:37           ` test13-pre5 Linus Torvalds
  2000-12-30 13:24         ` test13-pre5 Geert Uytterhoeven
  1 sibling, 1 reply; 62+ messages in thread
From: Andi Kleen @ 2000-12-28 23:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, Marcelo Tosatti, Kernel Mailing List

On Thu, Dec 28, 2000 at 03:15:01PM -0800, Linus Torvalds wrote:
> > (first number for 32bit, second for 64bit) 
> > 
> > - Do not compile virtual in when the kernel does not support highmem
> > (saves 4/8 bytes) 
> 
> Even on UP, "virtual" often helps. The conversion from "struct page" to
> the linear address is quite common, and if "struct page" isn't a
> power-of-two it gets slow.

Are you sure? Last time I checked gcc did a very good job at optimizing
small divisions with small integers, without using div. It just has to 
be a good integer with not too many set bits.

> is 100% accurate, we _do_ care that the fields close-by don't get strange
> effects from updating "age". We used to have exactly this problem on alpha
> back in the 2.1.x timeframe.

When it is shared with a constant field (like zone index) it shouldn't
matter, no ? At worst you can see outdated data, and when the outdated
data is constant all is fine.

> > - flags can be __u32 on 64bit hosts, sharing 64bit with something that
> > is tolerant to async updates (e.g. the zone table index or the index) 
> > - index could be probably u32 instead of unsigned long, saving 4 bytes
> > on i386
> 
> It already _is_ 32-bit on x86. 

Oops. It was a typo. I meant to write "saving 4 bytes on 64bit"

> Anyway, I don't want to increase "struct page" in size, but I also don't
> think it's worth micro-optimizing some of these things if the code gets
> harder to maintain (like the partial-word stuff would be).

Ok pity :-/

Hopefully all the "goto out" micro optimizations can be taken out then too,
I recently found out that gcc 2.97's block moving pass has the tendency
to move the outlined blocks inline again ;) 



-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 23:17             ` test13-pre5 Andi Kleen
  2000-12-28 23:14               ` test13-pre5 David S. Miller
@ 2000-12-28 23:25               ` Rik van Riel
  2000-12-28 23:36               ` test13-pre5 Linus Torvalds
  2 siblings, 0 replies; 62+ messages in thread
From: Rik van Riel @ 2000-12-28 23:25 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David S. Miller, torvalds, marcelo, linux-kernel

On Fri, 29 Dec 2000, Andi Kleen wrote:
> On Thu, Dec 28, 2000 at 02:54:52PM -0800, David S. Miller wrote:
> >    Date: Thu, 28 Dec 2000 23:58:36 +0100
> >    From: Andi Kleen <ak@suse.de>
> > 
> >    Why exactly a power of two ? To get rid of ->index ? 
> > 
> > To make things like "page - mem_map" et al. use shifts instead of
> > expensive multiplies...
> 
> I thought that is what ->index is for ? 

Nope, ->index is there to identify which offset the page
has in ->mapping, read mm/filemap.c::__find_page_nolock()
for more info.

> Also gcc seems to be already quite clever at dividing through
> small integers, e.g. using mul and shift and sub, so it may not
> be even worth to reach for a real power-of-two.
> 
> I suspect doing the arithmetics is at least faster than eating the 
> cache misses because of ->index. 

I'm pretty confident that arithmetic is faster than cache
misses ... but an unlucky size of the page struct will cause
extra cache misses due to misalignment.

regards,

Rik
--
Hollywood goes for world dumbination,
	Trailer at 11.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 23:17             ` test13-pre5 Andi Kleen
  2000-12-28 23:14               ` test13-pre5 David S. Miller
  2000-12-28 23:25               ` test13-pre5 Rik van Riel
@ 2000-12-28 23:36               ` Linus Torvalds
  2 siblings, 0 replies; 62+ messages in thread
From: Linus Torvalds @ 2000-12-28 23:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David S. Miller, marcelo, linux-kernel



On Fri, 29 Dec 2000, Andi Kleen wrote:

> On Thu, Dec 28, 2000 at 02:54:52PM -0800, David S. Miller wrote:
> >    Date: Thu, 28 Dec 2000 23:58:36 +0100
> >    From: Andi Kleen <ak@suse.de>
> > 
> >    Why exactly a power of two ? To get rid of ->index ? 
> > 
> > To make things like "page - mem_map" et al. use shifts instead of
> > expensive multiplies...
> 
> I thought that is what ->index is for ? 

No. "index" only gives the virtual index.

"page - mem_map" is how you get the _physical_ index in the zone in
question, which is common for physical tranlations (ie "pte_page()",
"page_to_virt()" or "page_to_phys()")

> Also gcc seems to be already quite clever at dividing through small
> integers, e.g. using mul and shift and sub, so it may not be even worth to reach
> for a real power-of-two. 

Look at the code - it's a big multiply to do a divide by 68 or similar.
Quite expensive.

Doing "page->address - TASK_SIZE" on x86 for the non-highmem case would
probably be faster.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 23:25         ` test13-pre5 Andi Kleen
@ 2000-12-28 23:37           ` Linus Torvalds
  2000-12-28 23:43             ` test13-pre5 Andi Kleen
  2000-12-29  0:49             ` test13-pre5 Stefan Traby
  0 siblings, 2 replies; 62+ messages in thread
From: Linus Torvalds @ 2000-12-28 23:37 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Marcelo Tosatti, Kernel Mailing List



On Fri, 29 Dec 2000, Andi Kleen wrote:
> 
> Hopefully all the "goto out" micro optimizations can be taken out then too,

"goto out" often generates much more readable code, so the optimization is
secondary.

> I recently found out that gcc 2.97's block moving pass has the tendency
> to move the outlined blocks inline again ;) 

Too bad. Maybe somebody should tell gcc maintainers about programmers that
know more than the compiler again.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 23:14               ` test13-pre5 David S. Miller
@ 2000-12-28 23:39                 ` Andi Kleen
  0 siblings, 0 replies; 62+ messages in thread
From: Andi Kleen @ 2000-12-28 23:39 UTC (permalink / raw)
  To: David S. Miller; +Cc: ak, torvalds, marcelo, linux-kernel

On Thu, Dec 28, 2000 at 03:14:56PM -0800, David S. Miller wrote:
>    Date: Fri, 29 Dec 2000 00:17:21 +0100
>    From: Andi Kleen <ak@suse.de>
> 
>    On Thu, Dec 28, 2000 at 02:54:52PM -0800, David S. Miller wrote:
>    > To make things like "page - mem_map" et al. use shifts instead of
>    > expensive multiplies...
> 
>    I thought that is what ->index is for ? 
> 
> It is for the page cache identity Andi... you know, page_hash(mapping, index)...

Oops, I confused it with the 2.0 page->map_nr, which did exactly that.

I should have known better.  Thanks for correcting this brainfart.

> And the add/sub/shift expansion of a multiply/divide by constant even
> in its' most optimal form is often not trivial, it is something on the
> order of 7 instructions with waitq debugging enabled last time I
> checked.

Wonder if it looks better with wq debugging turned off or a compressed
->zone.


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 23:37           ` test13-pre5 Linus Torvalds
@ 2000-12-28 23:43             ` Andi Kleen
  2000-12-29  0:49             ` test13-pre5 Stefan Traby
  1 sibling, 0 replies; 62+ messages in thread
From: Andi Kleen @ 2000-12-28 23:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, Marcelo Tosatti, Kernel Mailing List

On Thu, Dec 28, 2000 at 03:37:51PM -0800, Linus Torvalds wrote:
> 
> 
> On Fri, 29 Dec 2000, Andi Kleen wrote:
> > 
> > Hopefully all the "goto out" micro optimizations can be taken out then too,
> 
> "goto out" often generates much more readable code, so the optimization is
> secondary.

I was more thinking of cases like the scheduler's gotos, which has gotten
rather spagetti recently. Admittedly classic goto out is often more readable
than many nested if()s with error handling.

> 
> > I recently found out that gcc 2.97's block moving pass has the tendency
> > to move the outlined blocks inline again ;) 
> 
> Too bad. Maybe somebody should tell gcc maintainers about programmers that
> know more than the compiler again.

In x86-64 which relies on 2.97 I'm using __builtin_expect, defined to 
likely() and unlikely(), which seems to generate good code.


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 23:37           ` test13-pre5 Linus Torvalds
  2000-12-28 23:43             ` test13-pre5 Andi Kleen
@ 2000-12-29  0:49             ` Stefan Traby
  2000-12-29  1:23               ` test13-pre5 Linus Torvalds
  1 sibling, 1 reply; 62+ messages in thread
From: Stefan Traby @ 2000-12-29  0:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, Marcelo Tosatti, Kernel Mailing List

On Thu, Dec 28, 2000 at 03:37:51PM -0800, Linus Torvalds wrote:

> Too bad. Maybe somebody should tell gcc maintainers about programmers that
> know more than the compiler again.

I know that {p,}gcc-2.95.2{,.1} are not officially supported.

Did you know that it's impossible to compile nfsv4 because of
register allocation problems with long long since (long long) month ?

The following does not hurt, it's just a fix for a broken
compiler:

--- linux/fs/lockd/xdr4.c.orig  Fri Dec 29 01:35:32 2000
+++ linux/fs/lockd/xdr4.c       Fri Dec 29 01:36:36 2000
@@ -156,7 +156,7 @@
 nlm4_encode_lock(u32 *p, struct nlm_lock *lock)
 {
        struct file_lock        *fl = &lock->fl;
-       __s64                   start, len;
+       volatile __s64                  start, len;

        if (!(p = xdr_encode_string(p, lock->caller))
         || !(p = nlm4_encode_fh(p, &lock->fh))

Here is an example without this patch (pgcc-2.95.2.1 this time which
is bug-compatible to gcc-2.95.2.1).

gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -pipe -mpreferred-stack-boundary=2 -march=i686 -DMODULE   -c -o xdr4.o xdr4.c
xdr4.c: In function `nlm4_encode_lock':
xdr4.c:181: internal error--insn does not satisfy its constraints:
(insn/i 313 585 315 (set (reg:SI 1 %edx)
        (subreg:SI (lshiftrt:DI (reg:DI 0 %eax)
                (const_int 32 [0x20])) 0)) 323 {lshrdi3_const_int_subreg} (nil)
    (nil))
gcc: Internal compiler error: program cpp got fatal signal 13
make[2]: *** [xdr4.o] Error 1
make[2]: Leaving directory `/.localvol000/usr/src/linux-2.4.0-test13pre5/fs/lockd'
make[1]: *** [_modsubdir_lockd] Error 2
make[1]: Leaving directory `/.localvol000/usr/src/linux-2.4.0-test13pre5/fs'
make: *** [_mod_fs] Error 2

The question is: Is it worth to apply ?

-- 

  ciao - 
    Stefan

"                export PS1="((((((((((((rms))))))))))))# "              "

Stefan Traby                Linux/ia32               fax:  +43-3133-6107-9
Mitterlasznitzstr. 13       Linux/alpha            phone:  +43-3133-6107-2
8302 Nestelbach             Linux/sparc       http://www.hello-penguin.com
Austria                                    mailto://st.traby@opengroup.org
Europe                                   mailto://stefan@hello-penguin.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-29  1:23               ` test13-pre5 Linus Torvalds
@ 2000-12-29  1:06                 ` Albert Cranford
  2000-12-29 14:53                   ` test13-pre5 Stefan Traby
  0 siblings, 1 reply; 62+ messages in thread
From: Albert Cranford @ 2000-12-29  1:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stefan Traby, Andi Kleen, Marcelo Tosatti, Kernel Mailing List

Simply executing
         *p++ = htonl(fl->fl_pid);
before 
         start = loff_t_to_s64(fl->fl_start);
also works.
Later,
Albert

Linus Torvalds wrote:
> 
> On Fri, 29 Dec 2000, Stefan Traby wrote:
> > On Thu, Dec 28, 2000 at 03:37:51PM -0800, Linus Torvalds wrote:
> >
> > > Too bad. Maybe somebody should tell gcc maintainers about programmers that
> > > know more than the compiler again.
> >
> > I know that {p,}gcc-2.95.2{,.1} are not officially supported.
> 
> Hmm, I use gcc-2.95.2 myself on some machines, and while I'm not 100%
> comfortable with it, it does count as "supported" even if it has known
> problems with "long long". pgcc isn't.
> 
> > Did you know that it's impossible to compile nfsv4 because of
> > register allocation problems with long long since (long long) month ?
> 
> lockd v4 (for NFS v3), I assume.
> 
> No, I wasn't aware of this particular bug.
> 
> > The following does not hurt, it's just a fix for a broken
> > compiler:
> 
> Ugh, that's ugly.
> 
> Can you test if it is sufficient to just simplify the math a bit, instead
> of uglyfing that function more? The nlm4_encode_lock() function already
> tests for NLM4_OFFSET_MAX explicitly for both start and end, so it should
> be ok to just re-code the function to not do the extra "loff_t_to_s64()"
> stuff, and simplify it enough that the compile rwill be happy to compile
> the simpler function. Something along the lines of
> 
>         if (.. NLM4_OFFSET_MAX tests ..)
>                 ..
> 
>         *p++ = htonl(fl->fl_pid);
> 
>         start = fl->fl_start;
>         len = fl->fl_end - start;
>         if (fl->fl_end == OFFSET_MAX)
>                 len = 0;
> 
>         p = xdr_encode_hyper(p, start);
>         p = xdr_encode_hyper(p, len);
> 
>         return p;
> 
> Where it tries to minimize the liveness of the 64-bit values, and tries to
> avoid extra complications.
> 
>                 Linus
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> Please read the FAQ at http://www.tux.org/lkml/

-- 
Albert Cranford Deerfield Beach FL USA
ac9410@bellsouth.net
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 22:14     ` test13-pre5 Marcelo Tosatti
@ 2000-12-29  1:10       ` Linus Torvalds
  2000-12-29 12:06       ` test13-pre5 Daniel Phillips
  1 sibling, 0 replies; 62+ messages in thread
From: Linus Torvalds @ 2000-12-29  1:10 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Kernel Mailing List

On Thu, 28 Dec 2000, Marcelo Tosatti wrote:
> 
> We also want to move the page to the per-address-space clean list in
> ClearPageDirty I suppose.

I would actually advice against this.

 - it's ok to have too many pages on the dirty list (think o fthe dirty
   list as a "these pages _can_ be dirty")

 - whenever we do a ClearPageDirty() we're likely to remove the page from
   the lists altogether, so it's not worth it doing extra work.

The exception, of course, is the actual "filemap_fdatasync()" function,
but that one would probably look something like

	spin_lock(&page_cache_lock);
	while (!list_empty(&mapping->dirty_pages)) {
		struct page *page = list_entry(mapping->dirty_pages.next, struct page, list);

		list_del(&page->list);
		list_add(&page->list, &mapping->clean_pages);

		if (!PageDirty(page))
			continue;
		page_get(page);
		spin_unlock(&page_cache_lock);

		lock_page(page);
		if (PageDirty(page)) {
			ClearPageDirty(page);
			page->mapping->writepage(page);
		}
		UnlockPage(page);
		page_cache_put(page);
		spin_lock(&page_cache_lock);
	}
	spin_unlock(&page_cache_lock);

and again note how we can move it to the clean list early and we don't
have to keep the PageDirty bit 100% in sync with which list is it on. If
somebody marks it dirty later on (and the dirty bit is still set), that
somebody won't move it back to the dirty list (because it noticved that
the dirty bit is already set), but that's ok: as long as we do the
"ClearPageDirty(page);" call before startign the actual writeout(), we're
fine.

So the "mapping->dirty_pages" list is maybe not so much a _dirty_ list, as
a "scheduled for writeout" list. Marking the page clean doesn't remove it
from that list - it can happily stay on the list and then when the
writeout is started we'd just skip it.

	Ok?

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-29  0:49             ` test13-pre5 Stefan Traby
@ 2000-12-29  1:23               ` Linus Torvalds
  2000-12-29  1:06                 ` test13-pre5 Albert Cranford
  0 siblings, 1 reply; 62+ messages in thread
From: Linus Torvalds @ 2000-12-29  1:23 UTC (permalink / raw)
  To: Stefan Traby; +Cc: Andi Kleen, Marcelo Tosatti, Kernel Mailing List

On Fri, 29 Dec 2000, Stefan Traby wrote:
> On Thu, Dec 28, 2000 at 03:37:51PM -0800, Linus Torvalds wrote:
> 
> > Too bad. Maybe somebody should tell gcc maintainers about programmers that
> > know more than the compiler again.
> 
> I know that {p,}gcc-2.95.2{,.1} are not officially supported.

Hmm, I use gcc-2.95.2 myself on some machines, and while I'm not 100%
comfortable with it, it does count as "supported" even if it has known
problems with "long long". pgcc isn't.

> Did you know that it's impossible to compile nfsv4 because of
> register allocation problems with long long since (long long) month ?

lockd v4 (for NFS v3), I assume. 

No, I wasn't aware of this particular bug. 

> The following does not hurt, it's just a fix for a broken
> compiler:

Ugh, that's ugly.

Can you test if it is sufficient to just simplify the math a bit, instead
of uglyfing that function more? The nlm4_encode_lock() function already
tests for NLM4_OFFSET_MAX explicitly for both start and end, so it should
be ok to just re-code the function to not do the extra "loff_t_to_s64()"
stuff, and simplify it enough that the compile rwill be happy to compile
the simpler function. Something along the lines of

	if (.. NLM4_OFFSET_MAX tests ..)
		..

	*p++ = htonl(fl->fl_pid);

	start = fl->fl_start;
	len = fl->fl_end - start;
	if (fl->fl_end == OFFSET_MAX)
		len = 0;

	p = xdr_encode_hyper(p, start);
	p = xdr_encode_hyper(p, len);

	return p;

Where it tries to minimize the liveness of the 64-bit values, and tries to
avoid extra complications.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 22:14     ` test13-pre5 Marcelo Tosatti
  2000-12-29  1:10       ` test13-pre5 Linus Torvalds
@ 2000-12-29 12:06       ` Daniel Phillips
  1 sibling, 0 replies; 62+ messages in thread
From: Daniel Phillips @ 2000-12-29 12:06 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel

Marcelo Tosatti wrote:
> 
> On Thu, 28 Dec 2000, Linus Torvalds wrote:
> 
> >  - make "SetPageDirty()" do something like
> >
> >       if (!test_and_set(PG_dirty, &page->flags)) {
> >               spin_lock(&page_cache_lock);
> >               list_del(page->list);
> >               list_add(page->list, page->mapping->dirty_pages);
> >               spin_unlock(&page_cache_lock);
> >       }
> 
> We also want to move the page to the per-address-space clean list in
> ClearPageDirty I suppose.

I'd like to suggest taking this opportunity to regularize the notation
by going to set_page_dirty/clear_page_dirty which will call
SetPageDirty/ClearPageDirty.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-29  1:06                 ` test13-pre5 Albert Cranford
@ 2000-12-29 14:53                   ` Stefan Traby
  0 siblings, 0 replies; 62+ messages in thread
From: Stefan Traby @ 2000-12-29 14:53 UTC (permalink / raw)
  To: Albert Cranford
  Cc: Linus Torvalds, Stefan Traby, Andi Kleen, Marcelo Tosatti,
	Kernel Mailing List

On Fri, Dec 29, 2000 at 01:06:30AM +0000, Albert Cranford wrote:
> Simply executing
>          *p++ = htonl(fl->fl_pid);
> before 
>          start = loff_t_to_s64(fl->fl_start);
> also works.

Yes, confirmed.
Since you're located in Florida I vote for this and I hope that
Linus will elect it. :)

--- linux/fs/lockd/xdr4.c.orig	Fri Dec 29 01:35:32 2000
+++ linux/fs/lockd/xdr4.c	Fri Dec 29 14:56:07 2000
@@ -167,13 +167,13 @@
 	 || (fl->fl_end > NLM4_OFFSET_MAX && fl->fl_end != OFFSET_MAX))
 		return NULL;

+	*p++ = htonl(fl->fl_pid);
 	start = loff_t_to_s64(fl->fl_start);
 	if (fl->fl_end == OFFSET_MAX)
 		len = 0;
 	else
 		len = loff_t_to_s64(fl->fl_end - fl->fl_start + 1);

-	*p++ = htonl(fl->fl_pid);
 	p = xdr_encode_hyper(p, start);
 	p = xdr_encode_hyper(p, len);

-- 

  ciao - 
    Stefan

"                export PS1="((((((((((((rms))))))))))))# "              "

Stefan Traby                Linux/ia32               fax:  +43-3133-6107-9
Mitterlasznitzstr. 13       Linux/alpha            phone:  +43-3133-6107-2
8302 Nestelbach             Linux/sparc       http://www.hello-penguin.com
Austria                                    mailto://st.traby@opengroup.org
Europe                                   mailto://stefan@hello-penguin.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 22:33       ` test13-pre5 David S. Miller
  2000-12-28 22:58         ` test13-pre5 Andi Kleen
@ 2000-12-29 15:46         ` Mark Hemment
  2000-12-29 16:30           ` test13-pre5 Tim Wright
  2000-12-29 21:23           ` test13-pre5 David S. Miller
  1 sibling, 2 replies; 62+ messages in thread
From: Mark Hemment @ 2000-12-29 15:46 UTC (permalink / raw)
  To: David S. Miller; +Cc: ak, torvalds, marcelo, linux-kernel

Hi,

On Thu, 28 Dec 2000, David S. Miller wrote:
>    Date: 	Thu, 28 Dec 2000 23:17:22 +0100
>    From: Andi Kleen <ak@suse.de>
> 
>    Would you consider patches for any of these points? 
> 
> To me it seems just as important to make sure struct page is
> a power of 2 in size, with the waitq debugging turned off this
> is true for both 32-bit and 64-bit hosts last time I checked.

  Checking test11 (which I'm running here), even with waitq debugging
turned off, on 32-bit (IA32) the struct page is 68bytes (since
the "age" member was re-introduced a while back).

  For my development testing, I'm running a _heavily_ hacked kernel.  One
of these hacks is to pull the wait_queue_head out of struct page; the
waitq-heads are in a separate allocated area of memory, with a waitq-head
pointer embedded in the page structure (allocated/initialised in
free_area_init_core()).  This gives a page structure of 60bytes, giving me
one free double-word to play with (which I'm using as a pointer to a
release function).

  Infact, there doesn't need to be a waitq-head allocated for each page
structure - they can share; with a performance overhead on a false
wakeup in __wait_on_page().
  Note, for those of us running on 32bit with lots of physical memory, the
available virtual address-space is of major consideration.  Reducing the
size of the page structure is more than just reducing cache misses - it
gives us more virtual to play with...

Mark

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-29 15:46         ` test13-pre5 Mark Hemment
@ 2000-12-29 16:30           ` Tim Wright
  2000-12-29 17:54             ` test13-pre5 Mark Hemment
  2000-12-29 21:23           ` test13-pre5 David S. Miller
  1 sibling, 1 reply; 62+ messages in thread
From: Tim Wright @ 2000-12-29 16:30 UTC (permalink / raw)
  To: Mark Hemment; +Cc: David S. Miller, ak, torvalds, marcelo, linux-kernel

On Fri, Dec 29, 2000 at 03:46:22PM +0000, Mark Hemment wrote:
>   Note, for those of us running on 32bit with lots of physical memory, the
> available virtual address-space is of major consideration.  Reducing the
> size of the page structure is more than just reducing cache misses - it
> gives us more virtual to play with...
> 
> Mark
> 

Yes, this is a very important point if we ever want to make serious use
of large memory machines on ia32. We ran into this with DYNIX/ptx when the
P6 added 36-bit physical addressing. Conserving KVA (kernel virtual address
space), became a very high priority. Eventually, we had to add code to play
silly segment games and "magically" materialize and dematerialize a 4GB
kernel virtual address space instead of the 1GB. This only comes into play
with really large amounts of memory, and is almost certainly not worth the
agony of implementation on Linux, but we'll need to be careful elsewhere to
conserve it as much as possible.

Regards,

Tim

-- 
Tim Wright - timw@splhi.com or timw@aracnet.com or twright@us.ibm.com
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH] filemap_fdatasync & related changes
  2000-12-28 20:59   ` test13-pre5 Linus Torvalds
                       ` (3 preceding siblings ...)
  2000-12-28 22:17     ` test13-pre5 Andi Kleen
@ 2000-12-29 16:49     ` Marcelo Tosatti
  2000-12-29 21:13       ` Linus Torvalds
  2001-01-03 16:37       ` Chris Mason
  4 siblings, 2 replies; 62+ messages in thread
From: Marcelo Tosatti @ 2000-12-29 16:49 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro; +Cc: Kernel Mailing List


On Thu, 28 Dec 2000, Linus Torvalds wrote:

> All done. It looks something like 5-10 places, most of which are about 10
> lines of diff each, if even that.
> 
> The only real worry would be that the locking isn't rigth, but getting the
> pagemap lock should be the safe thing, and from a lock contention
> standpoint it should be ok (if we move a lot of pages back and forth, lock
> contention is going to be the least of our worries, because it implies
> that we'd be doing a LOT of IO to actually write the pages out).
> 
> If somebody (you? hint, hint) wants to do this, I'd be very happy - I can
> do it myself, but because it's my birthday I'm supposed to drag myself off
> the computer soon and be social, or Tove will be grumpy.

Ok, here it is.

I hope I got all locking and all special cases right.

Comments ?


diff -Nur --exclude-from=/home/marcelo/exclude linux.orig/fs/buffer.c linux/fs/buffer.c
--- linux.orig/fs/buffer.c	Fri Dec 29 15:30:52 2000
+++ linux/fs/buffer.c	Fri Dec 29 16:21:17 2000
@@ -320,6 +320,46 @@
 	return 0;
 }
 
+/**
+ *      filemap_fdatasync - walk the list of dirty pages of the given address space
+ *     	and writepage() all of them.
+ * 
+ *      @mapping: address space structure to write
+ *
+ */
+void filemap_fdatasync(struct address_space * mapping)
+{
+	int (*writepage)(struct page *) = mapping->a_ops->writepage;
+
+	spin_lock(&pagecache_lock);
+
+        while (!list_empty(&mapping->dirty_pages)) {
+		struct page *page = list_entry(mapping->dirty_pages.next, 
+				struct page, list);
+
+		list_del(&page->list);
+		list_add(&page->list, &mapping->clean_pages);
+
+		if (!PageDirty(page))
+			continue;
+
+		page_cache_get(page);
+		spin_unlock(&pagecache_lock);
+
+		lock_page(page);
+
+		if (PageDirty(page)) {
+			ClearPageDirty(page);
+			writepage(page);
+		}
+
+		UnlockPage(page);
+		page_cache_release(page);
+		spin_lock(&pagecache_lock);
+	}
+	spin_unlock(&pagecache_lock);
+}
+
 /*
  *	filp may be NULL if called via the msync of a vma.
  */
@@ -367,6 +407,8 @@
 	if (!file->f_op || !file->f_op->fsync)
 		goto out_putf;
 
+	filemap_fdatasync(inode->i_mapping);
+
 	/* We need to protect against concurrent writers.. */
 	down(&inode->i_sem);
 	err = file->f_op->fsync(file, dentry, 0);
@@ -396,6 +438,8 @@
 	err = -EINVAL;
 	if (!file->f_op || !file->f_op->fsync)
 		goto out_putf;
+
+	filemap_fdatasync(inode->i_mapping);
 
 	down(&inode->i_sem);
 	err = file->f_op->fsync(file, dentry, 1);
diff -Nur --exclude-from=/home/marcelo/exclude linux.orig/fs/inode.c linux/fs/inode.c
--- linux.orig/fs/inode.c	Fri Dec  1 05:50:27 2000
+++ linux/fs/inode.c	Fri Dec 29 16:14:09 2000
@@ -100,7 +100,8 @@
 		memset(inode, 0, sizeof(*inode));
 		init_waitqueue_head(&inode->i_wait);
 		INIT_LIST_HEAD(&inode->i_hash);
-		INIT_LIST_HEAD(&inode->i_data.pages);
+		INIT_LIST_HEAD(&inode->i_data.clean_pages);
+		INIT_LIST_HEAD(&inode->i_data.dirty_pages);
 		INIT_LIST_HEAD(&inode->i_dentry);
 		INIT_LIST_HEAD(&inode->i_dirty_buffers);
 		sema_init(&inode->i_sem, 1);
@@ -206,6 +207,8 @@
 		inode->i_state |= I_LOCK;
 		inode->i_state &= ~I_DIRTY;
 		spin_unlock(&inode_lock);
+
+		filemap_fdatasync(inode->i_mapping);
 
 		write_inode(inode, sync);
 
diff -Nur --exclude-from=/home/marcelo/exclude linux.orig/mm/filemap.c linux/mm/filemap.c
--- linux.orig/mm/filemap.c	Fri Dec 29 15:30:55 2000
+++ linux/mm/filemap.c	Fri Dec 29 16:18:01 2000
@@ -71,7 +71,7 @@
 
 static inline void add_page_to_inode_queue(struct address_space *mapping, struct page * page)
 {
-	struct list_head *head = &mapping->pages;
+	struct list_head *head = &mapping->clean_pages;
 
 	mapping->nrpages++;
 	list_add(&page->list, head);
@@ -144,7 +144,7 @@
 	struct list_head *head, *curr;
 	struct page * page;
 
-	head = &inode->i_mapping->pages;
+	head = &inode->i_mapping->clean_pages;
 
 	spin_lock(&pagecache_lock);
 	spin_lock(&pagemap_lru_lock);
@@ -204,26 +204,12 @@
 	page_cache_release(page);
 }
 
-/**
- * truncate_inode_pages - truncate *all* the pages from an offset
- * @mapping: mapping to truncate
- * @lstart: offset from with to truncate
- *
- * Truncate the page cache at a set offset, removing the pages
- * that are beyond that offset (and zeroing out partial pages).
- * If any page is locked we wait for it to become unlocked.
- */
-void truncate_inode_pages(struct address_space * mapping, loff_t lstart)
+void truncate_list_pages(struct list_head *head, unsigned long start, unsigned partial)
 {
-	struct list_head *head, *curr;
+	struct list_head *curr;
 	struct page * page;
-	unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
-	unsigned long start;
-
-	start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
 repeat:
-	head = &mapping->pages;
 	spin_lock(&pagecache_lock);
 	curr = head->next;
 	while (curr != head) {
@@ -267,6 +253,25 @@
 	spin_unlock(&pagecache_lock);
 }
 
+
+/**
+ * truncate_inode_pages - truncate *all* the pages from an offset
+ * @mapping: mapping to truncate
+ * @lstart: offset from with to truncate
+ *
+ * Truncate the page cache at a set offset, removing the pages
+ * that are beyond that offset (and zeroing out partial pages).
+ * If any page is locked we wait for it to become unlocked.
+ */
+void truncate_inode_pages(struct address_space * mapping, loff_t lstart) 
+{
+	unsigned long start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
+
+	truncate_list_pages(&mapping->clean_pages, start, partial);
+	truncate_list_pages(&mapping->dirty_pages, start, partial);
+}
+
 static inline struct page * __find_page_nolock(struct address_space *mapping, unsigned long offset, struct page *page)
 {
 	goto inside;
@@ -328,14 +333,12 @@
 	return error;
 }
 
-static int do_buffer_fdatasync(struct inode *inode, unsigned long start, unsigned long end, int (*fn)(struct page *))
+static int do_buffer_fdatasync(struct list_head *head, unsigned long start, unsigned long end, int (*fn)(struct page *))
 {
-	struct list_head *head, *curr;
+	struct list_head *curr;
 	struct page *page;
 	int retval = 0;
 
-	head = &inode->i_mapping->pages;
-
 	spin_lock(&pagecache_lock);
 	curr = head->next;
 	while (curr != head) {
@@ -374,8 +377,18 @@
 {
 	int retval;
 
-	retval = do_buffer_fdatasync(inode, start_idx, end_idx, writeout_one_page);
-	retval |= do_buffer_fdatasync(inode, start_idx, end_idx, waitfor_one_page);
+	/* writeout dirty buffers on pages from both clean and dirty lists */
+	retval = do_buffer_fdatasync(&inode->i_mapping->dirty_pages, start_idx, 
+			end_idx, writeout_one_page);
+	retval |= do_buffer_fdatasync(&inode->i_mapping->clean_pages, start_idx, 
+			end_idx, writeout_one_page);
+
+	/* now wait for locked buffers on pages from both clean and dirty lists */
+	retval |= do_buffer_fdatasync(&inode->i_mapping->dirty_pages, start_idx, 
+			end_idx, writeout_one_page);
+	retval |= do_buffer_fdatasync(&inode->i_mapping->clean_pages, start_idx, 
+			end_idx, waitfor_one_page);
+
 	return retval;
 }
 
diff -Nur --exclude-from=/home/marcelo/exclude linux.orig/mm/memory.c linux/mm/memory.c
--- linux.orig/mm/memory.c	Mon Dec 11 19:45:42 2000
+++ linux/mm/memory.c	Fri Dec 29 16:02:23 2000
@@ -259,22 +259,22 @@
 /*
  * Return indicates whether a page was freed so caller can adjust rss
  */
-static inline int free_pte(pte_t page)
+static inline int free_pte(pte_t pte)
 {
-	if (pte_present(page)) {
-		struct page *ptpage = pte_page(page);
-		if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage))
+	if (pte_present(pte)) {
+		struct page *page = pte_page(pte);
+		if ((!VALID_PAGE(page)) || PageReserved(page))
 			return 0;
 		/* 
 		 * free_page() used to be able to clear swap cache
 		 * entries.  We may now have to do it manually.  
 		 */
-		if (pte_dirty(page))
-			SetPageDirty(ptpage);
-		free_page_and_swap_cache(ptpage);
+		if (pte_dirty(pte) && page->mapping)
+			SetPageDirty(page);
+		free_page_and_swap_cache(page);
 		return 1;
 	}
-	swap_free(pte_to_swp_entry(page));
+	swap_free(pte_to_swp_entry(pte));
 	return 0;
 }
 
diff -Nur --exclude-from=/home/marcelo/exclude linux.orig/mm/swap_state.c linux/mm/swap_state.c
--- linux.orig/mm/swap_state.c	Fri Dec 29 15:30:55 2000
+++ linux/mm/swap_state.c	Fri Dec 29 16:10:15 2000
@@ -29,9 +29,13 @@
 };
 
 struct address_space swapper_space = {
-	{				/* pages	*/
-		&swapper_space.pages,	/*        .next */
-		&swapper_space.pages	/*	  .prev */
+	{				/* clean_pages	*/
+		&swapper_space.clean_pages,	/*        .next */
+		&swapper_space.clean_pages	/*	  .prev */
+	},
+	{				/* dirty_pages	*/
+		&swapper_space.dirty_pages,	/*        .next */
+		&swapper_space.dirty_pages	/*	  .prev */
 	},
 	0,				/* nrpages	*/
 	&swap_aops,
diff -Nur --exclude-from=/home/marcelo/exclude linux.orig/include/linux/fs.h linux/include/linux/fs.h
--- linux.orig/include/linux/fs.h	Fri Dec 29 15:30:54 2000
+++ linux/include/linux/fs.h	Fri Dec 29 15:53:08 2000
@@ -363,8 +363,9 @@
 };
 
 struct address_space {
-	struct list_head	pages;		/* list of pages */
-	unsigned long		nrpages;	/* number of pages */
+	struct list_head	clean_pages;	/* list of clean pages */
+	struct list_head	dirty_pages;	/* list of dirty pages */
+	unsigned long		nrpages;	/* number of total pages */
 	struct address_space_operations *a_ops;	/* methods */
 	void			*host;		/* owner: inode, block_device */
 	struct vm_area_struct	*i_mmap;	/* list of private mappings */
@@ -1064,6 +1065,7 @@
 extern int fsync_inode_buffers(struct inode *);
 extern int osync_inode_buffers(struct inode *);
 extern int inode_has_buffers(struct inode *);
+extern void filemap_fdatasync(struct address_space *);
 extern void sync_supers(kdev_t);
 extern int bmap(struct inode *, int);
 extern int notify_change(struct dentry *, struct iattr *);
diff -Nur --exclude-from=/home/marcelo/exclude linux.orig/include/linux/mm.h linux/include/linux/mm.h
--- linux.orig/include/linux/mm.h	Fri Dec 29 15:30:54 2000
+++ linux/include/linux/mm.h	Fri Dec 29 16:07:39 2000
@@ -181,7 +181,14 @@
 #define SetPageUptodate(page)	set_bit(PG_uptodate, &(page)->flags)
 #define ClearPageUptodate(page)	clear_bit(PG_uptodate, &(page)->flags)
 #define PageDirty(page)		test_bit(PG_dirty, &(page)->flags)
-#define SetPageDirty(page)	set_bit(PG_dirty, &(page)->flags)
+#define SetPageDirty(page) \
+	if (!test_and_set_bit(PG_dirty, &page->flags)) { \
+		spin_lock(&pagecache_lock); \
+		list_del(&page->list); \
+		list_add(&page->list, &page->mapping->dirty_pages); \
+		spin_unlock(&pagecache_lock); \
+	}
+
 #define ClearPageDirty(page)	clear_bit(PG_dirty, &(page)->flags)
 #define PageLocked(page)	test_bit(PG_locked, &(page)->flags)
 #define LockPage(page)		set_bit(PG_locked, &(page)->flags)


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-29 16:30           ` test13-pre5 Tim Wright
@ 2000-12-29 17:54             ` Mark Hemment
  0 siblings, 0 replies; 62+ messages in thread
From: Mark Hemment @ 2000-12-29 17:54 UTC (permalink / raw)
  To: Tim Wright; +Cc: linux-kernel

On Fri, 29 Dec 2000, Tim Wright wrote:
> Yes, this is a very important point if we ever want to make serious use
> of large memory machines on ia32. We ran into this with DYNIX/ptx when the
> P6 added 36-bit physical addressing. Conserving KVA (kernel virtual address
> space), became a very high priority. Eventually, we had to add code to play
> silly segment games and "magically" materialize and dematerialize a 4GB
> kernel virtual address space instead of the 1GB. This only comes into play
> with really large amounts of memory, and is almost certainly not worth the
> agony of implementation on Linux, but we'll need to be careful elsewhere to
> conserve it as much as possible.

  Indeed.  I'm compiling my kernels with 2GB virtual.  Not as I want more
NORMAL pages in the page cache (HIGH memory is fine), but as I need
NORMAL pages for kernel data/structures (memory allocated from  
slab-caches) which need to be constantly mapped in.

Mark


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 20:25 test13-pre5 Linus Torvalds
  2000-12-28 18:39 ` test13-pre5 Marcelo Tosatti
  2000-12-28 21:27 ` test13-pre5 (via82cxxx_audio.c) Jonathan Hudson
@ 2000-12-29 18:17 ` Tom Rini
  2 siblings, 0 replies; 62+ messages in thread
From: Tom Rini @ 2000-12-29 18:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

On Thu, Dec 28, 2000 at 12:25:23PM -0800, Linus Torvalds wrote:

>  - pre5:
>    - NIIBE Yutaka: SuperH update
>    - Geert Uytterhoeven: m68k update
>    - David Miller: TCP RTO calc fix, UDP multicast fix etc
>    - Duncan Laurie: ServerWorks PIRQ routing definition.
>    - mm PageDirty cleanups, added sanity checks, and don't lose the bit. 

I just noticed this (playing with some other stuff), but ext2 as a module
is currently broken:
$ make INSTALL_MOD_PATH=/tmp/foo modules_install
...
if [ -r System.map ]; then /sbin/depmod -ae -F System.map -b /tmp/foo -r 2.4.0-test12; fi
depmod: *** Unresolved symbols in /tmp/foo/lib/modules/2.4.0-test12/kernel/fs/ext2/ext2.o
depmod:         buffer_insert_inode_queue
depmod:         fsync_inode_buffers

I tried the following locally and it fixes it.
---<cut>---
--- fs/Makefile.orig	Fri Dec 29 10:35:50 2000
+++ fs/Makefile	Fri Dec 29 10:36:06 2000
@@ -7,7 +7,7 @@
 
 O_TARGET := fs.o
 
-export-objs :=	filesystems.o
+export-objs :=	filesystems.o buffer.o
 mod-subdirs :=	nls
 
 obj-y :=	open.o read_write.o devices.o file_table.o buffer.o \
--- fs/buffer.c.orig	Fri Dec 29 10:33:21 2000
+++ fs/buffer.c	Fri Dec 29 10:35:46 2000
@@ -29,6 +29,7 @@
 /* async buffer flushing, 1999 Andrea Arcangeli <andrea@suse.de> */
 
 #include <linux/config.h>
+#include <linux/module.h>
 #include <linux/sched.h>
 #include <linux/fs.h>
 #include <linux/malloc.h>
@@ -579,6 +580,8 @@
 	spin_unlock(&lru_list_lock);
 }
 
+EXPORT_SYMBOL(buffer_insert_inode_queue);
+
 /* The caller must have the lru_list lock before calling the 
    remove_inode_queue functions.  */
 static void __remove_inode_queue(struct buffer_head *bh)
@@ -900,6 +903,7 @@
 		return err2;
 }
 
+EXPORT_SYMBOL(fsync_inode_buffers);
 
 /*
  * osync is designed to support O_SYNC io.  It waits synchronously for
---<end>---

-- 
Tom Rini (TR1265)
http://gate.crashing.org/~trini/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] filemap_fdatasync & related changes
  2000-12-29 16:49     ` [PATCH] filemap_fdatasync & related changes Marcelo Tosatti
@ 2000-12-29 21:13       ` Linus Torvalds
  2001-01-03 16:37       ` Chris Mason
  1 sibling, 0 replies; 62+ messages in thread
From: Linus Torvalds @ 2000-12-29 21:13 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.21.0012291446560.13194-100000@freak.distro.conectiva>,
Marcelo Tosatti  <marcelo@conectiva.com.br> wrote:
>
>Ok, here it is.
>
>I hope I got all locking and all special cases right.
>
>Comments ?

Looks good.

There's a few things this misses, the worst of which were all my bugs in
the original description.  Things like "don't unlock the page after
calling writepage, becasue writepage will do it on its own".  Details
like that. 

The worst problem actually ends up being the fact that the global sync()
doesn't work, after all, because if it's associated with syncing the
inodes it will only sync the dirty page list for inodes that are dirty. 
Now, this probably doesn't show up in testing, because mostly if you
have dirty pages the inode too _will_ be dirty, but that only makes the
bug more insidious. 

So I'll work on this a bit to polish up these problems, but on the whole
this is all fine.

		Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-29 15:46         ` test13-pre5 Mark Hemment
  2000-12-29 16:30           ` test13-pre5 Tim Wright
@ 2000-12-29 21:23           ` David S. Miller
  2000-12-29 21:51             ` test13-pre5 Linus Torvalds
  1 sibling, 1 reply; 62+ messages in thread
From: David S. Miller @ 2000-12-29 21:23 UTC (permalink / raw)
  To: markhe; +Cc: ak, torvalds, marcelo, linux-kernel

   Date: Fri, 29 Dec 2000 15:46:22 +0000 (GMT)
   From: Mark Hemment <markhe@veritas.com>

     For my development testing, I'm running a _heavily_ hacked
   kernel.  One of these hacks is to pull the wait_queue_head out of
   struct page; the waitq-heads are in a separate allocated area of
   memory, with a waitq-head pointer embedded in the page structure
   (allocated/initialised in free_area_init_core()).  This gives a
   page structure of 60bytes, giving me one free double-word to play
   with (which I'm using as a pointer to a release function).

Not something like those damn Solaris turnstiles, no please....

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-29 21:23           ` test13-pre5 David S. Miller
@ 2000-12-29 21:51             ` Linus Torvalds
  0 siblings, 0 replies; 62+ messages in thread
From: Linus Torvalds @ 2000-12-29 21:51 UTC (permalink / raw)
  To: David S. Miller; +Cc: markhe, ak, marcelo, linux-kernel



On Fri, 29 Dec 2000, David S. Miller wrote:
> 
>      For my development testing, I'm running a _heavily_ hacked
>    kernel.  One of these hacks is to pull the wait_queue_head out of
>    struct page; the waitq-heads are in a separate allocated area of
>    memory, with a waitq-head pointer embedded in the page structure
>    (allocated/initialised in free_area_init_core()).  This gives a
>    page structure of 60bytes, giving me one free double-word to play
>    with (which I'm using as a pointer to a release function).
> 
> Not something like those damn Solaris turnstiles, no please....

If you want to have a release function, please just use "page->mapping",
which gives you much more, including memory pressure indicators etc. Now
_that_ can be useful for doing things like slab caches.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-28 23:15       ` test13-pre5 Linus Torvalds
  2000-12-28 23:25         ` test13-pre5 Andi Kleen
@ 2000-12-30 13:24         ` Geert Uytterhoeven
  2000-12-31 17:21           ` test13-pre5 Andi Kleen
  2001-01-01 19:58           ` test13-pre5 H. Peter Anvin
  1 sibling, 2 replies; 62+ messages in thread
From: Geert Uytterhoeven @ 2000-12-30 13:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, Marcelo Tosatti, Kernel Mailing List

On Thu, 28 Dec 2000, Linus Torvalds wrote:
> On Thu, 28 Dec 2000, Andi Kleen wrote:
> > - Instead of having a zone pointer mask use a 8 or 16 byte index into a 
> > zone table. On a modern CPU it is much cheaper to do the and/shifts than
> > to do even a single cache miss during page aging. On a lot of systems 
> > that zone index could be hardcoded to 0 anyways, giving better code.
> > - Instead of using 4/8 bytes for the age use only 16bit (FreeBSD which
> > has the same swapping algorithm even only uses 8bit) 
> 
> This would be good, but can be hard.
> 
> FreeBSD doesn't try to be portable any more, but Linux does, and there are
> architectures where 8- and 16-bit accesses aren't atomic but have to be
> done with read-modify-write cycles.
> 
> And even for fields like "age", where we don't care whether the age itself
> is 100% accurate, we _do_ care that the fields close-by don't get strange
> effects from updating "age". We used to have exactly this problem on alpha
> back in the 2.1.x timeframe.
> 
> This is why a lot of fields are 32-bit, even though we wouldn't need more
> than 8 or 16 bits of them.

What about defining new types for this? Like e.g. `x8', being `u8' on platforms
were that's OK, and `u32' on platforms where that's more efficient?

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-30 13:24         ` test13-pre5 Geert Uytterhoeven
@ 2000-12-31 17:21           ` Andi Kleen
  2000-12-31 17:27             ` test13-pre5 Linus Torvalds
  2001-01-01 19:58           ` test13-pre5 H. Peter Anvin
  1 sibling, 1 reply; 62+ messages in thread
From: Andi Kleen @ 2000-12-31 17:21 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Linus Torvalds, Andi Kleen, Marcelo Tosatti, Kernel Mailing List

On Sat, Dec 30, 2000 at 02:24:06PM +0100, Geert Uytterhoeven wrote:
> On Thu, 28 Dec 2000, Linus Torvalds wrote:
> > On Thu, 28 Dec 2000, Andi Kleen wrote:
> > > - Instead of having a zone pointer mask use a 8 or 16 byte index into a 
> > > zone table. On a modern CPU it is much cheaper to do the and/shifts than
> > > to do even a single cache miss during page aging. On a lot of systems 
> > > that zone index could be hardcoded to 0 anyways, giving better code.
> > > - Instead of using 4/8 bytes for the age use only 16bit (FreeBSD which
> > > has the same swapping algorithm even only uses 8bit) 
> > 
> > This would be good, but can be hard.
> > 
> > FreeBSD doesn't try to be portable any more, but Linux does, and there are
> > architectures where 8- and 16-bit accesses aren't atomic but have to be
> > done with read-modify-write cycles.
> > 
> > And even for fields like "age", where we don't care whether the age itself
> > is 100% accurate, we _do_ care that the fields close-by don't get strange
> > effects from updating "age". We used to have exactly this problem on alpha
> > back in the 2.1.x timeframe.
> > 
> > This is why a lot of fields are 32-bit, even though we wouldn't need more
> > than 8 or 16 bits of them.
> 
> What about defining new types for this? Like e.g. `x8', being `u8' on platforms
> were that's OK, and `u32' on platforms where that's more efficient?

Sounds good. It could also be controlled by a CONFIG_SPACE_EFFICIENT for
embedded systems, where you could trade a bit of CPU for less memory overhead 
even on systems where u8 is slow and atomicity doesn't come into play
because it's UP anyways. 

Only problem I see is that when the programmer was wrong about the possible
range (which sometimes happens) then it could mysteriously work on some
machines and fail on others. This is already the case e.g. with atomic_t,
which is shorter than 32bit e.g. on sparc32, so it is probably not a too
big problem. 


-Andi

/* asm/types.h for a random 32bit machine with no byte access */ 
#if defined(CONFIG_SPACE_EFFICIENT) && !defined(CONFIG_SMP)
typedef __u8 x8; 
typedef __u16 x16; 
typedef __u32 x32; 
#else
typedef __u32 x8; 
typedef __u32 x16; 
typedef __u32 x32; 
#endif

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-31 17:21           ` test13-pre5 Andi Kleen
@ 2000-12-31 17:27             ` Linus Torvalds
  2000-12-31 17:36               ` test13-pre5 Andi Kleen
                                 ` (3 more replies)
  0 siblings, 4 replies; 62+ messages in thread
From: Linus Torvalds @ 2000-12-31 17:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Geert Uytterhoeven, Marcelo Tosatti, Kernel Mailing List



On Sun, 31 Dec 2000, Andi Kleen wrote:
> 
> Sounds good. It could also be controlled by a CONFIG_SPACE_EFFICIENT for
> embedded systems, where you could trade a bit of CPU for less memory overhead 
> even on systems where u8 is slow and atomicity doesn't come into play
> because it's UP anyways. 

UP has nothing to do with it.

The alpha systems I remember this problem on were all SMP.

Imagine an architecture where you need to do a

	load_32()
	mask-and-insert-byte
	store_32()

and imagine that an interrupt comes in:

	load_32()
	mask-and-insert-byte

			* INTERRUPT *

			load_32()
			mask-and-insert-ANOTHER-byte
			store_32()

			interrupt return

	store_32()

and notice how the value written by the interrupt is gone, gone, gone,
even though it was to a completely different byte.

Now, imagine that the first byte is the "age", and imagine that the thing
the interrupt tries to update is "flags".

Yes, you're screwed.

I don't think it's a good diea.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-31 17:27             ` test13-pre5 Linus Torvalds
@ 2000-12-31 17:36               ` Andi Kleen
  2000-12-31 18:19                 ` test13-pre5 Andrea Arcangeli
  2000-12-31 18:06               ` test13-pre5 Andi Kleen
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 62+ messages in thread
From: Andi Kleen @ 2000-12-31 17:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Geert Uytterhoeven, Marcelo Tosatti,
	Kernel Mailing List

On Sun, Dec 31, 2000 at 09:27:23AM -0800, Linus Torvalds wrote:
> 
> 
> On Sun, 31 Dec 2000, Andi Kleen wrote:
> > 
> > Sounds good. It could also be controlled by a CONFIG_SPACE_EFFICIENT for
> > embedded systems, where you could trade a bit of CPU for less memory overhead 
> > even on systems where u8 is slow and atomicity doesn't come into play
> > because it's UP anyways. 
> 
> UP has nothing to do with it.
> 
> The alpha systems I remember this problem on were all SMP.

AFAIK alpha has byte instructions now.


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-31 17:27             ` test13-pre5 Linus Torvalds
  2000-12-31 17:36               ` test13-pre5 Andi Kleen
@ 2000-12-31 18:06               ` Andi Kleen
  2000-12-31 18:07               ` test13-pre5 Matti Aarnio
  2000-12-31 18:10               ` test13-pre5 Andrea Arcangeli
  3 siblings, 0 replies; 62+ messages in thread
From: Andi Kleen @ 2000-12-31 18:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Geert Uytterhoeven, Marcelo Tosatti,
	Kernel Mailing List

On Sun, Dec 31, 2000 at 09:27:23AM -0800, Linus Torvalds wrote:
> 
> 
> On Sun, 31 Dec 2000, Andi Kleen wrote:
> > 
> > Sounds good. It could also be controlled by a CONFIG_SPACE_EFFICIENT for
> > embedded systems, where you could trade a bit of CPU for less memory overhead 
> > even on systems where u8 is slow and atomicity doesn't come into play
> > because it's UP anyways. 
> 
> UP has nothing to do with it.
> 
> The alpha systems I remember this problem on were all SMP.
[...]

I just checked all architecture manuals I could lay my hands on
(sparcv9, ppc32, mips r4400, parisc 1.1, alpha, sh is somewhere in
storage but as I remember it has it too) 
and they all seem to have at least store byte and mostly store
half words instructions. 

> 
> Imagine an architecture where you need to do a
> 
> 	load_32() 
> 	mask-and-insert-byte
> 	store_32()

iirc the Alpha guys found out that they couldn't drive half of the
available devices without byte store, and since then nobody has 
repeated that mistake @)


> 
> I don't think it's a good diea.

I don't see it. Just define x8 to u32 on old alpha and let most other architectures
be happy. 


-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-31 17:27             ` test13-pre5 Linus Torvalds
  2000-12-31 17:36               ` test13-pre5 Andi Kleen
  2000-12-31 18:06               ` test13-pre5 Andi Kleen
@ 2000-12-31 18:07               ` Matti Aarnio
  2000-12-31 19:15                 ` test13-pre5 Linus Torvalds
  2000-12-31 18:10               ` test13-pre5 Andrea Arcangeli
  3 siblings, 1 reply; 62+ messages in thread
From: Matti Aarnio @ 2000-12-31 18:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Geert Uytterhoeven, Marcelo Tosatti,
	Kernel Mailing List

On Sun, Dec 31, 2000 at 09:27:23AM -0800, Linus Torvalds wrote:
> On Sun, 31 Dec 2000, Andi Kleen wrote:
> > 
> > Sounds good. It could also be controlled by a CONFIG_SPACE_EFFICIENT for
> > embedded systems, where you could trade a bit of CPU for less memory overhead 
> > even on systems where u8 is slow and atomicity doesn't come into play
> > because it's UP anyways. 
> 
> UP has nothing to do with it.
> The alpha systems I remember this problem on were all SMP.

	Actually nothing SMP specific in that problem sphere.
	Alpha has  load-locked/store-conditional  pair for
	this type of memory accesses to automatically detect,
	and (conditionally) restart the operation - to form
	classical  ``locked-read-modify-write'' operations.

	In what situations the compiler will use those instructions,
	that I don't know.   Volatiles, very least, use them.
	Will closely packed bytes be processed with it without
	them being volatiles ?  How about bitfields ?

	Newer Alphas have byte/short load/store instructions,
	so things really aren't that straight-forward...

....
> I don't think it's a good diea.
> 
> 		Linus

/Matti Aarnio
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-31 17:27             ` test13-pre5 Linus Torvalds
                                 ` (2 preceding siblings ...)
  2000-12-31 18:07               ` test13-pre5 Matti Aarnio
@ 2000-12-31 18:10               ` Andrea Arcangeli
  3 siblings, 0 replies; 62+ messages in thread
From: Andrea Arcangeli @ 2000-12-31 18:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Geert Uytterhoeven, Marcelo Tosatti,
	Kernel Mailing List

On Sun, Dec 31, 2000 at 09:27:23AM -0800, Linus Torvalds wrote:
> The alpha systems I remember this problem on were all [..]

Yes the granularity issue has nothing to do with SMP (with preemptive kernel
it can trigger even without interrupts involved into the code). Also
CONFIG_SPACE_EFFICIENT looks not necessary.

The x8 name is confusing IMHO (when I read `8' I expect 8bits only, the x
isn't explicit enough). But by using a better name we could save some byte on
alpha ev6 and x86. Something like granular_char/granular_short/granular_int
looks nicer.  For the generic 64bit cpu they needs to be _unconditionally_
defined to `long'.

BTW, only old chips (ev[45]) doesn't provide byte granularity. Infact a linux
kernel compiled for ev6 can handle byte granularity also on alpha (it uses
-mcpu=ev6).

alpha reference manual 5.2.2:

	[..] For each region, an implementation must support aligned quadword
	access and may optionally support aligned longword access or byte
	access. If byte access is supported in a region, aligned word access
	and aligned longword access are also supported. [..]

21264hrm:

	[..] The 21264-generated external references to memory space are
	always of a fixed 64-byte size, though the internal access granularity
	is byte, word, longword, or quad-word. [..]

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-31 17:36               ` test13-pre5 Andi Kleen
@ 2000-12-31 18:19                 ` Andrea Arcangeli
  0 siblings, 0 replies; 62+ messages in thread
From: Andrea Arcangeli @ 2000-12-31 18:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Geert Uytterhoeven, Marcelo Tosatti,
	Kernel Mailing List

On Sun, Dec 31, 2000 at 06:36:50PM +0100, Andi Kleen wrote:
> AFAIK alpha has byte instructions now.

See other post. Only from ev6 (at least as far as gcc is concerned). I've an
userspace testcase here (it was originally an obscure alpha userspace MM
corruption bug report that I sorted out some time ago) that works only only
when compiled for ev6 because it needs `short' granularity (not even byte
granularity).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-31 18:07               ` test13-pre5 Matti Aarnio
@ 2000-12-31 19:15                 ` Linus Torvalds
  2000-12-31 19:49                   ` test13-pre5 Andi Kleen
  0 siblings, 1 reply; 62+ messages in thread
From: Linus Torvalds @ 2000-12-31 19:15 UTC (permalink / raw)
  To: linux-kernel

In article <20001231200741.F28963@mea-ext.zmailer.org>,
Matti Aarnio  <matti.aarnio@zmailer.org> wrote:
>
>	Actually nothing SMP specific in that problem sphere.
>	Alpha has  load-locked/store-conditional  pair for
>	this type of memory accesses to automatically detect,
>	and (conditionally) restart the operation - to form
>	classical  ``locked-read-modify-write'' operations.

Sure, we could make the older alphas use ldl_l stl_c for byte accesses,
but if you thought byte accesses on those machines were kind-of slow
before, just WAIT until that happens.

Old alpha machines (the same ones that would need this code) were
HORRIBLE at ldl_l<->stl_c: they go out all the way to the bus to set the
lock.  So suddenly your every byte access ends up being a few hundred
cycles!

So ldl_l/stc_l is not the answer.  It would work, but it would be so
slow that you'd be a lot better off not doing it. 

I think they fixed ldl/stc later on (so that it only sets a bit locally
that gets cleared by the cache coherency protocol), but as later alphas
have the byte accesses anyway that doesn't matter here. The faster
ldl/stc makes for much faster spinlocks on newer alphas, though.

		Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-31 19:15                 ` test13-pre5 Linus Torvalds
@ 2000-12-31 19:49                   ` Andi Kleen
  0 siblings, 0 replies; 62+ messages in thread
From: Andi Kleen @ 2000-12-31 19:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Sun, Dec 31, 2000 at 11:15:51AM -0800, Linus Torvalds wrote:
> In article <20001231200741.F28963@mea-ext.zmailer.org>,
> Matti Aarnio  <matti.aarnio@zmailer.org> wrote:
> >
> >	Actually nothing SMP specific in that problem sphere.
> >	Alpha has  load-locked/store-conditional  pair for
> >	this type of memory accesses to automatically detect,
> >	and (conditionally) restart the operation - to form
> >	classical  ``locked-read-modify-write'' operations.
> 
> Sure, we could make the older alphas use ldl_l stl_c for byte accesses,
> but if you thought byte accesses on those machines were kind-of slow
> before, just WAIT until that happens.

The older Alphas would just typedef x8/x16 (or granular_u8, granular_u16
or whatever it is called) to u32 and be the same as today. Just most
other boxes would benefit.

This actually all assumes that gcc really uses the byte instructions
for byte stores in structures, which is to be determined.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: test13-pre5
  2000-12-30 13:24         ` test13-pre5 Geert Uytterhoeven
  2000-12-31 17:21           ` test13-pre5 Andi Kleen
@ 2001-01-01 19:58           ` H. Peter Anvin
  1 sibling, 0 replies; 62+ messages in thread
From: H. Peter Anvin @ 2001-01-01 19:58 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <Pine.LNX.4.10.10012301422310.581-100000@cassiopeia.home>
By author:    Geert Uytterhoeven <geert@linux-m68k.org>
In newsgroup: linux.dev.kernel
> 
> What about defining new types for this? Like e.g. `x8', being `u8' on platforms
> were that's OK, and `u32' on platforms where that's more efficient?
> 

You may just want to look at how C99 handles this using <stdint.h>;
stdint.h defines types of the following format:

	 int, uint		... signed/unsigned

	 <size>			... exact size
	 _least<size>		... no smaller than
	 _fast<size>		... no smaller than, and efficient

	 _t

E.g. uint32_t, int_least64_t, uint_fast8_t (the latter could easily be
a 32-bit type, for eaxmple.)

In addition, constructor macros are defined, as well as (u)intmax_t
and (u)intptr_t; which are defined as the largest
possible integer and an integer large enough to hold a (void *),
respectively.

In other words:

	(void *)(uintptr_t)(void *)foo == (void *)foo

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] filemap_fdatasync & related changes
  2000-12-29 16:49     ` [PATCH] filemap_fdatasync & related changes Marcelo Tosatti
  2000-12-29 21:13       ` Linus Torvalds
@ 2001-01-03 16:37       ` Chris Mason
  2001-01-03 17:07         ` Daniel Phillips
                           ` (2 more replies)
  1 sibling, 3 replies; 62+ messages in thread
From: Chris Mason @ 2001-01-03 16:37 UTC (permalink / raw)
  To: Marcelo Tosatti, Linus Torvalds, Alexander Viro; +Cc: Kernel Mailing List

Hi guys,

Just noticed the filemap_fdatasync code doesn't check the return value from
writepage.  Linus, would you take a patch that redirtied the page, puts it
back onto the dirty list (at the tail), and unlocks the page when writepage
returns 1?

That would loop forever if the writepage func kept returning 1 though...I
think that's what we want, unless someone like ramfs made a writepage func
that always returned 1.

For the reiserfs code, I'd like to be able to tell page_launder and
filemap_fdatasync to try the page again later.  reiserfs would clean and
unlock the page if no io needs to be done at all.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] filemap_fdatasync & related changes
  2001-01-03 16:37       ` Chris Mason
@ 2001-01-03 17:07         ` Daniel Phillips
  2001-01-03 18:28         ` Linus Torvalds
  2001-01-04  9:48         ` Christoph Rohland
  2 siblings, 0 replies; 62+ messages in thread
From: Daniel Phillips @ 2001-01-03 17:07 UTC (permalink / raw)
  To: Chris Mason, linux-kernel

Chris Mason wrote:
> 
> Hi guys,
> 
> Just noticed the filemap_fdatasync code doesn't check the return value from
> writepage.  Linus, would you take a patch that redirtied the page, puts it
> back onto the dirty list (at the tail), and unlocks the page when writepage
> returns 1?

It would be a lot more consistent if ->writepage always unlocks the page
(or starts the chain of events that eventually unlocks the page) even
when it fails.

> That would loop forever if the writepage func kept returning 1 though...I
> think that's what we want, unless someone like ramfs made a writepage func
> that always returned 1.

Even then it's ok.  Without major surgery, those pages are going to loop
forever somewhere, it might as well be the inactive_dirty list.  This is
going to mess up the mm balancing calculations, but of course only when
somebody is actually using ramfs.  Maybe somebody will come up with a
tidy way of getting those pages entirely off the mm lists so they doing
just keep eating cpu cycles.

By the way, the whole concept of ramfs is really elegant - ramdisks are
just wrong, because all the ramdisk data in cache is in memory twice. 
Solution: throw away the disk, keep the cache.  Nice.

> For the reiserfs code, I'd like to be able to tell page_launder and
> filemap_fdatasync to try the page again later.  reiserfs would clean and
> unlock the page if no io needs to be done at all.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] filemap_fdatasync & related changes
  2001-01-03 16:37       ` Chris Mason
  2001-01-03 17:07         ` Daniel Phillips
@ 2001-01-03 18:28         ` Linus Torvalds
  2001-01-03 19:09           ` Chris Mason
  2001-01-04 18:46           ` Stephen C. Tweedie
  2001-01-04  9:48         ` Christoph Rohland
  2 siblings, 2 replies; 62+ messages in thread
From: Linus Torvalds @ 2001-01-03 18:28 UTC (permalink / raw)
  To: Chris Mason; +Cc: Marcelo Tosatti, Alexander Viro, Kernel Mailing List



On Wed, 3 Jan 2001, Chris Mason wrote:
> 
> Just noticed the filemap_fdatasync code doesn't check the return value from
> writepage.  Linus, would you take a patch that redirtied the page, puts it
> back onto the dirty list (at the tail), and unlocks the page when writepage
> returns 1?

I don't see how that would help. It's just asking for endless loops.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] filemap_fdatasync & related changes
  2001-01-03 18:28         ` Linus Torvalds
@ 2001-01-03 19:09           ` Chris Mason
  2001-01-03 21:39             ` Chris Wedgwood
  2001-01-04 18:46           ` Stephen C. Tweedie
  1 sibling, 1 reply; 62+ messages in thread
From: Chris Mason @ 2001-01-03 19:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marcelo Tosatti, Alexander Viro, Kernel Mailing List



On Wednesday, January 03, 2001 10:28:05 AM -0800 Linus Torvalds
<torvalds@transmeta.com> wrote:
 
> On Wed, 3 Jan 2001, Chris Mason wrote:
>> 
>> Just noticed the filemap_fdatasync code doesn't check the return value
>> from writepage.  Linus, would you take a patch that redirtied the page,
>> puts it back onto the dirty list (at the tail), and unlocks the page
>> when writepage returns 1?
> 
> I don't see how that would help. It's just asking for endless loops.
> 
Yes, we would be explictly saying it is an error to always return 1.

The problem with [mf]sync is that we really want to write the page before
returning back to the application.  So, skipping dirty pages that need to
hit the disk is just as big an error.  Since the FS can't know the context
of when writepage is called, it would be obligated to never return 1 when
the page has data that needs to hit disk. 

Is there an easier/better way to tell writepage callers to retry the page
write later?

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] filemap_fdatasync & related changes
  2001-01-03 19:09           ` Chris Mason
@ 2001-01-03 21:39             ` Chris Wedgwood
  0 siblings, 0 replies; 62+ messages in thread
From: Chris Wedgwood @ 2001-01-03 21:39 UTC (permalink / raw)
  To: Chris Mason
  Cc: Linus Torvalds, Marcelo Tosatti, Alexander Viro,
	Kernel Mailing List

On Wed, Jan 03, 2001 at 02:09:30PM -0500, Chris Mason wrote:

    The problem with [mf]sync is that we really want to write the
    page before returning back to the application.

We don't _want_ to -- we MUST. Some applications assume fsync and
fdatasync work correctly; if they do not the all bests are off.

This includes many MTAs, MDAs and databases (msync also being
important).

Looping somewhere before returning to the application is by far
better than returning when some pages have yet to be flushed.



  --cw

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] filemap_fdatasync & related changes
  2001-01-03 16:37       ` Chris Mason
  2001-01-03 17:07         ` Daniel Phillips
  2001-01-03 18:28         ` Linus Torvalds
@ 2001-01-04  9:48         ` Christoph Rohland
  2001-01-04 14:41           ` Marcelo Tosatti
  2001-01-04 15:30           ` Chris Mason
  2 siblings, 2 replies; 62+ messages in thread
From: Christoph Rohland @ 2001-01-04  9:48 UTC (permalink / raw)
  To: Chris Mason
  Cc: Marcelo Tosatti, Linus Torvalds, Alexander Viro,
	Kernel Mailing List

Chris Mason <mason@suse.com> writes:
> Just noticed the filemap_fdatasync code doesn't check the return value from
> writepage.  Linus, would you take a patch that redirtied the page, puts it
> back onto the dirty list (at the tail), and unlocks the page when writepage
> returns 1?
> 
> That would loop forever if the writepage func kept returning 1 though...I
> think that's what we want, unless someone like ramfs made a writepage func
> that always returned 1.

shmem has such a writepage for locked shm segments. It also always
return 1 if the swap space is exhausted. So everybody using shared
anonymous, SYSV shared or POSIX shared memory can hit this.

I invented the return code 1 exactly to be able to handle this.

Greetings
                Chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] filemap_fdatasync & related changes
  2001-01-04  9:48         ` Christoph Rohland
@ 2001-01-04 14:41           ` Marcelo Tosatti
  2001-01-04 16:58             ` Rik van Riel
  2001-01-04 15:30           ` Chris Mason
  1 sibling, 1 reply; 62+ messages in thread
From: Marcelo Tosatti @ 2001-01-04 14:41 UTC (permalink / raw)
  To: Christoph Rohland
  Cc: Chris Mason, Linus Torvalds, Alexander Viro, Kernel Mailing List


On 4 Jan 2001, Christoph Rohland wrote:

> Chris Mason <mason@suse.com> writes:
> > Just noticed the filemap_fdatasync code doesn't check the return value from
> > writepage.  Linus, would you take a patch that redirtied the page, puts it
> > back onto the dirty list (at the tail), and unlocks the page when writepage
> > returns 1?
> > 
> > That would loop forever if the writepage func kept returning 1 though...I
> > think that's what we want, unless someone like ramfs made a writepage func
> > that always returned 1.
> 
> shmem has such a writepage for locked shm segments. 

The correct thing seems to be wait for this lock in case of
fdatasync/sync/etc, then.

What about a sync parameter to ->writepage() ? 

> return 1 if the swap space is exhausted. So everybody using shared
> anonymous, SYSV shared or POSIX shared memory can hit this.

In case of swap exhausted I think we should just fail.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] filemap_fdatasync & related changes
  2001-01-04  9:48         ` Christoph Rohland
  2001-01-04 14:41           ` Marcelo Tosatti
@ 2001-01-04 15:30           ` Chris Mason
  2001-01-04 17:01             ` Christoph Rohland
  1 sibling, 1 reply; 62+ messages in thread
From: Chris Mason @ 2001-01-04 15:30 UTC (permalink / raw)
  To: Christoph Rohland
  Cc: Marcelo Tosatti, Linus Torvalds, Alexander Viro,
	Kernel Mailing List



On Thursday, January 04, 2001 10:48:13 AM +0100 Christoph Rohland
<cr@sap.com> wrote:

> Chris Mason <mason@suse.com> writes:
>> Just noticed the filemap_fdatasync code doesn't check the return value
>> from writepage.  Linus, would you take a patch that redirtied the page,
>> puts it back onto the dirty list (at the tail), and unlocks the page
>> when writepage returns 1?
>> 
>> That would loop forever if the writepage func kept returning 1 though...I
>> think that's what we want, unless someone like ramfs made a writepage
>> func that always returned 1.
> 
> shmem has such a writepage for locked shm segments. It also always
> return 1 if the swap space is exhausted. So everybody using shared
> anonymous, SYSV shared or POSIX shared memory can hit this.
> 
> I invented the return code 1 exactly to be able to handle this.
> 

Yes, right now the shmem writepage calls are the only ones returning one at
all.  But, the question of how to properly fsync/msync these kinds of pages
still stands.  Returning from an fsync before writing them isn't correct.

Either way, filemap_fdatasync needs fixing, since it drops the dirty bit on
these pages.  They'll never get written after going through it.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] filemap_fdatasync & related changes
  2001-01-04 14:41           ` Marcelo Tosatti
@ 2001-01-04 16:58             ` Rik van Riel
  2001-01-04 17:38               ` Chris Mason
  0 siblings, 1 reply; 62+ messages in thread
From: Rik van Riel @ 2001-01-04 16:58 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Rohland, Chris Mason, Linus Torvalds, Alexander Viro,
	Kernel Mailing List

On Thu, 4 Jan 2001, Marcelo Tosatti wrote:
> On 4 Jan 2001, Christoph Rohland wrote:
> > Chris Mason <mason@suse.com> writes:

> > > That would loop forever if the writepage func kept returning 1 though
	[snip]

> > return 1 if the swap space is exhausted. So everybody using shared
> > anonymous, SYSV shared or POSIX shared memory can hit this.
> 
> In case of swap exhausted I think we should just fail.

No. In case of swap exhausted the page needs to be moved to
the active list so we don't clog up the inactive_dirty list
and have the possibility of a deadlock (already done).

In case of "write ordering not yet satisfied", the writepage
function should leave the page on the inactive_dirty list and
maybe write out something else (reiserfs, ext3, xfs...).

regards,

Rik
--
Hollywood goes for world dumbination,
	Trailer at 11.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] filemap_fdatasync & related changes
  2001-01-04 15:30           ` Chris Mason
@ 2001-01-04 17:01             ` Christoph Rohland
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Rohland @ 2001-01-04 17:01 UTC (permalink / raw)
  To: Chris Mason
  Cc: Marcelo Tosatti, Linus Torvalds, Alexander Viro,
	Kernel Mailing List

Chris Mason <mason@suse.com> writes:

> Yes, right now the shmem writepage calls are the only ones returning one at
> all.  But, the question of how to properly fsync/msync these kinds of pages
> still stands.  Returning from an fsync before writing them isn't correct.

Yes, and [fm]sync should not do anything on shmem pages. There is
nothing to sync. So everything is fine.

        Christoph

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] filemap_fdatasync & related changes
  2001-01-04 16:58             ` Rik van Riel
@ 2001-01-04 17:38               ` Chris Mason
  2001-01-04 18:00                 ` Linus Torvalds
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Mason @ 2001-01-04 17:38 UTC (permalink / raw)
  To: Rik van Riel, Marcelo Tosatti
  Cc: Christoph Rohland, Linus Torvalds, Alexander Viro,
	Kernel Mailing List

Ok, lets just fix filemap_fdatasync.  We can tackle the msync/fsync
interaction with screwed up FS writepages later, since all of the existing
writepage funcs are safe.  The problems I see with filemap_fdatasync when
writepage returns 1:

The page dirty bit is not reset.
the page is never unlocked.

So how about something like this in filemap_fdatasync

ret = writepage(page) ;
if (ret == 1) {
    /* writepage declined to write it out,
    ** leave it on the locked list, but make
    ** sure the dirty bit stays on so the page
    ** doesn't disappear
    */
    SetPageDirty(page) ;
    UnlockPage(page) ;
}

And then, in filemap_fdatawait:

list_del(&page->list) ;
if (PageDirty(page))
    list_add(&page->list, &mapping->dirty_pages) ;
else
    list_add(&page->list, &mapping->clean_pages) ;

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] filemap_fdatasync & related changes
  2001-01-04 17:38               ` Chris Mason
@ 2001-01-04 18:00                 ` Linus Torvalds
  2001-01-04 22:52                   ` Daniel Phillips
  0 siblings, 1 reply; 62+ messages in thread
From: Linus Torvalds @ 2001-01-04 18:00 UTC (permalink / raw)
  To: Chris Mason
  Cc: Rik van Riel, Marcelo Tosatti, Christoph Rohland, Alexander Viro,
	Kernel Mailing List



On Thu, 4 Jan 2001, Chris Mason wrote:
>
> The page dirty bit is not reset.
> the page is never unlocked.
> 
> So how about something like this in filemap_fdatasync

I'd rather just change the rule that "writepage()" will clear the dirty
bit itself and always unlock (and "1" just to inform the upper layers that
the page cannot be thrown away).

> And then, in filemap_fdatawait:
> 
> list_del(&page->list) ;
> if (PageDirty(page))
>     list_add(&page->list, &mapping->dirty_pages) ;
> else
>     list_add(&page->list, &mapping->clean_pages) ;

Can't be done. The thing depends on the fact that it always depletes the
dirty list. The whole fdatasync() logic hinges on the fact that it will
always move pages "down" - otherwise it will never complete.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] filemap_fdatasync & related changes
  2001-01-03 18:28         ` Linus Torvalds
  2001-01-03 19:09           ` Chris Mason
@ 2001-01-04 18:46           ` Stephen C. Tweedie
  1 sibling, 0 replies; 62+ messages in thread
From: Stephen C. Tweedie @ 2001-01-04 18:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Mason, Marcelo Tosatti, Alexander Viro, Kernel Mailing List,
	Stephen Tweedie

hi,

On Wed, Jan 03, 2001 at 10:28:05AM -0800, Linus Torvalds wrote:
> 
> On Wed, 3 Jan 2001, Chris Mason wrote:
> > 
> > Just noticed the filemap_fdatasync code doesn't check the return value from
> > writepage.  Linus, would you take a patch that redirtied the page, puts it
> > back onto the dirty list (at the tail), and unlocks the page when writepage
> > returns 1?
> 
> I don't see how that would help. It's just asking for endless loops.

Ultimately our whole IO-error handling for writes is fundamentally
broken right now, at least at the buffer_head level.  We just don't
have any mechanism in the buffer_head for distinguishing between write
errors and read errors.  If you get a write error, the next reader who
happens upon that buffer will see an invalid buffer and will try to
reread from disk, discarding the newer data from cache that we had
problems writing.

It's less of a problem with the page cache in 2.4 because such a write
error doesn't set the page to being not-uptodate, but the problem is
still there.  If we're going to start trying to be rigorous about
propagating driver write failure notification back up to user space
(eg. for O_SYNC), then ideally we need to do it in conjunction with a
rethink of the end_request() code.

The only reason I've come up against this particular problem recently
is that it impacts journaling filesystems.  It can lead to buffers in
memory suddenly appearing non-uptodate after a write, causing
subsequent reads to submit disk IOs on a buffer which the journaling
code expects to be unlocked.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] filemap_fdatasync & related changes
  2001-01-04 18:00                 ` Linus Torvalds
@ 2001-01-04 22:52                   ` Daniel Phillips
  0 siblings, 0 replies; 62+ messages in thread
From: Daniel Phillips @ 2001-01-04 22:52 UTC (permalink / raw)
  To: Linus Torvalds, Chris Mason, linux-kernel

Linus Torvalds wrote:
> I'd rather just change the rule that "writepage()" will clear the dirty
> bit itself and always unlock (and "1" just to inform the upper layers that
> the page cannot be thrown away).

Change to that rule or from?  I *think* you just said:

  - If ->writepage successfully starts writeout on the page it clears
the dirty bit and returns 0

  - If not successful, ->writepage unlocks the page and return 1

Who is going to be responsible for checking that the page dirty bit was
really set, and what will happen if it wasn't?

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2001-01-04 22:55 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2000-12-28 20:25 test13-pre5 Linus Torvalds
2000-12-28 18:39 ` test13-pre5 Marcelo Tosatti
2000-12-28 20:59   ` test13-pre5 Linus Torvalds
2000-12-28 19:22     ` test13-pre5 Marcelo Tosatti
2000-12-28 21:21       ` [wildly off-topic] test13-pre5 Rik van Riel
2000-12-28 19:31         ` Marcelo Tosatti
2000-12-28 21:39     ` test13-pre5 Daniel Phillips
2000-12-28 22:14     ` test13-pre5 Marcelo Tosatti
2000-12-29  1:10       ` test13-pre5 Linus Torvalds
2000-12-29 12:06       ` test13-pre5 Daniel Phillips
2000-12-28 22:17     ` test13-pre5 Andi Kleen
2000-12-28 22:33       ` test13-pre5 David S. Miller
2000-12-28 22:58         ` test13-pre5 Andi Kleen
2000-12-28 22:54           ` test13-pre5 David S. Miller
2000-12-28 23:17             ` test13-pre5 Andi Kleen
2000-12-28 23:14               ` test13-pre5 David S. Miller
2000-12-28 23:39                 ` test13-pre5 Andi Kleen
2000-12-28 23:25               ` test13-pre5 Rik van Riel
2000-12-28 23:36               ` test13-pre5 Linus Torvalds
2000-12-28 23:03           ` test13-pre5 Rik van Riel
2000-12-29 15:46         ` test13-pre5 Mark Hemment
2000-12-29 16:30           ` test13-pre5 Tim Wright
2000-12-29 17:54             ` test13-pre5 Mark Hemment
2000-12-29 21:23           ` test13-pre5 David S. Miller
2000-12-29 21:51             ` test13-pre5 Linus Torvalds
2000-12-28 23:15       ` test13-pre5 Linus Torvalds
2000-12-28 23:25         ` test13-pre5 Andi Kleen
2000-12-28 23:37           ` test13-pre5 Linus Torvalds
2000-12-28 23:43             ` test13-pre5 Andi Kleen
2000-12-29  0:49             ` test13-pre5 Stefan Traby
2000-12-29  1:23               ` test13-pre5 Linus Torvalds
2000-12-29  1:06                 ` test13-pre5 Albert Cranford
2000-12-29 14:53                   ` test13-pre5 Stefan Traby
2000-12-30 13:24         ` test13-pre5 Geert Uytterhoeven
2000-12-31 17:21           ` test13-pre5 Andi Kleen
2000-12-31 17:27             ` test13-pre5 Linus Torvalds
2000-12-31 17:36               ` test13-pre5 Andi Kleen
2000-12-31 18:19                 ` test13-pre5 Andrea Arcangeli
2000-12-31 18:06               ` test13-pre5 Andi Kleen
2000-12-31 18:07               ` test13-pre5 Matti Aarnio
2000-12-31 19:15                 ` test13-pre5 Linus Torvalds
2000-12-31 19:49                   ` test13-pre5 Andi Kleen
2000-12-31 18:10               ` test13-pre5 Andrea Arcangeli
2001-01-01 19:58           ` test13-pre5 H. Peter Anvin
2000-12-29 16:49     ` [PATCH] filemap_fdatasync & related changes Marcelo Tosatti
2000-12-29 21:13       ` Linus Torvalds
2001-01-03 16:37       ` Chris Mason
2001-01-03 17:07         ` Daniel Phillips
2001-01-03 18:28         ` Linus Torvalds
2001-01-03 19:09           ` Chris Mason
2001-01-03 21:39             ` Chris Wedgwood
2001-01-04 18:46           ` Stephen C. Tweedie
2001-01-04  9:48         ` Christoph Rohland
2001-01-04 14:41           ` Marcelo Tosatti
2001-01-04 16:58             ` Rik van Riel
2001-01-04 17:38               ` Chris Mason
2001-01-04 18:00                 ` Linus Torvalds
2001-01-04 22:52                   ` Daniel Phillips
2001-01-04 15:30           ` Chris Mason
2001-01-04 17:01             ` Christoph Rohland
2000-12-28 21:27 ` test13-pre5 (via82cxxx_audio.c) Jonathan Hudson
2000-12-29 18:17 ` test13-pre5 Tom Rini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox