Re: [PATCH 3/3] mm: Swap checksum

From: Nick Piggin <npiggin@suse.de>
To: Avi Kivity <avi@redhat.com>
Cc: Cesar Eduardo Barros <cesarb@cesarb.net>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/3] mm: Swap checksum
Date: Mon, 24 May 2010 17:32:59 +1000
Message-ID: <20100524073259.GW2516@laptop>
In-Reply-To: <4BFA1F92.2080802@redhat.com>

On Mon, May 24, 2010 at 09:41:22AM +0300, Avi Kivity wrote:
> On 05/23/2010 09:58 PM, Cesar Eduardo Barros wrote:
> >On 23-05-2010 12:19, Avi Kivity wrote:
> >>On 64-bit, we may be able to store the checksum in the pte, if the swap
> >>device is small enough.
> >
> >Which pte?
> 
> All of them.
> 
> >Correct me if I am wrong, but I do not think all pages written to
> >the swap have exactly one pte pointing to them. And I have not
> >looked at the shmem.c code yet, but does it even use ptes?
> 
> Well, the ptes need the swap address written into them, so they are
> already found and updated somehow.  All that's needed is to update
> the value written to also include the checksum.
> 
> >It might be possible (find all ptes and write the 32-bit checksum
> >to them, do something else for shmem, have two different code
> >paths for small/large swapfiles), but I do not know if the memory
> >savings are worth the extra complexity (especially the need for
> >two separate code paths).
> 
> Certainly not at first, but later it may be worthwhile.
> 
> >
> >>If we take the trouble to touch the page, we may as well compare it
> >>against zero, and if so drop it instead of swapping it out.
> >
> >The problem with this is that the page is touched deep inside the
> >crc32c code, which might even be using hardware instructions
> >(crc32c-intel). So we would need to read it two times to compare
> >against zero.
> 
> The second read is very cheap since the page is already in cache.
> Also, we fail early when any word is nonzero, so usually the compare
> exits quickly.

For a page being written back from pagecache to disk, or for a
page being swapped out, the contents are likely cache cold and
likely not to be used in future either. Therefore a crc routine
for that would do well to minimise cache pollution.
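
For reference, the early-exit compare Avi describes might look roughly
like the sketch below. It is a userspace illustration only: the name
sketch_page_is_zero and the fixed 4096-byte page size are assumptions,
not anything from the patch, and it does nothing to address the cache
pollution concern raised above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed page size for this userspace sketch; kernel code would use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096

/*
 * Hypothetical helper: return true if the page contains only zero bytes.
 * Reads one machine word at a time and stops at the first nonzero word,
 * so a page with early nonzero data is rejected after a few loads.
 * The buffer must be suitably aligned for unsigned long.
 */
static bool sketch_page_is_zero(const void *page)
{
	const unsigned long *word = page;
	size_t i;

	for (i = 0; i < SKETCH_PAGE_SIZE / sizeof(*word); i++) {
		if (word[i] != 0)
			return false;	/* early exit on first nonzero word */
	}
	return true;
}
```

The worst case is an all-zero page, which has to be read in full before
the function can return true.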


> >One possibility could be to compare the full page against zero
> >only if its crc is a specific value (the crc32c of a page full of
> >zeros). This would not be too slow (we would be wasting time only
> >when we have a very high probability of saving much more time),
> >and not need to touch the crc32c code at all. I would only have to
> >look at how this messes up the state tracking (i.e. how to make it
> >track the fact that, instead of getting written out, this is now a
> >zeroed page).
> 
> Instead of returning a swap pte to be written to the page tables,
> return a zeroed pte.

A pte_none pte, to be precise.
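
A minimal sketch of Cesar's idea, assuming a plain bitwise CRC32C with
the usual all-ones initial value and final inversion (the patch may seed
it differently). All names here (sketch_crc32c, sketch_classify_page,
the enum) are hypothetical and written for userspace; the full compare
runs only when the checksum already matches that of a zero page.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed page size for this userspace sketch; kernel code would use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096

/* Reference all-zero page used for the checksum and the full compare. */
static const unsigned char sketch_zero_page[SKETCH_PAGE_SIZE];

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. Slow but simple. */
static uint32_t sketch_crc32c(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t crc = 0xFFFFFFFFu;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= p[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc ^ 0xFFFFFFFFu;
}

enum sketch_swap_action {
	SKETCH_WRITE_TO_SWAP,	/* caller would install a swap pte */
	SKETCH_DROP_AS_ZERO	/* caller would install a pte_none pte */
};

/*
 * Hypothetical helper: checksum the page, and only if the checksum equals
 * that of a zero page pay for a full compare to rule out a collision.
 * The lazy cache below is not thread safe; real code would precompute it.
 */
static enum sketch_swap_action sketch_classify_page(const void *page,
						    uint32_t *crc_out)
{
	static uint32_t zero_crc;
	static int zero_crc_valid;
	uint32_t crc = sketch_crc32c(page, SKETCH_PAGE_SIZE);

	if (!zero_crc_valid) {
		zero_crc = sketch_crc32c(sketch_zero_page, SKETCH_PAGE_SIZE);
		zero_crc_valid = 1;
	}

	*crc_out = crc;
	if (crc == zero_crc &&
	    memcmp(page, sketch_zero_page, SKETCH_PAGE_SIZE) == 0)
		return SKETCH_DROP_AS_ZERO;
	return SKETCH_WRITE_TO_SWAP;
}
```

With this shape the checksum code itself is untouched, and the extra
read of the page happens only for pages that are almost certainly zero.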

I wonder, though. If we no longer trust block devices to give the
correct data back, should we provide a meta block device to do error
detection? No production filesystem on Linux has checksums (well, ext4
has a few). Of the ones that add checksumming, I'd say most will not do
data checksumming (and for direct IO it is not done).
