* [RFC PATCH 0/7] evacuate struct page from the block layer
@ 2015-03-16 20:25 Dan Williams
2015-03-16 20:25 ` Dan Williams
` (3 more replies)
0 siblings, 4 replies; 59+ messages in thread
From: Dan Williams @ 2015-03-16 20:25 UTC (permalink / raw)
To: linux-kernel
Cc: linux-arch, axboe, riel, linux-nvdimm, Dave Hansen, linux-raid,
mgorman, hch, linux-fsdevel, Matthew Wilcox
Avoid the impending disaster of requiring struct page coverage for what
is expected to be ever increasing capacities of persistent memory. In
conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the
recently concluded Linux Storage Summit it became clear that struct page
is not required in many places; it was simply convenient to re-use.
Introduce helpers and infrastructure to remove struct page usage where
it is not necessary. One use case for these changes is to implement a
write-back-cache in persistent memory for software-RAID. Another use
case for the scatterlist changes is RDMA to a pfn-range.
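For illustration, the conversion pattern for existing users is small and
mechanical; a minimal sketch (using the bvec_page()/bvec_set_page()
accessors added by patch 1 and converted to __pfn_t by patch 2):

	/* before: assume a struct page is always present */
	void *dst = page_address(bv->bv_page) + bv->bv_offset;

	/* after: go through the accessor; the bio_vec itself carries a __pfn_t */
	void *dst = page_address(bvec_page(bv)) + bv->bv_offset;
	bvec_set_page(bv, page);	/* writers use the setter */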
This compiles and boots, but 0day-kbuild-robot coverage is needed before
this set exits "RFC". Obviously, the Coccinelle script needs to be
re-run on the block updates for kernel.next. As-is, this only includes
the resulting auto-generated patch against 4.0-rc3.
---
Dan Williams (6):
block: add helpers for accessing a bio_vec page
block: convert bio_vec.bv_page to bv_pfn
dma-mapping: allow archs to optionally specify a ->map_pfn() operation
scatterlist: use sg_phys()
x86: support dma_map_pfn()
block: base support for pfn i/o
Matthew Wilcox (1):
scatterlist: support "page-less" (__pfn_t only) entries
arch/Kconfig | 3 +
arch/arm/mm/dma-mapping.c | 2 -
arch/microblaze/kernel/dma.c | 2 -
arch/powerpc/sysdev/axonram.c | 2 -
arch/x86/Kconfig | 12 +++
arch/x86/kernel/amd_gart_64.c | 22 ++++--
arch/x86/kernel/pci-nommu.c | 22 ++++--
arch/x86/kernel/pci-swiotlb.c | 4 +
arch/x86/pci/sta2x11-fixup.c | 4 +
arch/x86/xen/pci-swiotlb-xen.c | 4 +
block/bio-integrity.c | 8 +-
block/bio.c | 83 +++++++++++++++------
block/blk-core.c | 9 ++
block/blk-integrity.c | 7 +-
block/blk-lib.c | 2 -
block/blk-merge.c | 15 ++--
block/bounce.c | 26 +++----
drivers/block/aoe/aoecmd.c | 8 +-
drivers/block/brd.c | 2 -
drivers/block/drbd/drbd_bitmap.c | 5 +
drivers/block/drbd/drbd_main.c | 4 +
drivers/block/drbd/drbd_receiver.c | 4 +
drivers/block/drbd/drbd_worker.c | 3 +
drivers/block/floppy.c | 6 +-
drivers/block/loop.c | 8 +-
drivers/block/nbd.c | 8 +-
drivers/block/nvme-core.c | 2 -
drivers/block/pktcdvd.c | 11 ++-
drivers/block/ps3disk.c | 2 -
drivers/block/ps3vram.c | 2 -
drivers/block/rbd.c | 2 -
drivers/block/rsxx/dma.c | 3 +
drivers/block/umem.c | 2 -
drivers/block/zram/zram_drv.c | 10 +--
drivers/dma/ste_dma40.c | 5 -
drivers/iommu/amd_iommu.c | 21 ++++-
drivers/iommu/intel-iommu.c | 26 +++++--
drivers/iommu/iommu.c | 2 -
drivers/md/bcache/btree.c | 4 +
drivers/md/bcache/debug.c | 6 +-
drivers/md/bcache/movinggc.c | 2 -
drivers/md/bcache/request.c | 6 +-
drivers/md/bcache/super.c | 10 +--
drivers/md/bcache/util.c | 5 +
drivers/md/bcache/writeback.c | 2 -
drivers/md/dm-crypt.c | 12 ++-
drivers/md/dm-io.c | 2 -
drivers/md/dm-verity.c | 2 -
drivers/md/raid1.c | 50 +++++++------
drivers/md/raid10.c | 38 +++++-----
drivers/md/raid5.c | 6 +-
drivers/mmc/card/queue.c | 4 +
drivers/s390/block/dasd_diag.c | 2 -
drivers/s390/block/dasd_eckd.c | 14 ++--
drivers/s390/block/dasd_fba.c | 6 +-
drivers/s390/block/dcssblk.c | 2 -
drivers/s390/block/scm_blk.c | 2 -
drivers/s390/block/scm_blk_cluster.c | 2 -
drivers/s390/block/xpram.c | 2 -
drivers/scsi/mpt2sas/mpt2sas_transport.c | 6 +-
drivers/scsi/mpt3sas/mpt3sas_transport.c | 6 +-
drivers/scsi/sd_dif.c | 4 +
drivers/staging/android/ion/ion_chunk_heap.c | 4 +
drivers/staging/lustre/lustre/llite/lloop.c | 2 -
drivers/xen/biomerge.c | 4 +
drivers/xen/swiotlb-xen.c | 29 +++++--
fs/btrfs/check-integrity.c | 6 +-
fs/btrfs/compression.c | 12 ++-
fs/btrfs/disk-io.c | 4 +
fs/btrfs/extent_io.c | 8 +-
fs/btrfs/file-item.c | 8 +-
fs/btrfs/inode.c | 18 +++--
fs/btrfs/raid56.c | 4 +
fs/btrfs/volumes.c | 2 -
fs/buffer.c | 4 +
fs/direct-io.c | 2 -
fs/exofs/ore.c | 4 +
fs/exofs/ore_raid.c | 2 -
fs/ext4/page-io.c | 2 -
fs/f2fs/data.c | 4 +
fs/f2fs/segment.c | 2 -
fs/gfs2/lops.c | 4 +
fs/jfs/jfs_logmgr.c | 4 +
fs/logfs/dev_bdev.c | 10 +--
fs/mpage.c | 2 -
fs/splice.c | 2 -
include/asm-generic/dma-mapping-common.h | 30 ++++++++
include/asm-generic/memory_model.h | 4 +
include/asm-generic/scatterlist.h | 6 ++
include/crypto/scatterwalk.h | 10 +++
include/linux/bio.h | 24 +++---
include/linux/blk_types.h | 21 +++++
include/linux/blkdev.h | 2 +
include/linux/dma-debug.h | 23 +++++-
include/linux/dma-mapping.h | 8 ++
include/linux/scatterlist.h | 101 ++++++++++++++++++++++++--
include/linux/swiotlb.h | 5 +
kernel/power/block_io.c | 2 -
lib/dma-debug.c | 4 +
lib/swiotlb.c | 20 ++++-
mm/iov_iter.c | 22 +++---
mm/page_io.c | 8 +-
net/ceph/messenger.c | 2 -
103 files changed, 658 insertions(+), 335 deletions(-)
^ permalink raw reply	[flat|nested] 59+ messages in thread
* [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
2015-03-16 20:25 ` Dan Williams
@ 2015-03-16 20:25 ` Dan Williams
2015-03-16 20:25   ` Dan Williams
2015-03-16 23:05   ` Al Viro
2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh
2015-03-18 20:26 ` Andrew Morton
3 siblings, 2 replies; 59+ messages in thread
From: Dan Williams @ 2015-03-16 20:25 UTC (permalink / raw)
To: linux-kernel
Cc: linux-arch, axboe, riel, linux-nvdimm, Dave Hansen, linux-raid,
    mgorman, hch, linux-fsdevel, Matthew Wilcox

Carry a pfn in a bio_vec rather than a page in support of allowing
bio(s) to reference unmapped (not struct page backed) persistent
memory.

As Dave Hansen points out, it would be unfortunate if we ended up with
less type safety after this conversion, so introduce __pfn_t.

Cc: Matthew Wilcox <willy@linux.intel.com>
[willy: use pfn_t]
[kvm: "no, use __pfn_t, we already stole pfn_t"]
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/bio.c | 1 +
 block/blk-integrity.c | 4 ++--
 block/blk-merge.c | 6 +++---
 block/bounce.c | 2 +-
 drivers/md/bcache/btree.c | 2 +-
 include/asm-generic/memory_model.h | 4 ++++
 include/linux/bio.h | 20 +++++++++++---------
 include/linux/blk_types.h | 14 +++++++++++---
 include/linux/scatterlist.h | 16 ++++++++++++++++
 include/linux/swiotlb.h | 1 +
 mm/iov_iter.c | 22 +++++++++++-----------
 mm/page_io.c | 2 +-
 12 files changed, 63 insertions(+), 31 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 7100fd6d5898..3d494e85e16d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -28,6 +28,7 @@
 #include <linux/mempool.h>
 #include <linux/workqueue.h>
 #include <linux/cgroup.h>
+#include <linux/scatterlist.h>

 #include <trace/events/block.h>

diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 6c8b1d63e90b..34e53951a0d1 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -43,7 +43,7 @@ static const char *bi_unsupported_name = "unsupported";
  */
 int blk_rq_count_integrity_sg(struct request_queue *q, struct bio *bio)
 {
-	struct bio_vec iv, ivprv = { NULL };
+	struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
 	unsigned int segments = 0;
 	unsigned int seg_size = 0;
 	struct bvec_iter iter;
@@ -89,7 +89,7 @@ EXPORT_SYMBOL(blk_rq_count_integrity_sg);
 int blk_rq_map_integrity_sg(struct request_queue *q, struct bio *bio,
 			    struct scatterlist *sglist)
 {
-	struct bio_vec iv, ivprv = { NULL };
+	struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
 	struct scatterlist *sg = NULL;
 	unsigned int segments = 0;
 	struct bvec_iter iter;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 39bd9925c057..8420d553b8ef 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -13,7 +13,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 					     struct bio *bio,
 					     bool no_sg_merge)
 {
-	struct bio_vec bv, bvprv = { NULL };
+	struct bio_vec bv, bvprv = BIO_VEC_INIT(bvprv);
 	int cluster, high, highprv = 1;
 	unsigned int seg_size, nr_phys_segs;
 	struct bio *fbio, *bbio;
@@ -123,7 +123,7 @@ EXPORT_SYMBOL(blk_recount_segments);
 static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
 				   struct bio *nxt)
 {
-	struct bio_vec end_bv = { NULL }, nxt_bv;
+	struct bio_vec end_bv = BIO_VEC_INIT(end_bv), nxt_bv;
 	struct bvec_iter iter;

 	if (!blk_queue_cluster(q))
@@ -202,7 +202,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 			     struct scatterlist *sglist,
 			     struct scatterlist **sg)
 {
-	struct bio_vec bvec, bvprv = { NULL };
+	struct bio_vec bvec, bvprv = BIO_VEC_INIT(bvprv);
 	struct bvec_iter iter;
 	int nsegs, cluster;

diff --git a/block/bounce.c b/block/bounce.c
index 0390e44d6e1b..4a3098067c81 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -64,7 +64,7 @@ static void bounce_copy_vec(struct bio_vec *to, unsigned char *vfrom)
 #else /* CONFIG_HIGHMEM */

 #define bounce_copy_vec(to, vfrom)	\
-	memcpy(page_address((to)->bv_page) + (to)->bv_offset, vfrom, (to)->bv_len)
+	memcpy(page_address(bvec_page(to)) + (to)->bv_offset, vfrom, (to)->bv_len)

 #endif /* CONFIG_HIGHMEM */

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2e76e8b62902..36bbe29a806b 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -426,7 +426,7 @@ static void do_btree_node_write(struct btree *b)
 		void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));

 		bio_for_each_segment_all(bv, b->bio, j)
-			memcpy(page_address(bv->bv_page),
+			memcpy(page_address(bvec_page(bv)),
 			       base + j * PAGE_SIZE, PAGE_SIZE);

 		bch_submit_bbio(b->bio, b->c, &k.key, 0);
diff --git a/include/asm-generic/memory_model.h b/include/asm-generic/memory_model.h
index 14909b0b9cae..e6c2fda25820 100644
--- a/include/asm-generic/memory_model.h
+++ b/include/asm-generic/memory_model.h
@@ -72,6 +72,10 @@
 #define page_to_pfn __page_to_pfn
 #define pfn_to_page __pfn_to_page

+typedef struct {
+	unsigned long pfn;
+} __pfn_t;
+
 #endif /* __ASSEMBLY__ */

 #endif
diff --git a/include/linux/bio.h b/include/linux/bio.h
index f6a2427980f3..f35c90d5fd4d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -63,8 +63,8 @@
  */
 #define __bvec_iter_bvec(bvec, iter)	(&(bvec)[(iter).bi_idx])

-#define bvec_iter_page(bvec, iter)	\
-	(__bvec_iter_bvec((bvec), (iter))->bv_page)
+#define bvec_iter_pfn(bvec, iter)	\
+	(__bvec_iter_bvec((bvec), (iter))->bv_pfn)

 #define bvec_iter_len(bvec, iter)	\
 	min((iter).bi_size,		\
@@ -75,7 +75,7 @@

 #define bvec_iter_bvec(bvec, iter)	\
 ((struct bio_vec) {			\
-	.bv_page	= bvec_iter_page((bvec), (iter)),	\
+	.bv_pfn		= bvec_iter_pfn((bvec), (iter)),	\
 	.bv_len		= bvec_iter_len((bvec), (iter)),	\
 	.bv_offset	= bvec_iter_offset((bvec), (iter)),	\
 })
@@ -83,14 +83,16 @@
 #define bio_iter_iovec(bio, iter)	\
 	bvec_iter_bvec((bio)->bi_io_vec, (iter))

-#define bio_iter_page(bio, iter)	\
-	bvec_iter_page((bio)->bi_io_vec, (iter))
+#define bio_iter_pfn(bio, iter)		\
+	bvec_iter_pfn((bio)->bi_io_vec, (iter))
 #define bio_iter_len(bio, iter)		\
 	bvec_iter_len((bio)->bi_io_vec, (iter))
 #define bio_iter_offset(bio, iter)	\
 	bvec_iter_offset((bio)->bi_io_vec, (iter))

-#define bio_page(bio)		bio_iter_page((bio), (bio)->bi_iter)
+#define bio_page(bio)		\
+	pfn_to_page((bio_iter_pfn((bio), (bio)->bi_iter)).pfn)
+#define bio_pfn(bio)		bio_iter_pfn((bio), (bio)->bi_iter)
 #define bio_offset(bio)		bio_iter_offset((bio), (bio)->bi_iter)
 #define bio_iovec(bio)		bio_iter_iovec((bio), (bio)->bi_iter)
@@ -150,8 +152,8 @@ static inline void *bio_data(struct bio *bio)
 /*
  * will die
  */
-#define bio_to_phys(bio)	(page_to_phys(bio_page((bio))) + (unsigned long) bio_offset((bio)))
-#define bvec_to_phys(bv)	(page_to_phys((bv)->bv_page) + (unsigned long) (bv)->bv_offset)
+#define bio_to_phys(bio)	(pfn_to_phys(bio_pfn((bio))) + (unsigned long) bio_offset((bio)))
+#define bvec_to_phys(bv)	(pfn_to_phys((bv)->bv_pfn) + (unsigned long) (bv)->bv_offset)

 /*
  * queues that have highmem support enabled may still need to revert to
@@ -160,7 +162,7 @@ static inline void *bio_data(struct bio *bio)
  * I/O completely on that queue (see ide-dma for example)
  */
 #define __bio_kmap_atomic(bio, iter)	\
-	(kmap_atomic(bio_iter_iovec((bio), (iter)).bv_page) +	\
+	(kmap_atomic(bio_iter_iovec((bio), bvec_page(iter)) +	\
	 bio_iter_iovec((bio), (iter)).bv_offset)

 #define __bio_kunmap_atomic(addr)	kunmap_atomic(addr)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 3193a0b7051f..7f63fa3e4fda 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -5,7 +5,9 @@
 #ifndef __LINUX_BLK_TYPES_H
 #define __LINUX_BLK_TYPES_H

+#include <linux/scatterlist.h>
 #include <linux/types.h>
+#include <asm/pgtable.h>

 struct bio_set;
 struct bio;
@@ -21,19 +23,25 @@ typedef void (bio_destructor_t) (struct bio *);
  *	was unsigned short, but we might as well be ready for > 64kB I/O pages
  */
 struct bio_vec {
-	struct page	*bv_page;
+	__pfn_t		bv_pfn;
 	unsigned int	bv_len;
 	unsigned int	bv_offset;
 };

+#define BIO_VEC_INIT(name) { .bv_pfn = { .pfn = 0 }, .bv_len = 0, \
+	.bv_offset = 0 }
+
+#define BIO_VEC(name) \
+	struct bio_vec name = BIO_VEC_INIT(name)
+
 static inline struct page *bvec_page(const struct bio_vec *bvec)
 {
-	return bvec->bv_page;
+	return pfn_to_page(bvec->bv_pfn.pfn);
 }

 static inline void bvec_set_page(struct bio_vec *bvec, struct page *page)
 {
-	bvec->bv_page = page;
+	bvec->bv_pfn = page_to_pfn_typed(page);
 }

 #ifdef CONFIG_BLOCK
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index ed8f9e70df9b..5a15b1ce3c9e 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -9,6 +9,22 @@
 #include <asm/scatterlist.h>
 #include <asm/io.h>

+#ifndef __pfn_to_phys
+#define __pfn_to_phys(pfn)	((dma_addr_t)(pfn) << PAGE_SHIFT)
+#endif
+
+static inline dma_addr_t pfn_to_phys(__pfn_t pfn)
+{
+	return __pfn_to_phys(pfn.pfn);
+}
+
+static inline __pfn_t page_to_pfn_typed(struct page *page)
+{
+	__pfn_t pfn = { .pfn = page_to_pfn(page) };
+
+	return pfn;
+}
+
 struct sg_table {
 	struct scatterlist *sgl;	/* the list */
 	unsigned int nents;		/* number of mapped entries */
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index e7a018eaf3a2..dc3a94ce3b45 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -1,6 +1,7 @@
 #ifndef __LINUX_SWIOTLB_H
 #define __LINUX_SWIOTLB_H

+#include <linux/dma-direction.h>
 #include <linux/types.h>

 struct device;
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 827732047da1..be9a7c5b8703 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -61,7 +61,7 @@
 		__p = i->bvec;					\
 		__v.bv_len = min_t(size_t, n, __p->bv_len - skip);	\
 		if (likely(__v.bv_len)) {			\
-			__v.bv_page = __p->bv_page;		\
+			__v.bv_pfn = __p->bv_pfn;		\
 			__v.bv_offset = __p->bv_offset + skip;	\
 			(void)(STEP);				\
 			skip += __v.bv_len;			\
@@ -72,7 +72,7 @@
 			__v.bv_len = min_t(size_t, n, __p->bv_len);	\
 			if (unlikely(!__v.bv_len))		\
 				continue;			\
-			__v.bv_page = __p->bv_page;		\
+			__v.bv_pfn = __p->bv_pfn;		\
 			__v.bv_offset = __p->bv_offset;		\
 			(void)(STEP);				\
 			skip = __v.bv_len;			\
@@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
			       v.iov_len),
-		memcpy_to_page(v.bv_page, v.bv_offset,
+		memcpy_to_page(bvec_page(&v), v.bv_offset,
			       (from += v.bv_len) - v.bv_len, v.bv_len),
 		memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len)
 	)
@@ -390,7 +390,7 @@ size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user((to += v.iov_len) - v.iov_len, v.iov_base,
				 v.iov_len),
-		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+		memcpy_from_page((to += v.bv_len) - v.bv_len, bvec_page(&v),
				 v.bv_offset, v.bv_len),
 		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
 	)
@@ -411,7 +411,7 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user_nocache((to += v.iov_len) - v.iov_len,
					 v.iov_base, v.iov_len),
-		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+		memcpy_from_page((to += v.bv_len) - v.bv_len, bvec_page(&v),
				 v.bv_offset, v.bv_len),
 		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
 	)
@@ -456,7 +456,7 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		__clear_user(v.iov_base, v.iov_len),
-		memzero_page(v.bv_page, v.bv_offset, v.bv_len),
+		memzero_page(bvec_page(&v), v.bv_offset, v.bv_len),
 		memset(v.iov_base, 0, v.iov_len)
 	)
@@ -471,7 +471,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 	iterate_all_kinds(i, bytes, v,
 		__copy_from_user_inatomic((p += v.iov_len) - v.iov_len,
					  v.iov_base, v.iov_len),
-		memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
+		memcpy_from_page((p += v.bv_len) - v.bv_len, bvec_page(&v),
				 v.bv_offset, v.bv_len),
 		memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
 	)
@@ -570,7 +570,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 	0;}),({
 		/* can't be more than PAGE_SIZE */
 		*start = v.bv_offset;
-		get_page(*pages = v.bv_page);
+		get_page(*pages = bvec_page(&v));
 		return v.bv_len;
 	}),({
 		return -EFAULT;
@@ -624,7 +624,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		*pages = p = get_pages_array(1);
 		if (!p)
 			return -ENOMEM;
-		get_page(*p = v.bv_page);
+		get_page(*p = bvec_page(&v));
 		return v.bv_len;
 	}),({
 		return -EFAULT;
@@ -658,7 +658,7 @@ size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 		}
 		err ? v.iov_len : 0;
 	}), ({
-		char *p = kmap_atomic(v.bv_page);
+		char *p = kmap_atomic(bvec_page(&v));
 		next = csum_partial_copy_nocheck(p + v.bv_offset,
						 (to += v.bv_len) - v.bv_len,
						 v.bv_len, 0);
@@ -702,7 +702,7 @@ size_t csum_and_copy_to_iter(void *addr, size_t bytes, __wsum *csum,
 		}
 		err ? v.iov_len : 0;
 	}), ({
-		char *p = kmap_atomic(v.bv_page);
+		char *p = kmap_atomic(bvec_page(&v));
 		next = csum_partial_copy_nocheck((from += v.bv_len) - v.bv_len,
						 p + v.bv_offset,
						 v.bv_len, 0);
diff --git a/mm/page_io.c b/mm/page_io.c
index c540dbc6a9e5..b7c8d2c3f8f9 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -265,7 +265,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	struct file *swap_file = sis->swap_file;
 	struct address_space *mapping = swap_file->f_mapping;
 	struct bio_vec bv = {
-		.bv_page = page,
+		.bv_pfn = page_to_pfn_typed(page),
 		.bv_len = PAGE_SIZE,
 		.bv_offset = 0
 	};

^ permalink raw reply related	[flat|nested] 59+ messages in thread
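To make the intent of the new helpers concrete, a minimal usage sketch
(illustrative only, not part of the posted series; it assumes the __pfn_t,
page_to_pfn_typed(), pfn_to_phys(), and bvec_page() definitions from the
patch above):

	struct page *page = alloc_page(GFP_KERNEL);
	struct bio_vec bv = {
		.bv_pfn = page_to_pfn_typed(page),	/* page-backed pfn */
		.bv_len = PAGE_SIZE,
		.bv_offset = 0,
	};

	/* existing users keep working: recover the struct page ... */
	void *kaddr = page_address(bvec_page(&bv)) + bv.bv_offset;

	/* ... while pfn-only users can go straight to a physical address */
	dma_addr_t phys = pfn_to_phys(bv.bv_pfn) + bv.bv_offset;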
* Re: [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
2015-03-16 20:25 ` [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn Dan Williams
2015-03-16 20:25   ` Dan Williams
@ 2015-03-16 23:05   ` Al Viro
2015-03-16 23:05     ` Al Viro
2015-03-17 13:02     ` Matthew Wilcox
1 sibling, 2 replies; 59+ messages in thread
From: Al Viro @ 2015-03-16 23:05 UTC (permalink / raw)
To: Dan Williams
Cc: linux-kernel, linux-arch, axboe, riel, linux-nvdimm, Dave Hansen,
    linux-raid, mgorman, hch, linux-fsdevel, Matthew Wilcox

> diff --git a/mm/iov_iter.c b/mm/iov_iter.c
> index 827732047da1..be9a7c5b8703 100644
> --- a/mm/iov_iter.c
> +++ b/mm/iov_iter.c
> @@ -61,7 +61,7 @@
>  		__p = i->bvec;					\
>  		__v.bv_len = min_t(size_t, n, __p->bv_len - skip);	\
>  		if (likely(__v.bv_len)) {			\
> -			__v.bv_page = __p->bv_page;		\
> +			__v.bv_pfn = __p->bv_pfn;		\
>  			__v.bv_offset = __p->bv_offset + skip;	\
>  			(void)(STEP);				\
>  			skip += __v.bv_len;			\
> @@ -72,7 +72,7 @@
>  			__v.bv_len = min_t(size_t, n, __p->bv_len);	\
>  			if (unlikely(!__v.bv_len))		\
>  				continue;			\
> -			__v.bv_page = __p->bv_page;		\
> +			__v.bv_pfn = __p->bv_pfn;		\
>  			__v.bv_offset = __p->bv_offset;		\
>  			(void)(STEP);				\
>  			skip = __v.bv_len;			\
> @@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
>  	iterate_and_advance(i, bytes, v,
>  		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
>  			       v.iov_len),
> -		memcpy_to_page(v.bv_page, v.bv_offset,
> +		memcpy_to_page(bvec_page(&v), v.bv_offset,

How had memcpy_to_page(NULL, ...) worked for you?

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
2015-03-16 23:05 ` Al Viro
@ 2015-03-17 13:02 ` Matthew Wilcox
2015-03-17 13:02   ` Matthew Wilcox
2015-03-17 15:53   ` Dan Williams
1 sibling, 2 replies; 59+ messages in thread
From: Matthew Wilcox @ 2015-03-17 13:02 UTC (permalink / raw)
To: Al Viro
Cc: Dan Williams, linux-kernel, linux-arch, axboe, riel, linux-nvdimm,
    Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel

On Mon, Mar 16, 2015 at 11:05:33PM +0000, Al Viro wrote:
> > diff --git a/mm/iov_iter.c b/mm/iov_iter.c
> > index 827732047da1..be9a7c5b8703 100644
> > --- a/mm/iov_iter.c
> > +++ b/mm/iov_iter.c
> > @@ -61,7 +61,7 @@
> >  		__p = i->bvec;					\
> >  		__v.bv_len = min_t(size_t, n, __p->bv_len - skip);	\
> >  		if (likely(__v.bv_len)) {			\
> > -			__v.bv_page = __p->bv_page;		\
> > +			__v.bv_pfn = __p->bv_pfn;		\
> >  			__v.bv_offset = __p->bv_offset + skip;	\
> >  			(void)(STEP);				\
> >  			skip += __v.bv_len;			\
> > @@ -72,7 +72,7 @@
> >  			__v.bv_len = min_t(size_t, n, __p->bv_len);	\
> >  			if (unlikely(!__v.bv_len))		\
> >  				continue;			\
> > -			__v.bv_page = __p->bv_page;		\
> > +			__v.bv_pfn = __p->bv_pfn;		\
> >  			__v.bv_offset = __p->bv_offset;		\
> >  			(void)(STEP);				\
> >  			skip = __v.bv_len;			\
> > @@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
> >  	iterate_and_advance(i, bytes, v,
> >  		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
> >  			       v.iov_len),
> > -		memcpy_to_page(v.bv_page, v.bv_offset,
> > +		memcpy_to_page(bvec_page(&v), v.bv_offset,
>
> How had memcpy_to_page(NULL, ...) worked for you?

 static inline struct page *bvec_page(const struct bio_vec *bvec)
 {
-	return bvec->bv_page;
+	return pfn_to_page(bvec->bv_pfn.pfn);
 }

(yes, more work to be done here to make copy_to_iter work to a bvec that
is actually targetting a page-less address, but these are RFC patches
showing the direction we're heading in while keeping current code working)

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
2015-03-17 13:02 ` Matthew Wilcox
2015-03-17 13:02   ` Matthew Wilcox
@ 2015-03-17 15:53   ` Dan Williams
2015-03-17 15:53     ` Dan Williams
1 sibling, 1 reply; 59+ messages in thread
From: Dan Williams @ 2015-03-17 15:53 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Al Viro, linux-kernel@vger.kernel.org, linux-arch, Jens Axboe, riel,
    linux-nvdimm, Dave Hansen, linux-raid, mgorman, Christoph Hellwig,
    linux-fsdevel

On Tue, Mar 17, 2015 at 6:02 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Mon, Mar 16, 2015 at 11:05:33PM +0000, Al Viro wrote:
>> > diff --git a/mm/iov_iter.c b/mm/iov_iter.c
>> > index 827732047da1..be9a7c5b8703 100644
>> > --- a/mm/iov_iter.c
>> > +++ b/mm/iov_iter.c
>> > @@ -61,7 +61,7 @@
>> >  		__p = i->bvec;					\
>> >  		__v.bv_len = min_t(size_t, n, __p->bv_len - skip);	\
>> >  		if (likely(__v.bv_len)) {			\
>> > -			__v.bv_page = __p->bv_page;		\
>> > +			__v.bv_pfn = __p->bv_pfn;		\
>> >  			__v.bv_offset = __p->bv_offset + skip;	\
>> >  			(void)(STEP);				\
>> >  			skip += __v.bv_len;			\
>> > @@ -72,7 +72,7 @@
>> >  			__v.bv_len = min_t(size_t, n, __p->bv_len);	\
>> >  			if (unlikely(!__v.bv_len))		\
>> >  				continue;			\
>> > -			__v.bv_page = __p->bv_page;		\
>> > +			__v.bv_pfn = __p->bv_pfn;		\
>> >  			__v.bv_offset = __p->bv_offset;		\
>> >  			(void)(STEP);				\
>> >  			skip = __v.bv_len;			\
>> > @@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
>> >  	iterate_and_advance(i, bytes, v,
>> >  		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
>> >  			       v.iov_len),
>> > -		memcpy_to_page(v.bv_page, v.bv_offset,
>> > +		memcpy_to_page(bvec_page(&v), v.bv_offset,
>>
>> How had memcpy_to_page(NULL, ...) worked for you?
>
>  static inline struct page *bvec_page(const struct bio_vec *bvec)
>  {
> -	return bvec->bv_page;
> +	return pfn_to_page(bvec->bv_pfn.pfn);
>  }
>
> (yes, more work to be done here to make copy_to_iter work to a bvec that
> is actually targetting a page-less address, but these are RFC patches
> showing the direction we're heading in while keeping current code working)
>

Right, the next item to tackle is kmap() and kmap_atomic() before we
can start converting paths to be "native" pfn-only.

^ permalink raw reply	[flat|nested] 59+ messages in thread
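As a rough illustration of that direction (purely hypothetical, no such
helper exists in this series), a pfn-aware mapping wrapper could keep the
existing struct page path for page-backed pfns and leave the page-less
case to be filled in later:

	/* hypothetical sketch: map by __pfn_t instead of struct page */
	static void *kmap_atomic_pfn_t(__pfn_t pfn)
	{
		if (pfn_valid(pfn.pfn))
			/* page-backed: reuse the normal highmem machinery */
			return kmap_atomic(pfn_to_page(pfn.pfn));

		/* page-less (e.g. pmem): needs a driver-provided mapping */
		return NULL;
	}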
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
2015-03-16 20:25 ` Dan Williams
2015-03-16 20:25 ` [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn Dan Williams
@ 2015-03-18 10:47 ` Boaz Harrosh
2015-03-18 10:47   ` Boaz Harrosh
  ` (2 more replies)
2015-03-18 20:26 ` Andrew Morton
3 siblings, 3 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-18 10:47 UTC (permalink / raw)
To: Dan Williams, linux-kernel, axboe, hch, Al Viro, Andrew Morton,
    Linus Torvalds
Cc: linux-arch, riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman,
    linux-fsdevel, Matthew Wilcox

On 03/16/2015 10:25 PM, Dan Williams wrote:
> Avoid the impending disaster of requiring struct page coverage for what
> is expected to be ever increasing capacities of persistent memory.

If you are saying "disaster", then we need to believe you. Or is there a
scientific proof for this? Actually, what you are proposing below is the
"real disaster". (I do hope it is not impending.)

> In conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the
> recently concluded Linux Storage Summit it became clear that struct page
> is not required in many places; it was simply convenient to re-use.
>
> Introduce helpers and infrastructure to remove struct page usage where
> it is not necessary. One use case for these changes is to implement a
> write-back-cache in persistent memory for software-RAID. Another use
> case for the scatterlist changes is RDMA to a pfn-range.
>
> This compiles and boots, but 0day-kbuild-robot coverage is needed before
> this set exits "RFC". Obviously, the Coccinelle script needs to be
> re-run on the block updates for kernel.next. As-is, this only includes
> the resulting auto-generated patch against 4.0-rc3.
>
> ---
>
> Dan Williams (6):
>       block: add helpers for accessing a bio_vec page
>       block: convert bio_vec.bv_page to bv_pfn
>       dma-mapping: allow archs to optionally specify a ->map_pfn() operation
>       scatterlist: use sg_phys()
>       x86: support dma_map_pfn()
>       block: base support for pfn i/o
>
> Matthew Wilcox (1):
>       scatterlist: support "page-less" (__pfn_t only) entries
>
>
>  arch/Kconfig | 3 +
>  arch/arm/mm/dma-mapping.c | 2 -
>  arch/microblaze/kernel/dma.c | 2 -
>  arch/powerpc/sysdev/axonram.c | 2 -
>  arch/x86/Kconfig | 12 +++
>  arch/x86/kernel/amd_gart_64.c | 22 ++++--
>  arch/x86/kernel/pci-nommu.c | 22 ++++--
>  arch/x86/kernel/pci-swiotlb.c | 4 +
>  arch/x86/pci/sta2x11-fixup.c | 4 +
>  arch/x86/xen/pci-swiotlb-xen.c | 4 +
>  block/bio-integrity.c | 8 +-
>  block/bio.c | 83 +++++++++++++++------
>  block/blk-core.c | 9 ++
>  block/blk-integrity.c | 7 +-
>  block/blk-lib.c | 2 -
>  block/blk-merge.c | 15 ++--
>  block/bounce.c | 26 +++----
>  drivers/block/aoe/aoecmd.c | 8 +-
>  drivers/block/brd.c | 2 -
>  drivers/block/drbd/drbd_bitmap.c | 5 +
>  drivers/block/drbd/drbd_main.c | 4 +
>  drivers/block/drbd/drbd_receiver.c | 4 +
>  drivers/block/drbd/drbd_worker.c | 3 +
>  drivers/block/floppy.c | 6 +-
>  drivers/block/loop.c | 8 +-
>  drivers/block/nbd.c | 8 +-
>  drivers/block/nvme-core.c | 2 -
>  drivers/block/pktcdvd.c | 11 ++-
>  drivers/block/ps3disk.c | 2 -
>  drivers/block/ps3vram.c | 2 -
>  drivers/block/rbd.c | 2 -
>  drivers/block/rsxx/dma.c | 3 +
>  drivers/block/umem.c | 2 -
>  drivers/block/zram/zram_drv.c | 10 +--
>  drivers/dma/ste_dma40.c | 5 -
>  drivers/iommu/amd_iommu.c | 21 ++++-
>  drivers/iommu/intel-iommu.c | 26 +++++--
>  drivers/iommu/iommu.c | 2 -
>  drivers/md/bcache/btree.c | 4 +
>  drivers/md/bcache/debug.c | 6 +-
>  drivers/md/bcache/movinggc.c | 2 -
>  drivers/md/bcache/request.c | 6 +-
>  drivers/md/bcache/super.c | 10 +--
>  drivers/md/bcache/util.c | 5 +
>  drivers/md/bcache/writeback.c | 2 -
>  drivers/md/dm-crypt.c | 12 ++-
>  drivers/md/dm-io.c | 2 -
>  drivers/md/dm-verity.c | 2 -
>  drivers/md/raid1.c | 50 +++++++------
>  drivers/md/raid10.c | 38 +++++-----
>  drivers/md/raid5.c | 6 +-
>  drivers/mmc/card/queue.c | 4 +
>  drivers/s390/block/dasd_diag.c | 2 -
>  drivers/s390/block/dasd_eckd.c | 14 ++--
>  drivers/s390/block/dasd_fba.c | 6 +-
>  drivers/s390/block/dcssblk.c | 2 -
>  drivers/s390/block/scm_blk.c | 2 -
>  drivers/s390/block/scm_blk_cluster.c | 2 -
>  drivers/s390/block/xpram.c | 2 -
>  drivers/scsi/mpt2sas/mpt2sas_transport.c | 6 +-
>  drivers/scsi/mpt3sas/mpt3sas_transport.c | 6 +-
>  drivers/scsi/sd_dif.c | 4 +
>  drivers/staging/android/ion/ion_chunk_heap.c | 4 +
>  drivers/staging/lustre/lustre/llite/lloop.c | 2 -
>  drivers/xen/biomerge.c | 4 +
>  drivers/xen/swiotlb-xen.c | 29 +++++--
>  fs/btrfs/check-integrity.c | 6 +-
>  fs/btrfs/compression.c | 12 ++-
>  fs/btrfs/disk-io.c | 4 +
>  fs/btrfs/extent_io.c | 8 +-
>  fs/btrfs/file-item.c | 8 +-
>  fs/btrfs/inode.c | 18 +++--
>  fs/btrfs/raid56.c | 4 +
>  fs/btrfs/volumes.c | 2 -
>  fs/buffer.c | 4 +
>  fs/direct-io.c | 2 -
>  fs/exofs/ore.c | 4 +
>  fs/exofs/ore_raid.c | 2 -
>  fs/ext4/page-io.c | 2 -
>  fs/f2fs/data.c | 4 +
>  fs/f2fs/segment.c | 2 -
>  fs/gfs2/lops.c | 4 +
>  fs/jfs/jfs_logmgr.c | 4 +
>  fs/logfs/dev_bdev.c | 10 +--
>  fs/mpage.c | 2 -
>  fs/splice.c | 2 -
>  include/asm-generic/dma-mapping-common.h | 30 ++++++++
>  include/asm-generic/memory_model.h | 4 +
>  include/asm-generic/scatterlist.h | 6 ++
>  include/crypto/scatterwalk.h | 10 +++
>  include/linux/bio.h | 24 +++---
>  include/linux/blk_types.h | 21 +++++
>  include/linux/blkdev.h | 2 +
>  include/linux/dma-debug.h | 23 +++++-
>  include/linux/dma-mapping.h | 8 ++
>  include/linux/scatterlist.h | 101 ++++++++++++++++++++++++--
>  include/linux/swiotlb.h | 5 +
>  kernel/power/block_io.c | 2 -
>  lib/dma-debug.c | 4 +
>  lib/swiotlb.c | 20 ++++-
>  mm/iov_iter.c | 22 +++---
>  mm/page_io.c | 8 +-
>  net/ceph/messenger.c | 2 -

God! Look at this endless list of files, and it is only the very
beginning. It does not even work, and it touches only 10% of what will
need to be touched for this to work, and very very marginally at that.
There will always be "another subsystem" that will not work. For
example NUMA: how will you do NUMA-aware pmem? And this is just a simple
example. (I'm saying NUMA because our tests show a huge drop in
performance if you do not do NUMA-aware allocation.)

Al, Jens, Christoph, Andrew: think of the immediate stability nightmare
and the long-term torture of maintaining two code paths. Two sets of
tests, and the combinatorial explosion of tests. I'm not the one afraid
of hard work, if it were for a good cause, but for what? Really, for
what? The block layer, and RDMA, and networking, and splice, and
whatever the heck anyone wants to imagine to do with pmem, already work
perfectly stable. Right now!

We have set up an RDMA pmem target without a single line of extra code,
and the RDMA client was trivial to write. We have been sending down
block layer BIOs from pmem from day one, and even iSCSI, NFS, and any
kind of networking directly from pmem, for almost a year now. All it
takes is two simple patches to mm that create a pages-section for pmem.

The kernel docs do say that a page is a construct that keeps track of
the state of a physical page in memory. A memory-mapped pmem page is
perfectly that, and it has state that needs tracking just the same. Say
that converted block layer of yours now happens to feed an iSCSI target
and goes through the network stack: it starts to need ref-counting,
flags ... it has state.

Matthew, Dan: I don't get it. Don't you guys at Intel have anything to
do? Why change half the kernel? For what? To achieve what? All your
wildest dreams about pmem are right here already. What is it that you
guys want to do with this code that we cannot already do? And I can show
you tons of things you cannot do with this code that we can already do.
With two simple patches.

If it is stability that you are concerned with, "what if a pmem page
gets to the wrong mm subsystem?", there are a couple of small hardening
patches, and an extra page-flag allocated, that can make the whole thing
foolproof. Though up until now I have not encountered any problem.

> 103 files changed, 658 insertions(+), 335 deletions(-)

Please look: this is only the beginning, and it does not even work. Let
us come back to our senses. As true hackers, let's do the minimum effort
to achieve new heights. All it really takes to do all this is two little
patches.

Cheers
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh 2015-03-18 10:47 ` Boaz Harrosh @ 2015-03-18 13:06 ` Matthew Wilcox 2015-03-18 13:06 ` Matthew Wilcox 2015-03-18 14:38 ` [Linux-nvdimm] " Boaz Harrosh 2015-03-18 15:35 ` Dan Williams 2 siblings, 2 replies; 59+ messages in thread From: Matthew Wilcox @ 2015-03-18 13:06 UTC (permalink / raw) To: Boaz Harrosh Cc: Dan Williams, linux-kernel, axboe, hch, Al Viro, Andrew Morton, Linus Torvalds, linux-arch, riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman, linux-fsdevel On Wed, Mar 18, 2015 at 12:47:21PM +0200, Boaz Harrosh wrote: > God! Look at this endless list of files and it is only the very beginning. > It does not even work and touches only 10% of what will need to be touched > for this to work, and very very marginally at that. There will always be > "another subsystem" that will not work. For example NUMA how will you do > NUMA aware pmem? and this is just a simple example. (I'm saying NUMA > because our tests show a huge drop in performance if you do not do > NUMA aware allocation) You're very entertaining, but please, tone down your emails and stick to facts. The BIOS presents the persistent memory as one table entry per NUMA node, so you get one block device per NUMA node. There's no mixing of memory from different NUMA nodes within a single filesystem, unless you have a filesystem that uses multiple block devices. > I'm not the one afraid of hard work, if it was for a good cause, but for what? > really for what? The block layer, and RDMA, and networking, and spline, and what > ever the heck any one wants to imagine to do with pmem, already works perfectly > stable. right now! The overhead. Allocating a struct page for every 4k page in a 400GB DIMM (the current capacity available from one NV-DIMM vendor) occupies 6.4GB. That's an unacceptable amount of overhead. ^ permalink raw reply [flat|nested] 59+ messages in thread
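For reference, the arithmetic behind that 6.4GB figure, assuming the usual
64-byte struct page on x86-64:

        400 GB / 4 KB per page       = ~100 million pages
        100 million pages * 64 bytes = ~6.4 GB, i.e. 64/4096 = ~1.6% of the DIMM

The same ratio is behind the "1.5% overhead" number that comes up later in
the thread.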
* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-18 13:06 ` Matthew Wilcox 2015-03-18 13:06 ` Matthew Wilcox @ 2015-03-18 14:38 ` Boaz Harrosh 2015-03-18 14:38 ` Boaz Harrosh 2015-03-20 15:56 ` Rik van Riel 1 sibling, 2 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-18 14:38 UTC (permalink / raw)
To: Matthew Wilcox, Boaz Harrosh
Cc: axboe, linux-arch, riel, linux-raid, linux-nvdimm, Dave Hansen, linux-kernel, hch, Linus Torvalds, Al Viro, linux-fsdevel, Andrew Morton, mgorman

On 03/18/2015 03:06 PM, Matthew Wilcox wrote:
> On Wed, Mar 18, 2015 at 12:47:21PM +0200, Boaz Harrosh wrote:
>> God! Look at this endless list of files and it is only the very beginning.
>> It does not even work and touches only 10% of what will need to be touched
>> for this to work, and very very marginally at that. There will always be
>> "another subsystem" that will not work. For example NUMA how will you do
>> NUMA aware pmem? and this is just a simple example. (I'm saying NUMA
>> because our tests show a huge drop in performance if you do not do
>> NUMA aware allocation)
>
> You're very entertaining, but please, tone down your emails and stick
> to facts. The BIOS presents the persistent memory as one table entry
> per NUMA node, so you get one block device per NUMA node. There's no
> mixing of memory from different NUMA nodes within a single filesystem,
> unless you have a filesystem that uses multiple block devices.
>

Not with current BIOS: if they are contiguous then they are presented as one
range (DDR3 BIOS). But I agree it is a bug, and in our configuration we
separate them into different pmem devices.

Yes, I meant a "filesystem that uses multiple block devices".

>> I'm not the one afraid of hard work, if it was for a good cause, but for what?
>> really for what? The block layer, and RDMA, and networking, and spline, and what
>> ever the heck any one wants to imagine to do with pmem, already works perfectly
>> stable. right now!
>
> The overhead. Allocating a struct page for every 4k page in a 400GB DIMM
> (the current capacity available from one NV-DIMM vendor) occupies 6.4GB.
> That's an unacceptable amount of overhead.
>

So let's fix the stacks to work nicely with 2M pages. That said, we can
allocate the struct pages from pmem itself if we need to. The fact remains
that we need state down the different stacks, and this is the current design
overall.

I hate that you introduce a double design, pfn-or-page, and the combinations
of them. It is too much ugliness for my guts. I would like a unified design
that runs all over the stack. Already we have too much duplication for my
taste, and I would love to see more unification and not more splitting.

But most important for me: do we have to sacrifice the short term for the
long term? Such a massive change as you are proposing will take years, for a
theoretical 400GB DIMM. What about the 4G DIMMs now in people's hands, must
they wait? (Though I still do not agree with your design.)

I love the SPARSEMEM model of the "section", with the page being its own
identity relative to the virtual address & PFN of the section. We could think
of a much smaller page struct that carries only a ref-count and flags, and a
bigger page type for regular use: separate out the low, common part of the
page, lay down clear rules about its use, and have a high part that is
per-user. But let us think of a unified design throughout.
(Most members of struct page are accessed through wrappers, so it would be
relatively easy to split.)

And let us not sacrifice the now for the "far tomorrow"; we should be able to
do this incrementally, wasting more space now and saving later.

[We could even invent a size-less page: you know how we encode the section ID
directly into the 64-bit address of the page, so we could have a flag at the
section that says "this is a zero-size-page section" and the needed info is
stored in the section object. But I still think you will need state per page,
and that we do need a minimal size.]

[BTW: the only 400GB DIMM I know of is real flash, and not directly mapped to
the CPU. OK, maybe read-only, but the erase/write makes it
logical-to-physical managed and not directly accessed.]

And a personal note: I mean only to entertain. If anyone feels I "toned up",
please forgive me, I meant no such thing. As a rule, if I come across
strongly then please just laugh and don't take me seriously. I only mean
scientific soundness.

Thanks
Boaz

^ permalink raw reply [flat|nested] 59+ messages in thread
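A rough illustration of the split Boaz floats here, a small common "low part"
of the page next to a fatter per-user part. This is purely hypothetical and
not code from any posted patch:

        /* Hypothetical minimal common part of a page descriptor. */
        struct page_common {
                unsigned long   flags;          /* section/node/zone bits plus state flags */
                atomic_t        _count;         /* reference count */
        };

Everything else (mapping, lru, private, and so on) would live in a larger
per-user structure that only the subsystems needing it would allocate.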
* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-18 14:38 ` [Linux-nvdimm] " Boaz Harrosh 2015-03-18 14:38 ` Boaz Harrosh @ 2015-03-20 15:56 ` Rik van Riel 2015-03-20 15:56 ` Rik van Riel 2015-03-22 11:53 ` Boaz Harrosh 1 sibling, 2 replies; 59+ messages in thread From: Rik van Riel @ 2015-03-20 15:56 UTC (permalink / raw) To: Boaz Harrosh, Matthew Wilcox, Boaz Harrosh Cc: axboe, linux-arch, linux-raid, linux-nvdimm, Dave Hansen, linux-kernel, hch, Linus Torvalds, Al Viro, linux-fsdevel, Andrew Morton, mgorman On 03/18/2015 10:38 AM, Boaz Harrosh wrote: > On 03/18/2015 03:06 PM, Matthew Wilcox wrote: >>> I'm not the one afraid of hard work, if it was for a good cause, but for what? >>> really for what? The block layer, and RDMA, and networking, and spline, and what >>> ever the heck any one wants to imagine to do with pmem, already works perfectly >>> stable. right now! >> >> The overhead. Allocating a struct page for every 4k page in a 400GB DIMM >> (the current capacity available from one NV-DIMM vendor) occupies 6.4GB. >> That's an unacceptable amount of overhead. >> > > So lets fix the stacks to work nice with 2M pages. That said we can > allocate the struct page also from pmem if we need to. The fact remains > that we need state down the different stacks and this is the current > design over all. Fixing the stack to work with 2M pages will be just as invasive, and just as much work as making it work without a struct page. What state do you need, exactly? The struct page in the VM is mostly used for two things: 1) to get a memory address of the data 2) refcounting, to make sure the page does not go away during an IO operation, copy, etc... Persistent memory cannot be paged out so (2) is not a concern, as long as we ensure the object the page belongs to does not go away. There are no seek times, so moving it around may not be necessary either, making (1) not a concern. The only case where (1) would be a concern is if we wanted to move data in persistent memory around for better NUMA locality. However, persistent memory DIMMs are on their way to being too large to move the memory, anyway - all we can usefully do is detect where programs are accessing memory, and move the programs there. What state do you need that is not already represented? 1.5% overhead isn't a whole lot, but it appears to be unnecessary. If you have a convincing argument as to why we need a struct page, you might want to articulate it in order to convince us. -- All rights reversed ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-20 15:56 ` Rik van Riel 2015-03-20 15:56 ` Rik van Riel @ 2015-03-22 11:53 ` Boaz Harrosh 2015-03-22 11:53 ` Boaz Harrosh 1 sibling, 1 reply; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-22 11:53 UTC (permalink / raw)
To: Rik van Riel, Matthew Wilcox, Boaz Harrosh
Cc: axboe, linux-arch, linux-raid, linux-nvdimm, Dave Hansen, linux-kernel, hch, Linus Torvalds, Al Viro, linux-fsdevel, Andrew Morton, mgorman

On 03/20/2015 05:56 PM, Rik van Riel wrote:
> On 03/18/2015 10:38 AM, Boaz Harrosh wrote:
>> On 03/18/2015 03:06 PM, Matthew Wilcox wrote:
>
>>>> I'm not the one afraid of hard work, if it was for a good cause, but for what?
>>>> really for what? The block layer, and RDMA, and networking, and spline, and what
>>>> ever the heck any one wants to imagine to do with pmem, already works perfectly
>>>> stable. right now!
>>>
>>> The overhead. Allocating a struct page for every 4k page in a 400GB DIMM
>>> (the current capacity available from one NV-DIMM vendor) occupies 6.4GB.
>>> That's an unacceptable amount of overhead.
>>>
>>
>> So lets fix the stacks to work nice with 2M pages. That said we can
>> allocate the struct page also from pmem if we need to. The fact remains
>> that we need state down the different stacks and this is the current
>> design over all.
>
> Fixing the stack to work with 2M pages will be just as invasive,
> and just as much work as making it work without a struct page.
>
> What state do you need, exactly?
>

It is not me that needs state, it is the Kernel. Let me show you what I can
do today that uses state (and pages).

The block layer sends a bio via iSCSI, which in turn goes around and sends it
via the networking stack. Here the page ref is used, as well as all kinds of
page-based management. (This is half the Kernel converted right here.) Same
thing with iSER & RDMA. Same thing to a null target, via the target stack,
maybe via pass-through.

Another big example: in a user-mode application I mmap a portion of pmem,
then use the libvirt API to designate it a named shared-memory object. In the
VM I use the same API to retrieve a pointer to that pmem region and boom, I'm
persistent. (The same can be done between two VMs.)

mmap(pmem), send it to the network, to encryption, direct I/O, RDMA, anything
copyless.

So many subsystems use page_lock, page->lru, and the page ref, and are
written to receive and manage pages. I do not want to be excluded from these
systems, and I would very much hate to rewrite them. The block layer is just
one example.

> The struct page in the VM is mostly used for two things:
> 1) to get a memory address of the data
> 2) refcounting, to make sure the page does not go away
> during an IO operation, copy, etc...
>
> Persistent memory cannot be paged out so (2) is not a concern, as
> long as we ensure the object the page belongs to does not go away.
> There are no seek times, so moving it around may not be necessary
> either, making (1) not a concern.
>

You lost me here, sorry, I'm not sure what you meant. Yes, kmap/kunmap is
moot; I do not see any use for highmem or any 32-bitness with this thing.
Refcounting is used, sure, even with pmem, see above. Actually, relying on
the existence of refcounting can solve some problems we have today at the
pmem management level. (RDMA while truncate)

> The only case where (1) would be a concern is if we wanted to move
> data in persistent memory around for better NUMA locality. However,
> persistent memory DIMMs are on their way to being too large to move
> the memory, anyway - all we can usefully do is detect where programs
> are accessing memory, and move the programs there.
>

So actually I have hands-on experience with this very problem. We have
observed that NUMA kills us. Going through the memory_add_physaddr_to_nid()
loop for every 4k operation was a pain, but caching the node via
page_to_nid() (as part of page->flags on 64-bit) is a very nice optimization;
we do NUMA-aware block allocation and it performs much better. (Never like a
single node, but a magnitude better than without.)

> What state do you need that is not already represented?
>

For most of these subsystems you guys are focused on, it is mostly read-only
state, except the page ref. But nevertheless the page carries added
information describing the pfn, like the nid, mapping->ops, flags, etc. And
it is also a stop-gap for translation: give me a page and I know the pfn and
vaddr; give me a pfn and I know the page; give me a vaddr and I know the
page. So I can move between all these domains.

Now, I am sure that in hindsight we might have devised better structures and
abstractions that could carry all this information in a more abstract and
convenient way throughout the Kernel. But for now this basic object is a
page, and it is passed around like a baton in a relay race, each subsystem
with its own page-based meta-structure. The only real global token is the
page struct.

You are asking what is "not already represented"? I'm saying: exactly, sir,
it is already represented, as a page struct. Anything else is in the far, far
future (if at all).

> 1.5% overhead isn't a whole lot, but it appears to be unnecessary.
>

Unnecessary in a theoretical future with every single Kernel subsystem
changed (maybe for the better, I'm not saying otherwise), and it is not even
clear what that future looks like. But for the current code structure it is
very much necessary. For the long present, the choice is not 1.5% with or
without; it is need-to-copy or direct(-1.5%).

[For me it is not even the performance of a memcpy, which exactly halves my
pmem performance; it is the latency and the extra nightmare of locking and
management to keep two copies of the same thing in sync.]

> If you have a convincing argument as to why we need a struct page,
> you might want to articulate it in order to convince us.
>

The most simple convincing argument there is: "existing code". Apparently the
page was needed; maybe we can all think of much better constructs, but for
now this is what the Kernel is based on, and until such time as we better it,
it is there. Since when do we refrain from new technologies and new features
because "a major cleanup is needed"?

I'm all for the great "change every file in the Kernel" ideas some guys have,
but while at it, also change the small patch I added to support pmem. For me
pmem is now, at clients' systems, and I chose direct(-1.5%) over
need-to-copy, because it gives me the performance and, most importantly, the
latency that sells my products.

What is your timetable?

Cheers
Boaz

^ permalink raw reply [flat|nested] 59+ messages in thread
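The NUMA-aware allocation described above might look roughly like the sketch
below. Only page_to_nid(), page_to_phys() and memory_add_physaddr_to_nid()
are existing kernel interfaces; pmem_alloc_block_on_node() is a made-up
placeholder for a filesystem's block allocator:

        /* Sketch only: allocate a pmem block on the node the page lives on. */
        static sector_t pmem_alloc_near(struct page *page)
        {
                /*
                 * The node id is cached in page->flags on 64-bit, so this is
                 * a shift and a mask instead of running
                 * memory_add_physaddr_to_nid(page_to_phys(page)) on every
                 * 4k operation.
                 */
                int nid = page_to_nid(page);

                return pmem_alloc_block_on_node(nid);
        }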
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh 2015-03-18 10:47 ` Boaz Harrosh 2015-03-18 13:06 ` Matthew Wilcox @ 2015-03-18 15:35 ` Dan Williams 2 siblings, 0 replies; 59+ messages in thread From: Dan Williams @ 2015-03-18 15:35 UTC (permalink / raw) To: Boaz Harrosh Cc: linux-kernel@vger.kernel.org, Jens Axboe, Christoph Hellwig, Al Viro, Andrew Morton, Linus Torvalds, linux-arch, riel, linux-nvdimm@lists.01.org, Dave Hansen, linux-raid, mgorman, linux-fsdevel, Matthew Wilcox On Wed, Mar 18, 2015 at 3:47 AM, Boaz Harrosh <openosd@gmail.com> wrote: > On 03/16/2015 10:25 PM, Dan Williams wrote: >> Avoid the impending disaster of requiring struct page coverage for what >> is expected to be ever increasing capacities of persistent memory. > > If you are saying "disaster", than we need to believe you. Or is there > a scientific proof for this. The same Moore's Law based extrapolation that Dave Chinner did to determine that major feature development on XFS may cease in 5 - 7 years. In Dave's words we're looking ahead to "lots and fast". Given the time scale of getting kernel changes out to end users in an enterprise kernel update the "dynamic page struct allocation" approach is already insufficient. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams ` (2 preceding siblings ...) 2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh @ 2015-03-18 20:26 ` Andrew Morton 2015-03-19 13:43 ` Matthew Wilcox 3 siblings, 1 reply; 59+ messages in thread From: Andrew Morton @ 2015-03-18 20:26 UTC (permalink / raw) To: Dan Williams Cc: linux-kernel, linux-arch, axboe, riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel, Matthew Wilcox On Mon, 16 Mar 2015 16:25:25 -0400 Dan Williams <dan.j.williams@intel.com> wrote: > Avoid the impending disaster of requiring struct page coverage for what > is expected to be ever increasing capacities of persistent memory. In > conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the > recently concluded Linux Storage Summit it became clear that struct page > is not required in many places, it was simply convenient to re-use. > > Introduce helpers and infrastructure to remove struct page usage where > it is not necessary. One use case for these changes is to implement a > write-back-cache in persistent memory for software-RAID. Another use > case for the scatterlist changes is RDMA to a pfn-range. Those use-cases sound very thin. If that's all we have then I'd say "find another way of implementing those things without creating pageframes for persistent memory". IOW, please tell us much much much more about the value of this change. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-18 20:26 ` Andrew Morton @ 2015-03-19 13:43 ` Matthew Wilcox 2015-03-19 13:43 ` Matthew Wilcox ` (3 more replies) 0 siblings, 4 replies; 59+ messages in thread
From: Matthew Wilcox @ 2015-03-19 13:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Dan Williams, linux-kernel, linux-arch, axboe, riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel

On Wed, Mar 18, 2015 at 01:26:50PM -0700, Andrew Morton wrote:
> On Mon, 16 Mar 2015 16:25:25 -0400 Dan Williams <dan.j.williams@intel.com> wrote:
>
> > Avoid the impending disaster of requiring struct page coverage for what
> > is expected to be ever increasing capacities of persistent memory. In
> > conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the
> > recently concluded Linux Storage Summit it became clear that struct page
> > is not required in many places, it was simply convenient to re-use.
> >
> > Introduce helpers and infrastructure to remove struct page usage where
> > it is not necessary. One use case for these changes is to implement a
> > write-back-cache in persistent memory for software-RAID. Another use
> > case for the scatterlist changes is RDMA to a pfn-range.
>
> Those use-cases sound very thin. If that's all we have then I'd say
> "find another way of implementing those things without creating
> pageframes for persistent memory".
>
> IOW, please tell us much much much more about the value of this change.

Dan missed "Support O_DIRECT to a mapped DAX file". More generally, if we
want to be able to do any kind of I/O directly to persistent memory,
and I think we do, we need to do one of:

1. Construct struct pages for persistent memory
1a. Permanently
1b. While the pages are under I/O
2. Teach the I/O layers to deal in PFNs instead of struct pages
3. Replace struct page with some other structure that can represent both
   DRAM and PMEM

I'm personally a fan of #3, and I was looking at the scatterlist as
my preferred data structure. I now believe the scatterlist as it is
currently defined isn't sufficient, so we probably end up needing a new
data structure. I think Dan's preferred method of replacing struct
pages with PFNs is actually less intrusive, but doesn't give us as
much advantage (an entirely new data structure would let us move to an
extent-based system at the same time, instead of sticking with an array
of pages). Clearly Boaz prefers 1a, which works well enough for the
8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.

What's your preference? I guess option 0 is "force all I/O to go
through the page cache and then get copied", but that feels like a nasty
performance hit.

^ permalink raw reply [flat|nested] 59+ messages in thread
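To make option 2 a little more concrete: the direction is to key the bio_vec
on a pfn wrapper rather than on a struct page pointer, with a helper for code
that still genuinely needs a page. The definitions below are only a sketch of
that idea, not the posted patches:

        /* A pfn that may or may not have a struct page behind it. */
        typedef struct {
                unsigned long pfn;
        } __pfn_t;

        struct bio_vec {
                __pfn_t         bv_pfn;         /* was: struct page *bv_page */
                unsigned int    bv_len;
                unsigned int    bv_offset;
        };

        /* Helper for code that still needs a page; only valid when the
         * pfn is actually backed by a struct page. */
        static inline struct page *bvec_page(const struct bio_vec *bv)
        {
                return pfn_to_page(bv->bv_pfn.pfn);
        }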
* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-19 13:43 ` Matthew Wilcox 2015-03-19 13:43 ` Matthew Wilcox @ 2015-03-19 15:54 ` Boaz Harrosh 2015-03-19 15:54 ` Boaz Harrosh 2015-03-19 19:59 ` Andrew Morton 2015-03-19 18:17 ` Christoph Hellwig 2015-03-20 16:21 ` Rik van Riel 3 siblings, 2 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-19 15:54 UTC (permalink / raw)
To: Matthew Wilcox, Andrew Morton
Cc: linux-arch, axboe, riel, hch, linux-nvdimm, Dave Hansen, linux-kernel, linux-raid, mgorman, linux-fsdevel

On 03/19/2015 03:43 PM, Matthew Wilcox wrote:
<>
>
> Dan missed "Support O_DIRECT to a mapped DAX file". More generally, if we
> want to be able to do any kind of I/O directly to persistent memory,
> and I think we do, we need to do one of:
>
> 1. Construct struct pages for persistent memory
> 1a. Permanently
> 1b. While the pages are under I/O
> 2. Teach the I/O layers to deal in PFNs instead of struct pages
> 3. Replace struct page with some other structure that can represent both
> DRAM and PMEM
>
> I'm personally a fan of #3, and I was looking at the scatterlist as
> my preferred data structure. I now believe the scatterlist as it is
> currently defined isn't sufficient, so we probably end up needing a new
> data structure. I think Dan's preferred method of replacing struct
> pages with PFNs is actually less instrusive, but doesn't give us as
> much advantage (an entirely new data structure would let us move to an
> extent based system at the same time, instead of sticking with an array
> of pages). Clearly Boaz prefers 1a, which works well enough for the
> 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
>
> What's your preference? I guess option 0 is "force all I/O to go
> through the page cache and then get copied", but that feels like a nasty
> performance hit.

Thanks Matthew, you have summarized it perfectly.

I think #1b might have merit as well. I have a very surgical small "hack"
that we can do for allocating pages on demand before IO. It involves adding a
new MEMORY_MODEL policy that is derived from SPARSEMEM but lets you allocate
individual pages on demand, and a new type of page, say call it
GP_emulated_page. (Tell me if you find this interesting; it is 1/117 the size
of either #2 or #3.)

In any case, please reconsider a configurable #1a for people that do not mind
sacrificing 1.2% of their pmem for real pages. Even at 6G of page structs for
400G of pmem, people would love some of the stuff this gives them today. Just
a few examples: direct_access from within a VM to a host-defined pmem is
trivial, with no extra code, with my two simple #1a patches; RDMA memory
brick targets; network shared-memory filesystems; and so on. The list will
always be bigger than any of #1b, #2, or #3, yes, for people that are willing
to pay the extra cost. In the Kernel it was always about choice and
diversity. And what does it cost us? Nothing. Two small simple patches and a
Kconfig option.

Note that I made it in such a way that if pmem is configured without use of
pages, then the mm code is *not* configured in automatically. We can even add
a runtime option so that even if #1a is enabled, a certain pmem device may
opt not to have pages allocated, choosing at runtime rather than at compile
time.

I think this will only further our cause and let people advance their
research and development with great new ideas about the use of pmem. Then,
once there is great demand for #1a and those large 512G devices come out, we
can go the #1b or #3 route and save the extra 1.2% of memory, once people
have the appetite for it. (And Andrew's question becomes clear.)

Our two ways need not be "either-or"; they can be "have both". I think choice
is a good thing for us here. Even with #3 available, #1a still has merit in
some configurations, and they can coexist perfectly.

Please think about it?

Thanks
Boaz

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-19 15:54 ` [Linux-nvdimm] " Boaz Harrosh 2015-03-19 15:54 ` Boaz Harrosh @ 2015-03-19 19:59 ` Andrew Morton 2015-03-19 20:59 ` Dan Williams ` (2 more replies) 1 sibling, 3 replies; 59+ messages in thread From: Andrew Morton @ 2015-03-19 19:59 UTC (permalink / raw) To: Boaz Harrosh Cc: Matthew Wilcox, linux-arch, axboe, riel, hch, linux-nvdimm, Dave Hansen, linux-kernel, linux-raid, mgorman, linux-fsdevel On Thu, 19 Mar 2015 17:54:15 +0200 Boaz Harrosh <boaz@plexistor.com> wrote: > On 03/19/2015 03:43 PM, Matthew Wilcox wrote: > <> > > > > Dan missed "Support O_DIRECT to a mapped DAX file". More generally, if we > > want to be able to do any kind of I/O directly to persistent memory, > > and I think we do, we need to do one of: > > > > 1. Construct struct pages for persistent memory > > 1a. Permanently > > 1b. While the pages are under I/O > > 2. Teach the I/O layers to deal in PFNs instead of struct pages > > 3. Replace struct page with some other structure that can represent both > > DRAM and PMEM > > > > I'm personally a fan of #3, and I was looking at the scatterlist as > > my preferred data structure. I now believe the scatterlist as it is > > currently defined isn't sufficient, so we probably end up needing a new > > data structure. I think Dan's preferred method of replacing struct > > pages with PFNs is actually less instrusive, but doesn't give us as > > much advantage (an entirely new data structure would let us move to an > > extent based system at the same time, instead of sticking with an array > > of pages). Clearly Boaz prefers 1a, which works well enough for the > > 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs. > > > > What's your preference? I guess option 0 is "force all I/O to go > > through the page cache and then get copied", but that feels like a nasty > > performance hit. > > Thanks Matthew, you have summarized it perfectly. > > I think #1b might have merit, as well. It would be interesting to see what a 1b implementation looks like and how it performs. We already allocate a bunch of temporary things to support in-flight IO (bio, request) and allocating pageframes on the same basis seems a fairly logical fit. It is all a bit of a stopgap, designed to shoehorn direct-io-to-dax-mapped-memory into the existing world. Longer term I'd expect us to move to something more powerful, but it's unclear what that will be at this time, so a stopgap isn't too bad? This is all contingent upon the prevalence of machines which have vast amounts of nv memory and relatively small amounts of regular memory. How confident are we that this really is the future? ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-19 19:59 ` Andrew Morton @ 2015-03-19 20:59 ` Dan Williams 2015-03-22 17:22 ` Boaz Harrosh 2015-03-20 17:32 ` Wols Lists 2015-03-22 10:30 ` Boaz Harrosh 2 siblings, 1 reply; 59+ messages in thread From: Dan Williams @ 2015-03-19 20:59 UTC (permalink / raw) To: Andrew Morton Cc: Boaz Harrosh, linux-arch, Jens Axboe, riel, linux-raid, linux-nvdimm, Dave Hansen, linux-kernel@vger.kernel.org, Christoph Hellwig, Mel Gorman, linux-fsdevel On Thu, Mar 19, 2015 at 12:59 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > On Thu, 19 Mar 2015 17:54:15 +0200 Boaz Harrosh <boaz@plexistor.com> wrote: > >> On 03/19/2015 03:43 PM, Matthew Wilcox wrote: >> <> >> > >> > Dan missed "Support O_DIRECT to a mapped DAX file". More generally, if we >> > want to be able to do any kind of I/O directly to persistent memory, >> > and I think we do, we need to do one of: >> > >> > 1. Construct struct pages for persistent memory >> > 1a. Permanently >> > 1b. While the pages are under I/O >> > 2. Teach the I/O layers to deal in PFNs instead of struct pages >> > 3. Replace struct page with some other structure that can represent both >> > DRAM and PMEM >> > >> > I'm personally a fan of #3, and I was looking at the scatterlist as >> > my preferred data structure. I now believe the scatterlist as it is >> > currently defined isn't sufficient, so we probably end up needing a new >> > data structure. I think Dan's preferred method of replacing struct >> > pages with PFNs is actually less instrusive, but doesn't give us as >> > much advantage (an entirely new data structure would let us move to an >> > extent based system at the same time, instead of sticking with an array >> > of pages). Clearly Boaz prefers 1a, which works well enough for the >> > 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs. >> > >> > What's your preference? I guess option 0 is "force all I/O to go >> > through the page cache and then get copied", but that feels like a nasty >> > performance hit. >> >> Thanks Matthew, you have summarized it perfectly. >> >> I think #1b might have merit, as well. > > It would be interesting to see what a 1b implementation looks like and > how it performs. We already allocate a bunch of temporary things to > support in-flight IO (bio, request) and allocating pageframes on the > same basis seems a fairly logical fit. At least for block-i/o it seems the only place we really need struct page infrastructure is for kmap(). Given we already need a kmap_pfn() solution for option 2 a "dynamic allocation" stop along that development path may just naturally fall out. ^ permalink raw reply [flat|nested] 59+ messages in thread
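On a 64-bit machine where the pmem range happens to sit in the kernel direct
map, such a kmap_pfn() could be as thin as the sketch below. That is an
assumption, not anything Dan posted; real pmem may instead be ioremap()ed, in
which case the driver has to provide the virtual address:

        /* Minimal sketch, assuming the pfn is covered by the direct map. */
        static inline void *kmap_pfn(unsigned long pfn)
        {
                return __va((phys_addr_t)pfn << PAGE_SHIFT);
        }

        static inline void kunmap_pfn(void *addr)
        {
                /* nothing to undo in this model */
        }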
* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-19 20:59 ` Dan Williams @ 2015-03-22 17:22 ` Boaz Harrosh 2015-03-22 17:22 ` Boaz Harrosh 0 siblings, 1 reply; 59+ messages in thread From: Boaz Harrosh @ 2015-03-22 17:22 UTC (permalink / raw) To: Dan Williams, Andrew Morton Cc: Boaz Harrosh, linux-arch, Jens Axboe, riel, linux-raid, linux-nvdimm, Dave Hansen, linux-kernel@vger.kernel.org, Christoph Hellwig, Mel Gorman, linux-fsdevel On 03/19/2015 10:59 PM, Dan Williams wrote: > > At least for block-i/o it seems the only place we really need struct > page infrastructure is for kmap(). Given we already need a kmap_pfn() > solution for option 2 a "dynamic allocation" stop along that > development path may just naturally fall out. Really? What about networked block-io, RDMA, FCoE emulated targets, mmapped pointers, virtual-machine bdev drivers? The block layer sits in the middle of the stack, not at the low end as you make it appear. There are lots of subsystems below the bio that tie into a page struct, and they will now stop working unless you do pfn_to_page(), which means a page-less pfn will now crash or will need to be rejected. So anywhere you have an if (page_less_pfn()) ... /* fail, or do something else like copy */ else page = pfn_to_page() you have a double code path in the Kernel, and that is a nightmare to maintain. (I'm here for you, believe me ;-) ) Thanks Boaz ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-19 19:59 ` Andrew Morton 2015-03-19 20:59 ` Dan Williams @ 2015-03-20 17:32 ` Wols Lists 2015-03-20 17:32 ` Wols Lists 2015-03-22 10:30 ` Boaz Harrosh 2 siblings, 1 reply; 59+ messages in thread From: Wols Lists @ 2015-03-20 17:32 UTC (permalink / raw) To: Andrew Morton, Boaz Harrosh Cc: Matthew Wilcox, linux-arch, axboe, riel, hch, linux-nvdimm, Dave Hansen, linux-kernel, linux-raid, mgorman, linux-fsdevel On 19/03/15 19:59, Andrew Morton wrote: > This is all contingent upon the prevalence of machines which have vast > amounts of nv memory and relatively small amounts of regular memory. > How confident are we that this really is the future? Somewhat off-topic, but it's also the past. I can't help thinking of the early Pick machines, which treated backing store as one giant permanent virtual memory. Back when 300Mb hard drives were HUGE. Cheers, Wol ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-19 19:59 ` Andrew Morton 2015-03-19 20:59 ` Dan Williams 2015-03-20 17:32 ` Wols Lists @ 2015-03-22 10:30 ` Boaz Harrosh 2015-03-22 10:30 ` Boaz Harrosh 2 siblings, 1 reply; 59+ messages in thread From: Boaz Harrosh @ 2015-03-22 10:30 UTC (permalink / raw) To: Andrew Morton Cc: Matthew Wilcox, linux-arch, axboe, riel, hch, linux-nvdimm, Dave Hansen, linux-kernel, linux-raid, mgorman, linux-fsdevel On 03/19/2015 09:59 PM, Andrew Morton wrote: > On Thu, 19 Mar 2015 17:54:15 +0200 Boaz Harrosh <boaz@plexistor.com> wrote: > >> On 03/19/2015 03:43 PM, Matthew Wilcox wrote: >> <> >>> >>> Dan missed "Support O_DIRECT to a mapped DAX file". More generally, if we >>> want to be able to do any kind of I/O directly to persistent memory, >>> and I think we do, we need to do one of: >>> >>> 1. Construct struct pages for persistent memory >>> 1a. Permanently >>> 1b. While the pages are under I/O >>> 2. Teach the I/O layers to deal in PFNs instead of struct pages >>> 3. Replace struct page with some other structure that can represent both >>> DRAM and PMEM >>> >>> I'm personally a fan of #3, and I was looking at the scatterlist as >>> my preferred data structure. I now believe the scatterlist as it is >>> currently defined isn't sufficient, so we probably end up needing a new >>> data structure. I think Dan's preferred method of replacing struct >>> pages with PFNs is actually less instrusive, but doesn't give us as >>> much advantage (an entirely new data structure would let us move to an >>> extent based system at the same time, instead of sticking with an array >>> of pages). Clearly Boaz prefers 1a, which works well enough for the >>> 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs. >>> >>> What's your preference? I guess option 0 is "force all I/O to go >>> through the page cache and then get copied", but that feels like a nasty >>> performance hit. >> >> Thanks Matthew, you have summarized it perfectly. >> >> I think #1b might have merit, as well. > > It would be interesting to see what a 1b implementation looks like and > how it performs. We already allocate a bunch of temporary things to > support in-flight IO (bio, request) and allocating pageframes on the > same basis seems a fairly logical fit.

There are a couple of ways we can do this. They are all kind of "hacks" to me, along the lines of how transparent huge pages is a hack, a very nice one at that, and everyone who knows me knows I love hacks, but a hack it is nevertheless. It is all about designating the page to mean something else when a flag is set, and the transparent-huge-pages code is actually the core of this, because there is already a switch on core page operations when it is present (for example get/put_page). And because we do not want to allocate the pages inline, as part of a section, we also need a bit of a new memory_model.h define. (Maybe this can be avoided; I need to stare harder at it.)

> > It is all a bit of a stopgap, designed to shoehorn > direct-io-to-dax-mapped-memory into the existing world. Longer term > I'd expect us to move to something more powerful, but it's unclear what > that will be at this time, so a stopgap isn't too bad? >

I'd bet real huge-pages are the long-term answer. The one snag with huge-pages is that no one wants to dirty a full 2M for two changed bytes; 4k is the IO performance granularity we all calculate for. This can be solved in a couple of ways, all very invasive to lots of Kernel areas. Lots of times the problem is "where do you start?"

> > This is all contingent upon the prevalence of machines which have vast > amounts of nv memory and relatively small amounts of regular memory. > How confident are we that this really is the future? >

One thing you guys are ignoring is that the 1.5% "waste" can come from nv-memory. If real RAM is scarce and nv-ram is dirt cheap, just allocate the pages from nvram then. And do not forget what comes very soon after the availability of real nvram, I mean not the battery-backed kind but the real thing like MRAM or ReRAM: lots of machines will be 100% nv-ram + SRAM caches. This has nothing to do with storage speed, it is about power consumption. The machine shuts off and picks up exactly where it was. (Even at power-on they consume much less, no refreshes.) In those machines a partition of storage, say the swap partition, will be the volatile memory section of the machine, zeroed out on boot and used as RAM. So the future above does not exist. Pages can just be allocated from the cheapest memory you have and be done with it. (BTW all this can already be done now, I have demonstrated it in the lab: a reserved NvDIMM memory region is memory_hot_plugged and is thereafter used as regular RAM.) Thanks Boaz ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-19 13:43 ` Matthew Wilcox 2015-03-19 13:43 ` Matthew Wilcox 2015-03-19 15:54 ` [Linux-nvdimm] " Boaz Harrosh @ 2015-03-19 18:17 ` Christoph Hellwig 2015-03-19 19:31 ` Matthew Wilcox 2015-03-22 16:46 ` Boaz Harrosh 2015-03-20 16:21 ` Rik van Riel 3 siblings, 2 replies; 59+ messages in thread From: Christoph Hellwig @ 2015-03-19 18:17 UTC (permalink / raw) To: Matthew Wilcox Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe, riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel On Thu, Mar 19, 2015 at 09:43:13AM -0400, Matthew Wilcox wrote: > Dan missed "Support O_DIRECT to a mapped DAX file". More generally, if we > want to be able to do any kind of I/O directly to persistent memory, > and I think we do, we need to do one of: > > 1. Construct struct pages for persistent memory > 1a. Permanently > 1b. While the pages are under I/O > 2. Teach the I/O layers to deal in PFNs instead of struct pages > 3. Replace struct page with some other structure that can represent both > DRAM and PMEM > > I'm personally a fan of #3, and I was looking at the scatterlist as > my preferred data structure. I now believe the scatterlist as it is > currently defined isn't sufficient, so we probably end up needing a new > data structure. I think Dan's preferred method of replacing struct > pages with PFNs is actually less instrusive, but doesn't give us as > much advantage (an entirely new data structure would let us move to an > extent based system at the same time, instead of sticking with an array > of pages). Clearly Boaz prefers 1a, which works well enough for the > 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs. > > What's your preference? I guess option 0 is "force all I/O to go > through the page cache and then get copied", but that feels like a nasty > performance hit. In addition to the options there's also a time line. At least for the short term where we want to get something going 1a seems like the absolutely be option. It works perfectly fine for the lots of small capacity dram-like nvdimms, and it works funtionally fine for the special huge ones, although the resource use for it is highly annoying. If it turns out to be too annoying we can also offer a no I/O possible option for them in the short run. In the long run option 2) sounds like a good plan to me, but not as a parallel I/O path, but as the main one. Doing so will in fact give us options to experiment with 3). Given that we're moving towards an increasinly huge page using world replacing the good old struct page with something extent-like and/or temporary might be needed for dram as well in the future. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-19 18:17 ` Christoph Hellwig @ 2015-03-19 19:31 ` Matthew Wilcox 2015-03-22 16:46 ` Boaz Harrosh 1 sibling, 0 replies; 59+ messages in thread From: Matthew Wilcox @ 2015-03-19 19:31 UTC (permalink / raw) To: Christoph Hellwig Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe, riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman, linux-fsdevel On Thu, Mar 19, 2015 at 11:17:25AM -0700, Christoph Hellwig wrote: > On Thu, Mar 19, 2015 at 09:43:13AM -0400, Matthew Wilcox wrote: > > 1. Construct struct pages for persistent memory > > 1a. Permanently > > 1b. While the pages are under I/O > > 2. Teach the I/O layers to deal in PFNs instead of struct pages > > 3. Replace struct page with some other structure that can represent both > > DRAM and PMEM > > In addition to the options there's also a time line. At least for the > short term where we want to get something going 1a seems like the > absolutely be option. It works perfectly fine for the lots of small (assuming "best option") > capacity dram-like nvdimms, and it works funtionally fine for the > special huge ones, although the resource use for it is highly annoying. > If it turns out to be too annoying we can also offer a no I/O possible > option for them in the short run. > > In the long run option 2) sounds like a good plan to me, but not as a > parallel I/O path, but as the main one. Doing so will in fact give us > options to experiment with 3). Given that we're moving towards an > increasinly huge page using world replacing the good old struct page > with something extent-like and/or temporary might be needed for dram > as well in the future. Dan's patches don't actually make it a "parallel I/O path", that was Boaz's mischaracterisation. They move all scatterlists and bios over to using PFNs, at least on architectures which have been converted. Speaking of architectures not being converted, it is really past time for architectures to be switched to supporting SG chaining. It was introduced in 2007, and not having it generically available causes problems for the crypto layer, as well as making further enhancements more tricky. Assuming 'select ARCH_HAS_SG_CHAIN' is sufficient to tell, the following architectures do support it: arm arm64 ia64 powerpc s390 sparc x86 which means the following architectures are 8 years delinquent in adding support: alpha arc avr32 blackfin c6x cris frv hexagon m32r m68k metag microblaze mips mn10300 nios2 openrisc parisc score sh tile um unicore32 xtensa Perhaps we could deliberately make asm-generic/scatterlist.h not build for architectures that don't select it in order to make them convert ... ^ permalink raw reply [flat|nested] 59+ messages in thread
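For reference, chained scatterlists let one logical list span multiple table allocations; the core of the API Matthew refers to is sg_chain(), and an architecture opts in by selecting ARCH_HAS_SG_CHAIN. A minimal, illustrative-only usage sketch (the two 8-entry tables are assumed to be provided by the caller):

	#include <linux/scatterlist.h>

	static void chain_two_tables(struct scatterlist *first, struct scatterlist *second)
	{
		sg_init_table(first, 8);
		sg_init_table(second, 8);
		/* the last slot of 'first' becomes a link entry pointing at 'second';
		 * iteration with sg_next() then follows the chain transparently */
		sg_chain(first, 8, second);
	}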
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-19 18:17 ` Christoph Hellwig 2015-03-19 19:31 ` Matthew Wilcox @ 2015-03-22 16:46 ` Boaz Harrosh 1 sibling, 0 replies; 59+ messages in thread From: Boaz Harrosh @ 2015-03-22 16:46 UTC (permalink / raw) To: Christoph Hellwig, Matthew Wilcox Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe, riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman, linux-fsdevel On 03/19/2015 08:17 PM, Christoph Hellwig wrote: <> > > In addition to the options there's also a time line. At least for the > short term where we want to get something going 1a seems like the > absolutely be option. It works perfectly fine for the lots of small > capacity dram-like nvdimms, and it works funtionally fine for the > special huge ones, although the resource use for it is highly annoying. > If it turns out to be too annoying we can also offer a no I/O possible > option for them in the short run. > Finally, a voice in the desert. > In the long run option 2) sounds like a good plan to me, but not as a > parallel I/O path, but as the main one. Doing so will in fact give us > options to experiment with 3). Given that we're moving towards an > increasinly huge page using world replacing the good old struct page > with something extent-like and/or temporary might be needed for dram > as well in the future. Why? Why not just make a page mean page_size(page), and mostly even that is not needed. Any changes to bio will only solve bio, and will push the problem to the next subsystem. Fix the PAGE_SIZE problem and you have fixed it for all subsystems, not only bio. And I believe it is the smaller change by far, because in most places PAGE_SIZE just means MIN_PAGE_SIZE when we calculate array sizes for the storage of a given "io-length"; this is surely 4k, but then when the actual I/O is performed at run time we usually have a length specifier like bv_len. (And the few places that do not are easy to fix, I believe.) Thanks Boaz ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-19 13:43 ` Matthew Wilcox ` (2 preceding siblings ...) 2015-03-19 18:17 ` Christoph Hellwig @ 2015-03-20 16:21 ` Rik van Riel 2015-03-20 20:31 ` Matthew Wilcox 2015-03-22 15:51 ` Boaz Harrosh 3 siblings, 2 replies; 59+ messages in thread From: Rik van Riel @ 2015-03-20 16:21 UTC (permalink / raw) To: Matthew Wilcox, Andrew Morton Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel, Michael S. Tsirkin On 03/19/2015 09:43 AM, Matthew Wilcox wrote: > 1. Construct struct pages for persistent memory > 1a. Permanently > 1b. While the pages are under I/O Michael Tsirkin and I have been doing some thinking about what it would take to allocate struct pages per 2MB area permanently, and allocate additional struct pages for 4kB pages on demand, when a 2MB area is broken up into 4kB pages. This should work for both DRAM and persistent memory. I am still not convinced it is worthwhile to have struct pages for persistent memory though, but I am willing to change my mind. -- All rights reversed ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-20 16:21 ` Rik van Riel @ 2015-03-20 20:31 ` Matthew Wilcox 2015-03-20 21:08 ` Rik van Riel ` (2 more replies) 2015-03-22 15:51 ` Boaz Harrosh 1 sibling, 3 replies; 59+ messages in thread From: Matthew Wilcox @ 2015-03-20 20:31 UTC (permalink / raw) To: Rik van Riel Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel, Michael S. Tsirkin On Fri, Mar 20, 2015 at 12:21:34PM -0400, Rik van Riel wrote: > On 03/19/2015 09:43 AM, Matthew Wilcox wrote: > > > 1. Construct struct pages for persistent memory > > 1a. Permanently > > 1b. While the pages are under I/O > > Michael Tsirkin and I have been doing some thinking about what > it would take to allocate struct pages per 2MB area permanently, > and allocate additional struct pages for 4kB pages on demand, > when a 2MB area is broken up into 4kB pages. Ah! I've looked at that a couple of times as well. I asked our database performance team what impact freeing up the memmap would have on their performance. They told me that doubling the amount of memory generally resulted in approximately a 40% performance improvement. So freeing up 1.5% additional memory would result in about 0.6% performance improvement, which I thought was probably too small a return on investment to justify turning memmap into a two-level data structure. Persistent memory might change that calculation somewhat ... but I'm not convinced. Certainly, if we already had the ability to allocate 'struct superpage', I wouldn't be pushing for page-less I/Os, I'd just allocate these data structures for PM. Even if they were 128 bytes in size, that's only a 25MB overhead per 400GB NV-DIMM, which feels quite reasonable to me. > This should work for both DRAM and persistent memory. > > I am still not convinced it is worthwhile to have struct pages > for persistent memory though, but I am willing to change my mind. There's a lot of code out there that relies on struct page being PAGE_SIZE bytes. I'm cool with replacing 'struct page' with 'struct superpage' [1] in the biovec and auditing all of the code which touches it ... but that's going to be a lot of code! I'm not sure it's less code than going directly to 'just do I/O on PFNs'. [1] Please, somebody come up with a better name! ^ permalink raw reply [flat|nested] 59+ messages in thread
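Matthew's 25 MB figure follows directly from the numbers he gives, assuming one 128-byte descriptor per 2 MB superpage:

	400 GB / 2 MB        = 204,800 superpages per NV-DIMM
	204,800 * 128 bytes  = 26,214,400 bytes ~= 25 MB  (about 0.006% of the device)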
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-20 20:31 ` Matthew Wilcox @ 2015-03-20 21:08 ` Rik van Riel 2015-03-20 21:08 ` Rik van Riel 2015-03-22 17:06 ` Boaz Harrosh 2015-03-20 21:17 ` Wols Lists 2015-03-22 16:24 ` Boaz Harrosh 2 siblings, 2 replies; 59+ messages in thread From: Rik van Riel @ 2015-03-20 21:08 UTC (permalink / raw) To: Matthew Wilcox Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel, Michael S. Tsirkin On 03/20/2015 04:31 PM, Matthew Wilcox wrote: > On Fri, Mar 20, 2015 at 12:21:34PM -0400, Rik van Riel wrote: >> On 03/19/2015 09:43 AM, Matthew Wilcox wrote: >> >>> 1. Construct struct pages for persistent memory >>> 1a. Permanently >>> 1b. While the pages are under I/O >> >> Michael Tsirkin and I have been doing some thinking about what >> it would take to allocate struct pages per 2MB area permanently, >> and allocate additional struct pages for 4kB pages on demand, >> when a 2MB area is broken up into 4kB pages. > > Ah! I've looked at that a couple of times as well. I asked our database > performance team what impact freeing up the memmap would have on their > performance. They told me that doubling the amount of memory generally > resulted in approximately a 40% performance improvement. So freeing up > 1.5% additional memory would result in about 0.6% performance improvement, > which I thought was probably too small a return on investment to justify > turning memmap into a two-level data structure. Agreed, it should not be done for memory savings alone, but only if it helps improve all kinds of other things. >> This should work for both DRAM and persistent memory. >> >> I am still not convinced it is worthwhile to have struct pages >> for persistent memory though, but I am willing to change my mind. > > There's a lot of code out there that relies on struct page being PAGE_SIZE > bytes. I'm cool with replacing 'struct page' with 'struct superpage' > [1] in the biovec and auditing all of the code which touches it ... but > that's going to be a lot of code! I'm not sure it's less code than > going directly to 'just do I/O on PFNs'. Totally agreed here. I see absolutely no advantage to teaching the IO layer about a "struct superpage" when it could operate on PFNs just as easily. -- All rights reversed ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-20 21:08 ` Rik van Riel 2015-03-20 21:08 ` Rik van Riel @ 2015-03-22 17:06 ` Boaz Harrosh 2015-03-22 17:22 ` Dan Williams 1 sibling, 1 reply; 59+ messages in thread From: Boaz Harrosh @ 2015-03-22 17:06 UTC (permalink / raw) To: Rik van Riel, Matthew Wilcox Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel, Michael S. Tsirkin On 03/20/2015 11:08 PM, Rik van Riel wrote: > On 03/20/2015 04:31 PM, Matthew Wilcox wrote: > <> >> There's a lot of code out there that relies on struct page being PAGE_SIZE >> bytes. I'm cool with replacing 'struct page' with 'struct superpage' >> [1] in the biovec and auditing all of the code which touches it ... but >> that's going to be a lot of code! I'm not sure it's less code than >> going directly to 'just do I/O on PFNs'. > > Totally agreed here. I see absolutely no advantage to teaching the > IO layer about a "struct superpage" when it could operate on PFNs > just as easily. > Or teaching 'struct page' to be variable length. That is already so at the bio and sg level, so you have fixed nothing. Moving to pfns only means that all the unnamed code above that "relies on struct page being PAGE_SIZE" is no longer allowed to interface with the bio and sg list, which in current code and in Dan's patches means two tons of BUG_ONs and return -ENOTSUPP for all the subsystems below the bio and sglist that operate on page structs. Say the code that "relies on struct page being PAGE_SIZE" is such hard work to fix, which it is not at all at the bio and sg-list level; would it not be worthwhile fixing that instead of alienating the whole Kernel from the IO subsystem? And I believe it is the much, much smaller change, especially considering networking, RDMA, shared memory ... Cheers Boaz ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-22 17:06 ` Boaz Harrosh @ 2015-03-22 17:22 ` Dan Williams 2015-03-22 17:22 ` Dan Williams 2015-03-22 17:39 ` Boaz Harrosh 0 siblings, 2 replies; 59+ messages in thread From: Dan Williams @ 2015-03-22 17:22 UTC (permalink / raw) To: Boaz Harrosh Cc: Rik van Riel, Matthew Wilcox, Andrew Morton, linux-kernel@vger.kernel.org, linux-arch, Jens Axboe, linux-nvdimm, Dave Hansen, linux-raid, Mel Gorman, Christoph Hellwig, linux-fsdevel, Michael S. Tsirkin On Sun, Mar 22, 2015 at 10:06 AM, Boaz Harrosh <boaz@plexistor.com> wrote: > On 03/20/2015 11:08 PM, Rik van Riel wrote: >> On 03/20/2015 04:31 PM, Matthew Wilcox wrote: > <> >>> There's a lot of code out there that relies on struct page being PAGE_SIZE >>> bytes. I'm cool with replacing 'struct page' with 'struct superpage' >>> [1] in the biovec and auditing all of the code which touches it ... but >>> that's going to be a lot of code! I'm not sure it's less code than >>> going directly to 'just do I/O on PFNs'. >> >> Totally agreed here. I see absolutely no advantage to teaching the >> IO layer about a "struct superpage" when it could operate on PFNs >> just as easily. >> > > Or teaching 'struct page' to be variable length, This is already so at > bio and sg level so you fixed nothing. > > Moving to pfn's only means that all this unnamed code above that > "relies on struct page being PAGE_SIZE" is now not allowed to > interfaced with bio and sg list. Which in current code and in Dan's patches > means two tons of BUG_ONS and return -ENOTSUPP . For all these > subsystems below the bio and sglist that operate on page_structs I'm not convinced it will be that bad. In hyperbolic terms, continuing to overload struct page means we get to let floppy.c do i/o from pmem, who needs that level of compatibility? Similar to sg_chain support I think it's fine to let sub-systems / archs add pmem i/o support over time. It's a scaling problem our development model is good at. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-22 17:22 ` Dan Williams 2015-03-22 17:22 ` Dan Williams @ 2015-03-22 17:39 ` Boaz Harrosh 2015-03-22 17:39 ` Boaz Harrosh 1 sibling, 1 reply; 59+ messages in thread From: Boaz Harrosh @ 2015-03-22 17:39 UTC (permalink / raw) To: Dan Williams Cc: Rik van Riel, Matthew Wilcox, Andrew Morton, linux-kernel@vger.kernel.org, linux-arch, Jens Axboe, linux-nvdimm, Dave Hansen, linux-raid, Mel Gorman, Christoph Hellwig, linux-fsdevel, Michael S. Tsirkin On 03/22/2015 07:22 PM, Dan Williams wrote: > On Sun, Mar 22, 2015 at 10:06 AM, Boaz Harrosh <boaz@plexistor.com> wrote: <> >> >> Moving to pfn's only means that all this unnamed code above that >> "relies on struct page being PAGE_SIZE" is now not allowed to >> interfaced with bio and sg list. Which in current code and in Dan's patches >> means two tons of BUG_ONS and return -ENOTSUPP . For all these >> subsystems below the bio and sglist that operate on page_structs > > I'm not convinced it will be that bad. In hyperbolic terms, > continuing to overload struct page means we get to let floppy.c do i/o > from pmem, who needs that level of compatibility? > But you do need to make sure it does not crash. right? > Similar to sg_chain support I think it's fine to let sub-systems / > archs add pmem i/o support over time. It's a scaling problem our > development model is good at. > You are so eager to do all this massive change, and willing to do it over a decade (Judging by your own example of sg-chain) But you completely ignore the fact that what I'm saying is that nothing needs to fundamentally change at all. No support over time and no "scaling problem" at all. All we want to fix is that page-struct means NOT PAGE_SIZE but some other size. The much smaller change and full cross Kernel compatibility. What's not to like ? Cheers Boaz ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-20 20:31 ` Matthew Wilcox 2015-03-20 21:08 ` Rik van Riel @ 2015-03-20 21:17 ` Wols Lists 2015-03-20 21:17 ` Wols Lists 2015-03-22 16:24 ` Boaz Harrosh 2 siblings, 1 reply; 59+ messages in thread From: Wols Lists @ 2015-03-20 21:17 UTC (permalink / raw) To: Matthew Wilcox, Rik van Riel Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel, Michael S. Tsirkin On 20/03/15 20:31, Matthew Wilcox wrote: > Ah! I've looked at that a couple of times as well. I asked our database > performance team what impact freeing up the memmap would have on their > performance. They told me that doubling the amount of memory generally > resulted in approximately a 40% performance improvement. So freeing up > 1.5% additional memory would result in about 0.6% performance improvement, > which I thought was probably too small a return on investment to justify > turning memmap into a two-level data structure. Don't get me started on databases! This is very much a relational problem, other databases don't suffer like this. (imho relational theory is totally inappropriate for an engineering problem, like designing a database engine ...) Cheers, Wol ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-20 20:31 ` Matthew Wilcox 2015-03-20 21:08 ` Rik van Riel 2015-03-20 21:17 ` Wols Lists @ 2015-03-22 16:24 ` Boaz Harrosh 2015-03-22 16:24 ` Boaz Harrosh 2 siblings, 1 reply; 59+ messages in thread From: Boaz Harrosh @ 2015-03-22 16:24 UTC (permalink / raw) To: Matthew Wilcox, Rik van Riel Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel, Michael S. Tsirkin On 03/20/2015 10:31 PM, Matthew Wilcox wrote: <> > > There's a lot of code out there that relies on struct page being PAGE_SIZE > bytes. Not so much really, not at the lower end of the stack. You can actually do vp = kmalloc(64K); bv_page = virt_to_page(vp); bv_len = 64K and feed that to a hard drive. It works. The last stronghold of PAGE_SIZE is the page-cache and page-fault granularity, where the minimum is the better choice. But it should not be hard to clean up the lower end of the stack, and even introduce a page_size(page). You will find that every subsystem that can work with a sub-page size, like bv_len above, will also work well with a bigger-than-PAGE_SIZE bv_len equivalent; only the BUG_ONs need to convert to page_size(page) instead of PAGE_SIZE. > I'm cool with replacing 'struct page' with 'struct superpage' > [1] in the biovec and auditing all of the code which touches it ... but > that's going to be a lot of code! I'm not sure it's less code than > going directly to 'just do I/O on PFNs'. > struct page already knows how to be a super-page, with the THP mechanics. All a page_size(page) needs is a call to its section; we do not need any added storage in page-struct. (And we can cache this as a flag; we actually already have a flag.) It looks like you are very trigger-happy to change the "biovec and ... all of the code which touches it". I believe that in the long, long term your #1b is the correct "full audit" path: the page is the virtual-to-page-to-physical descriptor + state, and it is variable size. > [1] Please, somebody come up with a better name! Sure: struct page *page. The one to kill is PAGE_SIZE. In most current code it can just be MIN_PAGE_SIZE, with CACHE_PAGE_SIZE == MIN_PAGE_SIZE. The only novelty is enhancing split_huge_page for the "page-fault-granularity" case. Thanks Boaz ^ permalink raw reply [flat|nested] 59+ messages in thread
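Spelled out a little more fully, the example Boaz gives would look roughly like the sketch below, written against the 4.0-era block API (bdev, sector and error handling are assumed to come from the caller; whether a given driver actually accepts a bv_len larger than PAGE_SIZE is exactly the point under debate):

	/* Sketch only: submit one physically contiguous 64K buffer as a single bio_vec. */
	static void write_64k(struct block_device *bdev, sector_t sector, void *buf)
	{
		struct bio *bio = bio_alloc(GFP_KERNEL, 1);

		bio->bi_bdev = bdev;
		bio->bi_iter.bi_sector = sector;
		/* one bvec covering 16 pages: bv_len > PAGE_SIZE */
		bio_add_page(bio, virt_to_page(buf), SZ_64K, offset_in_page(buf));
		submit_bio(WRITE, bio);	/* completion handling omitted */
	}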
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-20 16:21 ` Rik van Riel 2015-03-20 20:31 ` Matthew Wilcox @ 2015-03-22 15:51 ` Boaz Harrosh 2015-03-23 15:19 ` Rik van Riel 1 sibling, 1 reply; 59+ messages in thread From: Boaz Harrosh @ 2015-03-22 15:51 UTC (permalink / raw) To: Rik van Riel, Matthew Wilcox, Andrew Morton Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel, Michael S. Tsirkin On 03/20/2015 06:21 PM, Rik van Riel wrote: > On 03/19/2015 09:43 AM, Matthew Wilcox wrote: > >> 1. Construct struct pages for persistent memory >> 1a. Permanently >> 1b. While the pages are under I/O > > Michael Tsirkin and I have been doing some thinking about what > it would take to allocate struct pages per 2MB area permanently, > and allocate additional struct pages for 4kB pages on demand, > when a 2MB area is broken up into 4kB pages. > > This should work for both DRAM and persistent memory. > My thoughts as well; this need *not* be a huge invasive change. It is, however, careful surgery in very core code, and lots of sleepless scary nights and testing to make sure all the side effects are ironed out. BTW: basic core block code may very well work with bv_page, bv_len > PAGE_SIZE, bv_offset > PAGE_SIZE, meaning the bv_page pfn range is contiguous in physical space (and virtual, of course). So much so that there are already rumors that this is supposed to be supported, and there are already out-of-tree drivers that use this today by kmalloc'ing a page order and feeding BIOs with bv_len=64K. But go outside the block layer, say to networking via iscsi, and this breaks pretty fast. Let us fix that then; let us introduce a page_size(page). A page already knows its size (i.e. whether it belongs to a 2M THP). > I am still not convinced it is worthwhile to have struct pages > for persistent memory though, but I am willing to change my mind. > If we want copy-less, we need a common memory descriptor carrier. Today this is page-struct. So for me your above statement means: "still not convinced I care about copy-less pmem". Otherwise you either enhance what you have today or devise a new system, which means changing the whole Kernel. Lastly: why does pmem need to wait out-of-tree? Even you say above that machines with lots of DRAM can enjoy the HUGE-to-4k split. So why not let pmem waste 4k pages like everyone else and fix it as above down the line, both for pmem and ram, and save both ways? Why do we need to first change the whole Kernel, then have pmem? Why not use the current infrastructure, for good or for worse, and incrementally do better? May I call you on the phone to try and work things out? I believe the huge page thing + 4k on demand is not a very big change, as long as struct page *page is left as is, everywhere, but may *now* carry a different physically/virtually contiguous payload bigger than 4k. Is not PAGE_SIZE the real bug? Let's fix that problem. Thanks Boaz ^ permalink raw reply [flat|nested] 59+ messages in thread
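The helper Boaz keeps asking for is small. One plausible definition, in terms of the existing compound-page machinery (a sketch of the proposal, not something in the tree at this point), would be:

	/* Size covered by a (possibly compound/huge) head page: PAGE_SIZE for an
	 * ordinary page, 2 MB for an x86 THP head page (compound_order() == 9). */
	static inline unsigned long page_size(struct page *page)
	{
		return PAGE_SIZE << compound_order(page);
	}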
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-22 15:51 ` Boaz Harrosh @ 2015-03-23 15:19 ` Rik van Riel 2015-03-23 15:19 ` Rik van Riel ` (2 more replies) 0 siblings, 3 replies; 59+ messages in thread From: Rik van Riel @ 2015-03-23 15:19 UTC (permalink / raw) To: Boaz Harrosh, Matthew Wilcox, Andrew Morton Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel, Michael S. Tsirkin On 03/22/2015 11:51 AM, Boaz Harrosh wrote: > On 03/20/2015 06:21 PM, Rik van Riel wrote: >> On 03/19/2015 09:43 AM, Matthew Wilcox wrote: >> >>> 1. Construct struct pages for persistent memory >>> 1a. Permanently >>> 1b. While the pages are under I/O >> >> Michael Tsirkin and I have been doing some thinking about what >> it would take to allocate struct pages per 2MB area permanently, >> and allocate additional struct pages for 4kB pages on demand, >> when a 2MB area is broken up into 4kB pages. >> >> This should work for both DRAM and persistent memory. > > My thoughts as well, this need *not* be a huge evasive change. Is however > a careful surgery in very core code. And lots of sleepless scary nights > and testing to make sure all the side effects are wrinkled out. Even the above IS a huge invasive change, and I do not see it as much better than the work Dan and Matthew are doing. > If we want copy-less, we need a common memory descriptor career. Today this > is page-struct. So for me your above statement means: > "still not convinced I care about copy-less pmem" > > Otherwise you either enhance what you have today or devise a new > system, which means change the all Kernel. We do not necessarily need a common descriptor, as much as one that abstracts out what is happening. Something like a struct bio could be a good I/O descriptor, and releasing the backing memory after IO completion could be a function of the bio freeing function itself. > Lastly: Why does pmem need to wait out-of-tree. Even you say above that > machines with lots of DRAM can enjoy the HUGE-to-4k split. So why > not let pmem waist 4k pages like everyone else and fix it as above > down the line, both for pmem and ram. And save both ways. > Why do we need to first change the all Kernel, then have pmem. Why not > use current infra structure, for good or for worth, and incrementally > do better. There are two things going on here: 1) You want to keep using struct page for now, while there are subsystems that require it. This is perfectly legitimate. 2) Matthew and Dan are changing over some subsystems to no longer require struct page. This is perfectly legitimate. I do not understand why either of you would have to object to what the other is doing. There is room to keep using struct page until the rest of the kernel no longer requires it. -- All rights reversed ^ permalink raw reply [flat|nested] 59+ messages in thread
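As a concrete illustration of "releasing the backing memory could be a function of the bio freeing function itself", a completion handler along these lines (sketched against the 4.0-era bi_end_io signature; the handler name is made up) would drop per-I/O pages when the bio completes:

	/* Sketch: free pages that were allocated only for the duration of this I/O. */
	static void transient_pages_end_io(struct bio *bio, int error)
	{
		struct bio_vec *bvec;
		int i;

		bio_for_each_segment_all(bvec, bio, i)
			__free_page(bvec->bv_page);
		bio_put(bio);
	}

	/* submitter side: bio->bi_end_io = transient_pages_end_io; */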
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer 2015-03-23 15:19 ` Rik van Riel 2015-03-23 15:19 ` Rik van Riel @ 2015-03-23 19:30 ` Christoph Hellwig 2015-03-23 19:30 ` Christoph Hellwig 2015-03-24 9:41 ` Boaz Harrosh 2 siblings, 1 reply; 59+ messages in thread From: Christoph Hellwig @ 2015-03-23 19:30 UTC (permalink / raw) To: Rik van Riel Cc: Boaz Harrosh, Matthew Wilcox, Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel, Michael S. Tsirkin On Mon, Mar 23, 2015 at 11:19:07AM -0400, Rik van Riel wrote: > There are two things going on here: > > 1) You want to keep using struct page for now, while there are > subsystems that require it. This is perfectly legitimate. > > 2) Matthew and Dan are changing over some subsystems to no longer > require struct page. This is perfectly legitimate. > > I do not understand why either of you would have to object to what > the other is doing. There is room to keep using struct page until > the rest of the kernel no longer requires it. *nod* I'd really like to merge the struct page based pmem driver ASAP. We can then look into work that avoid the need for struct page, and I think Dan is doing some good work in that direction. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
2015-03-23 15:19 ` Rik van Riel
2015-03-23 15:19 ` Rik van Riel
2015-03-23 19:30 ` Christoph Hellwig
@ 2015-03-24 9:41 ` Boaz Harrosh
2015-03-24 16:57 ` Rik van Riel
2 siblings, 2 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-24 9:41 UTC (permalink / raw)
To: Rik van Riel, Boaz Harrosh, Matthew Wilcox, Andrew Morton
Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm,
Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel, Michael S. Tsirkin

On 03/23/2015 05:19 PM, Rik van Riel wrote:
>>> Michael Tsirkin and I have been doing some thinking about what
>>> it would take to allocate struct pages per 2MB area permanently,
>>> and allocate additional struct pages for 4kB pages on demand,
>>> when a 2MB area is broken up into 4kB pages.
>>
>> My thoughts as well; this need *not* be a huge invasive change. It is,
>> however, careful surgery in very core code, and lots of sleepless scary
>> nights and testing to make sure all the side effects are ironed out.
>
> Even the above IS a huge invasive change, and I do not see it
> as much better than the work Dan and Matthew are doing.

You lost me again. Sorry for my slowness. The code I envision is not
invasive at all. Nothing is touched at all, except a few core places at
the page level. The contract with the kernel stays the same:
page_to_pfn, pfn_to_page, page_address (which is kmap_atomic in 64bit),
virt_to_page, page get/put, and so on. So none of the kernel code needs
to change at all.

You were saying that we might have a 2M page, and on demand we can
allocate 4k pages and shove them down the stack, which does not change
at all, and once back from I/O the 4k pages can be freed and recycled
for reuse with other I/O. This is what I thought you said.

This is doable, and not that much work, and for the life of me I do not
see any "invasive". (Yes, a few core headers so that everything
compiles ;-))

That said, I do not even think we need that (2M split to 4k on demand);
we can do better and make sure 2M pages just work as is. It is very
possible today (tested) to push a 2M page into a bio and write to a
bdev. Yes, lots of side code will break, but the core path is clean.
Let us fix that then. (Need I send code to show you how a 2M page is
written with a single bvec? A sketch follows below this message.)

>> If we want copy-less, we need a common memory descriptor carrier.
>> Today this is the page struct. So for me your above statement means:
>> "still not convinced I care about copy-less pmem"
>>
>> Otherwise you either enhance what you have today or devise a new
>> system, which means changing the whole kernel.
>
> We do not necessarily need a common descriptor, as much as
> one that abstracts out what is happening. Something like a
> struct bio could be a good I/O descriptor, and releasing the
> backing memory after I/O completion could be a function of the
> bio freeing function itself.

Lost me again, sorry. What backing memory? struct bio is already an I/O
descriptor which gets freed after use. How is that relevant to pfn vs
page?

>> Lastly: why does pmem need to wait out-of-tree? Even you say above that
>> machines with lots of DRAM can enjoy the HUGE-to-4k split. So why
>> not let pmem waste 4k pages like everyone else and fix it as above
>> down the line, both for pmem and RAM, and save both ways?
>> Why do we need to first change the whole kernel, then have pmem? Why
>> not use the current infrastructure, for better or worse, and
>> incrementally do better?
>
> There are two things going on here:
>
> 1) You want to keep using struct page for now, while there are
>    subsystems that require it. This is perfectly legitimate.
>
> 2) Matthew and Dan are changing over some subsystems to no longer
>    require struct page. This is perfectly legitimate.

How is this legitimate when you need to interface the [1] subsystems
under the [2] subsystem? A subsystem that expects pages is now not
usable by [2].

Today *all* the kernel subsystems are [1]. Period. How does it become
legitimate to now start *two* competing abstractions in our kernel that
do the same thing differently? We have too much diversity, not too
little.

> I do not understand why either of you would have to object to what
> the other is doing. There is room to keep using struct page until
> the rest of the kernel no longer requires it.

So this is your vision: "until the rest of the kernel no longer
requires pages". Really? Sigh. Coming from other kernels I thought
pages were a breath of fresh air; I thought it was very clever. And
BTW, good luck with that.

BTW: you have not solved the basic problem yet. For one, pfn_kmap():
given a pfn, what is its virtual address? Would you like to loop
through the kernel's range tables to look for the registered ioremap?
It is a long, annoying loop. The page was invented exactly for this
reason, to go through the section object. And actually it is not that
easy, because if it is an ioremap pointer it is in one list and if a
page it goes another way, and on top of all this, it is ARCH dependent.

And you are trashing highmem, because the state and locks for that are
at the page level. Not that I care about highmem, but I hate double
coding.

For god's sake, what do you guys have against poor old pages? They were
invented to do exactly this: abstract away management of a single
pfn-to-virt mapping. All I see is complaints about a page being 4K;
well, it need not be. A page can be any size, and hell, it can be
variable size. (And no, we do not need to add an extra size member; all
we need is the one bit.)

Cheers
Boaz
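A rough sketch of the single-bvec 2M write Boaz offers to show above,
assuming a 4.0-era kernel and a request queue whose limits accept a
2 MiB segment. write_2m_page() is a hypothetical helper invented for
illustration; filling the page with real data before submission is left
out, and whether bio_add_page() takes the full 2 MiB in one call depends
entirely on the queue behind the block device.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/mm.h>
#include <linux/sizes.h>

static int write_2m_page(struct block_device *bdev, sector_t sector)
{
	int order = get_order(SZ_2M);	/* order 9 with 4K pages */
	struct page *page = alloc_pages(GFP_KERNEL | __GFP_COMP, order);
	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
	int ret = -ENOMEM;

	if (!page || !bio)
		goto out;

	bio->bi_bdev = bdev;
	bio->bi_iter.bi_sector = sector;

	/* one bvec spanning the whole 2M compound page */
	if (bio_add_page(bio, page, SZ_2M, 0) != SZ_2M) {
		ret = -EIO;	/* queue limits refused a 2 MiB segment */
		goto out;
	}

	ret = submit_bio_wait(WRITE, bio);	/* 4.0-era two-argument form */
out:
	if (bio)
		bio_put(bio);
	if (page)
		__free_pages(page, order);
	return ret;
}

This is the "core path is clean" claim in concrete form: the bvec layer
itself has no problem describing a 2 MiB span, the constraints come from
per-queue limits and from side code that assumes PAGE_SIZE segments.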
* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
2015-03-24 9:41 ` Boaz Harrosh
2015-03-24 9:41 ` Boaz Harrosh
@ 2015-03-24 16:57 ` Rik van Riel
1 sibling, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2015-03-24 16:57 UTC (permalink / raw)
To: Boaz Harrosh, Matthew Wilcox, Andrew Morton
Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm,
Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel, Michael S. Tsirkin

On 03/24/2015 05:41 AM, Boaz Harrosh wrote:
> On 03/23/2015 05:19 PM, Rik van Riel wrote:
>> There are two things going on here:
>>
>> 1) You want to keep using struct page for now, while there are
>>    subsystems that require it. This is perfectly legitimate.
>>
>> 2) Matthew and Dan are changing over some subsystems to no longer
>>    require struct page. This is perfectly legitimate.
>
> How is this legitimate when you need to interface the [1] subsystems
> under the [2] subsystem? A subsystem that expects pages is now not
> usable by [2].
>
> Today *all* the kernel subsystems are [1]. Period.

That's not true. In the graphics subsystem it is very normal to mmap
graphics memory without ever using a struct page. There are other
callers of remap_pfn_range() too.

In these cases, refcounting is done by keeping a refcount on the
entire object, not on individual pages (since we have none).

> How does it become legitimate to now start *two* competing abstractions
> in our kernel that do the same thing differently? We have too much
> diversity, not too little.

We are already able to refcount either the whole object, or an
individual page. One issue is that not every subsystem can do the
whole-object refcounting, and that it would be nice to have the
refcounting done by one single interface.

If we want the code to be the same everywhere, we could achieve that
just as well with an abstraction as with a single data structure.
Maybe even something as simplistic as these, with the internals
automatically taking and releasing a refcount on the proper object:

  get_reference(file, memory_address)
  put_reference(file, memory_address)

-- 
All rights reversed
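A hypothetical sketch of the get_reference()/put_reference() interface
floated above; neither function exists in the kernel, and a pfn stands in
for the "memory_address" argument. The only point it illustrates is that
the caller stops caring whether a struct page backs the address: the
internals pick the per-page refcount when one exists and pin the whole
owning object otherwise.

#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/file.h>

static void get_reference(struct file *file, unsigned long pfn)
{
	if (pfn_valid(pfn))
		get_page(pfn_to_page(pfn));	/* page-backed: per-page refcount */
	else
		get_file(file);			/* page-less: pin the whole object */
}

static void put_reference(struct file *file, unsigned long pfn)
{
	if (pfn_valid(pfn))
		put_page(pfn_to_page(pfn));
	else
		fput(file);
}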