[RFC PATCH 0/7] evacuate struct page from the block layer

linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/7] evacuate struct page from the block layer
@ 2015-03-16 20:25 Dan Williams
  2015-03-16 20:25 ` Dan Williams
                   ` (3 more replies)
  0 siblings, 4 replies; 59+ messages in thread
From: Dan Williams @ 2015-03-16 20:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, axboe, riel, linux-nvdimm, Dave Hansen, linux-raid,
	mgorman, hch, linux-fsdevel, Matthew Wilcox

Avoid the impending disaster of requiring struct page coverage for what
is expected to be ever increasing capacities of persistent memory.  In
conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the
recently concluded Linux Storage Summit it became clear that struct page
is not required in many places, it was simply convenient to re-use.

Introduce helpers and infrastructure to remove struct page usage where
it is not necessary.  One use case for these changes is to implement a
write-back-cache in persistent memory for software-RAID.  Another use
case for the scatterlist changes is RDMA to a pfn-range.

This compiles and boots, but 0day-kbuild-robot coverage is needed before
this set exits "RFC".  Obviously, the coccinelle script needs to be
re-run on the block updates for kernel.next.  As is, this only includes
the resulting auto-generated-patch against 4.0-rc3.

---

Dan Williams (6):
      block: add helpers for accessing a bio_vec page
      block: convert bio_vec.bv_page to bv_pfn
      dma-mapping: allow archs to optionally specify a ->map_pfn() operation
      scatterlist: use sg_phys()
      x86: support dma_map_pfn()
      block: base support for pfn i/o

Matthew Wilcox (1):
      scatterlist: support "page-less" (__pfn_t only) entries


 arch/Kconfig                                 |    3 +
 arch/arm/mm/dma-mapping.c                    |    2 -
 arch/microblaze/kernel/dma.c                 |    2 -
 arch/powerpc/sysdev/axonram.c                |    2 -
 arch/x86/Kconfig                             |   12 +++
 arch/x86/kernel/amd_gart_64.c                |   22 ++++--
 arch/x86/kernel/pci-nommu.c                  |   22 ++++--
 arch/x86/kernel/pci-swiotlb.c                |    4 +
 arch/x86/pci/sta2x11-fixup.c                 |    4 +
 arch/x86/xen/pci-swiotlb-xen.c               |    4 +
 block/bio-integrity.c                        |    8 +-
 block/bio.c                                  |   83 +++++++++++++++------
 block/blk-core.c                             |    9 ++
 block/blk-integrity.c                        |    7 +-
 block/blk-lib.c                              |    2 -
 block/blk-merge.c                            |   15 ++--
 block/bounce.c                               |   26 +++----
 drivers/block/aoe/aoecmd.c                   |    8 +-
 drivers/block/brd.c                          |    2 -
 drivers/block/drbd/drbd_bitmap.c             |    5 +
 drivers/block/drbd/drbd_main.c               |    4 +
 drivers/block/drbd/drbd_receiver.c           |    4 +
 drivers/block/drbd/drbd_worker.c             |    3 +
 drivers/block/floppy.c                       |    6 +-
 drivers/block/loop.c                         |    8 +-
 drivers/block/nbd.c                          |    8 +-
 drivers/block/nvme-core.c                    |    2 -
 drivers/block/pktcdvd.c                      |   11 ++-
 drivers/block/ps3disk.c                      |    2 -
 drivers/block/ps3vram.c                      |    2 -
 drivers/block/rbd.c                          |    2 -
 drivers/block/rsxx/dma.c                     |    3 +
 drivers/block/umem.c                         |    2 -
 drivers/block/zram/zram_drv.c                |   10 +--
 drivers/dma/ste_dma40.c                      |    5 -
 drivers/iommu/amd_iommu.c                    |   21 ++++-
 drivers/iommu/intel-iommu.c                  |   26 +++++--
 drivers/iommu/iommu.c                        |    2 -
 drivers/md/bcache/btree.c                    |    4 +
 drivers/md/bcache/debug.c                    |    6 +-
 drivers/md/bcache/movinggc.c                 |    2 -
 drivers/md/bcache/request.c                  |    6 +-
 drivers/md/bcache/super.c                    |   10 +--
 drivers/md/bcache/util.c                     |    5 +
 drivers/md/bcache/writeback.c                |    2 -
 drivers/md/dm-crypt.c                        |   12 ++-
 drivers/md/dm-io.c                           |    2 -
 drivers/md/dm-verity.c                       |    2 -
 drivers/md/raid1.c                           |   50 +++++++------
 drivers/md/raid10.c                          |   38 +++++-----
 drivers/md/raid5.c                           |    6 +-
 drivers/mmc/card/queue.c                     |    4 +
 drivers/s390/block/dasd_diag.c               |    2 -
 drivers/s390/block/dasd_eckd.c               |   14 ++--
 drivers/s390/block/dasd_fba.c                |    6 +-
 drivers/s390/block/dcssblk.c                 |    2 -
 drivers/s390/block/scm_blk.c                 |    2 -
 drivers/s390/block/scm_blk_cluster.c         |    2 -
 drivers/s390/block/xpram.c                   |    2 -
 drivers/scsi/mpt2sas/mpt2sas_transport.c     |    6 +-
 drivers/scsi/mpt3sas/mpt3sas_transport.c     |    6 +-
 drivers/scsi/sd_dif.c                        |    4 +
 drivers/staging/android/ion/ion_chunk_heap.c |    4 +
 drivers/staging/lustre/lustre/llite/lloop.c  |    2 -
 drivers/xen/biomerge.c                       |    4 +
 drivers/xen/swiotlb-xen.c                    |   29 +++++--
 fs/btrfs/check-integrity.c                   |    6 +-
 fs/btrfs/compression.c                       |   12 ++-
 fs/btrfs/disk-io.c                           |    4 +
 fs/btrfs/extent_io.c                         |    8 +-
 fs/btrfs/file-item.c                         |    8 +-
 fs/btrfs/inode.c                             |   18 +++--
 fs/btrfs/raid56.c                            |    4 +
 fs/btrfs/volumes.c                           |    2 -
 fs/buffer.c                                  |    4 +
 fs/direct-io.c                               |    2 -
 fs/exofs/ore.c                               |    4 +
 fs/exofs/ore_raid.c                          |    2 -
 fs/ext4/page-io.c                            |    2 -
 fs/f2fs/data.c                               |    4 +
 fs/f2fs/segment.c                            |    2 -
 fs/gfs2/lops.c                               |    4 +
 fs/jfs/jfs_logmgr.c                          |    4 +
 fs/logfs/dev_bdev.c                          |   10 +--
 fs/mpage.c                                   |    2 -
 fs/splice.c                                  |    2 -
 include/asm-generic/dma-mapping-common.h     |   30 ++++++++
 include/asm-generic/memory_model.h           |    4 +
 include/asm-generic/scatterlist.h            |    6 ++
 include/crypto/scatterwalk.h                 |   10 +++
 include/linux/bio.h                          |   24 +++---
 include/linux/blk_types.h                    |   21 +++++
 include/linux/blkdev.h                       |    2 +
 include/linux/dma-debug.h                    |   23 +++++-
 include/linux/dma-mapping.h                  |    8 ++
 include/linux/scatterlist.h                  |  101 ++++++++++++++++++++++++--
 include/linux/swiotlb.h                      |    5 +
 kernel/power/block_io.c                      |    2 -
 lib/dma-debug.c                              |    4 +
 lib/swiotlb.c                                |   20 ++++-
 mm/iov_iter.c                                |   22 +++---
 mm/page_io.c                                 |    8 +-
 net/ceph/messenger.c                         |    2 -
 103 files changed, 658 insertions(+), 335 deletions(-)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
@ 2015-03-16 20:25 ` Dan Williams
  2015-03-16 20:25 ` [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn Dan Williams
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 59+ messages in thread
From: Dan Williams @ 2015-03-16 20:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, axboe, riel, linux-nvdimm, Dave Hansen, linux-raid,
	mgorman, hch, linux-fsdevel, Matthew Wilcox

Avoid the impending disaster of requiring struct page coverage for what
is expected to be ever increasing capacities of persistent memory.  In
conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the
recently concluded Linux Storage Summit it became clear that struct page
is not required in many places, it was simply convenient to re-use.

Introduce helpers and infrastructure to remove struct page usage where
it is not necessary.  One use case for these changes is to implement a
write-back-cache in persistent memory for software-RAID.  Another use
case for the scatterlist changes is RDMA to a pfn-range.

This compiles and boots, but 0day-kbuild-robot coverage is needed before
this set exits "RFC".  Obviously, the coccinelle script needs to be
re-run on the block updates for kernel.next.  As is, this only includes
the resulting auto-generated-patch against 4.0-rc3.

---

Dan Williams (6):
      block: add helpers for accessing a bio_vec page
      block: convert bio_vec.bv_page to bv_pfn
      dma-mapping: allow archs to optionally specify a ->map_pfn() operation
      scatterlist: use sg_phys()
      x86: support dma_map_pfn()
      block: base support for pfn i/o

Matthew Wilcox (1):
      scatterlist: support "page-less" (__pfn_t only) entries


 arch/Kconfig                                 |    3 +
 arch/arm/mm/dma-mapping.c                    |    2 -
 arch/microblaze/kernel/dma.c                 |    2 -
 arch/powerpc/sysdev/axonram.c                |    2 -
 arch/x86/Kconfig                             |   12 +++
 arch/x86/kernel/amd_gart_64.c                |   22 ++++--
 arch/x86/kernel/pci-nommu.c                  |   22 ++++--
 arch/x86/kernel/pci-swiotlb.c                |    4 +
 arch/x86/pci/sta2x11-fixup.c                 |    4 +
 arch/x86/xen/pci-swiotlb-xen.c               |    4 +
 block/bio-integrity.c                        |    8 +-
 block/bio.c                                  |   83 +++++++++++++++------
 block/blk-core.c                             |    9 ++
 block/blk-integrity.c                        |    7 +-
 block/blk-lib.c                              |    2 -
 block/blk-merge.c                            |   15 ++--
 block/bounce.c                               |   26 +++----
 drivers/block/aoe/aoecmd.c                   |    8 +-
 drivers/block/brd.c                          |    2 -
 drivers/block/drbd/drbd_bitmap.c             |    5 +
 drivers/block/drbd/drbd_main.c               |    4 +
 drivers/block/drbd/drbd_receiver.c           |    4 +
 drivers/block/drbd/drbd_worker.c             |    3 +
 drivers/block/floppy.c                       |    6 +-
 drivers/block/loop.c                         |    8 +-
 drivers/block/nbd.c                          |    8 +-
 drivers/block/nvme-core.c                    |    2 -
 drivers/block/pktcdvd.c                      |   11 ++-
 drivers/block/ps3disk.c                      |    2 -
 drivers/block/ps3vram.c                      |    2 -
 drivers/block/rbd.c                          |    2 -
 drivers/block/rsxx/dma.c                     |    3 +
 drivers/block/umem.c                         |    2 -
 drivers/block/zram/zram_drv.c                |   10 +--
 drivers/dma/ste_dma40.c                      |    5 -
 drivers/iommu/amd_iommu.c                    |   21 ++++-
 drivers/iommu/intel-iommu.c                  |   26 +++++--
 drivers/iommu/iommu.c                        |    2 -
 drivers/md/bcache/btree.c                    |    4 +
 drivers/md/bcache/debug.c                    |    6 +-
 drivers/md/bcache/movinggc.c                 |    2 -
 drivers/md/bcache/request.c                  |    6 +-
 drivers/md/bcache/super.c                    |   10 +--
 drivers/md/bcache/util.c                     |    5 +
 drivers/md/bcache/writeback.c                |    2 -
 drivers/md/dm-crypt.c                        |   12 ++-
 drivers/md/dm-io.c                           |    2 -
 drivers/md/dm-verity.c                       |    2 -
 drivers/md/raid1.c                           |   50 +++++++------
 drivers/md/raid10.c                          |   38 +++++-----
 drivers/md/raid5.c                           |    6 +-
 drivers/mmc/card/queue.c                     |    4 +
 drivers/s390/block/dasd_diag.c               |    2 -
 drivers/s390/block/dasd_eckd.c               |   14 ++--
 drivers/s390/block/dasd_fba.c                |    6 +-
 drivers/s390/block/dcssblk.c                 |    2 -
 drivers/s390/block/scm_blk.c                 |    2 -
 drivers/s390/block/scm_blk_cluster.c         |    2 -
 drivers/s390/block/xpram.c                   |    2 -
 drivers/scsi/mpt2sas/mpt2sas_transport.c     |    6 +-
 drivers/scsi/mpt3sas/mpt3sas_transport.c     |    6 +-
 drivers/scsi/sd_dif.c                        |    4 +
 drivers/staging/android/ion/ion_chunk_heap.c |    4 +
 drivers/staging/lustre/lustre/llite/lloop.c  |    2 -
 drivers/xen/biomerge.c                       |    4 +
 drivers/xen/swiotlb-xen.c                    |   29 +++++--
 fs/btrfs/check-integrity.c                   |    6 +-
 fs/btrfs/compression.c                       |   12 ++-
 fs/btrfs/disk-io.c                           |    4 +
 fs/btrfs/extent_io.c                         |    8 +-
 fs/btrfs/file-item.c                         |    8 +-
 fs/btrfs/inode.c                             |   18 +++--
 fs/btrfs/raid56.c                            |    4 +
 fs/btrfs/volumes.c                           |    2 -
 fs/buffer.c                                  |    4 +
 fs/direct-io.c                               |    2 -
 fs/exofs/ore.c                               |    4 +
 fs/exofs/ore_raid.c                          |    2 -
 fs/ext4/page-io.c                            |    2 -
 fs/f2fs/data.c                               |    4 +
 fs/f2fs/segment.c                            |    2 -
 fs/gfs2/lops.c                               |    4 +
 fs/jfs/jfs_logmgr.c                          |    4 +
 fs/logfs/dev_bdev.c                          |   10 +--
 fs/mpage.c                                   |    2 -
 fs/splice.c                                  |    2 -
 include/asm-generic/dma-mapping-common.h     |   30 ++++++++
 include/asm-generic/memory_model.h           |    4 +
 include/asm-generic/scatterlist.h            |    6 ++
 include/crypto/scatterwalk.h                 |   10 +++
 include/linux/bio.h                          |   24 +++---
 include/linux/blk_types.h                    |   21 +++++
 include/linux/blkdev.h                       |    2 +
 include/linux/dma-debug.h                    |   23 +++++-
 include/linux/dma-mapping.h                  |    8 ++
 include/linux/scatterlist.h                  |  101 ++++++++++++++++++++++++--
 include/linux/swiotlb.h                      |    5 +
 kernel/power/block_io.c                      |    2 -
 lib/dma-debug.c                              |    4 +
 lib/swiotlb.c                                |   20 ++++-
 mm/iov_iter.c                                |   22 +++---
 mm/page_io.c                                 |    8 +-
 net/ceph/messenger.c                         |    2 -
 103 files changed, 658 insertions(+), 335 deletions(-)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
  2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
  2015-03-16 20:25 ` Dan Williams
@ 2015-03-16 20:25 ` Dan Williams
  2015-03-16 20:25   ` Dan Williams
  2015-03-16 23:05   ` Al Viro
  2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh
  2015-03-18 20:26 ` Andrew Morton
  3 siblings, 2 replies; 59+ messages in thread
From: Dan Williams @ 2015-03-16 20:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, axboe, riel, linux-nvdimm, Dave Hansen, linux-raid,
	mgorman, hch, linux-fsdevel, Matthew Wilcox

Carry a pfn in a bio_vec rather than a page in support of allowing
bio(s) to reference unmapped (not struct page backed) persistent memory.

As Dave Hansen points out, it would be unfortunate if we ended up with
less type safety after this conversion, so introduce __pfn_t.

Cc: Matthew Wilcox <willy@linux.intel.com>
[willy: use pfn_t]
[kvm: "no, use __pfn_t, we already stole pfn_t"]
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/bio.c                        |    1 +
 block/blk-integrity.c              |    4 ++--
 block/blk-merge.c                  |    6 +++---
 block/bounce.c                     |    2 +-
 drivers/md/bcache/btree.c          |    2 +-
 include/asm-generic/memory_model.h |    4 ++++
 include/linux/bio.h                |   20 +++++++++++---------
 include/linux/blk_types.h          |   14 +++++++++++---
 include/linux/scatterlist.h        |   16 ++++++++++++++++
 include/linux/swiotlb.h            |    1 +
 mm/iov_iter.c                      |   22 +++++++++++-----------
 mm/page_io.c                       |    2 +-
 12 files changed, 63 insertions(+), 31 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 7100fd6d5898..3d494e85e16d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -28,6 +28,7 @@
 #include <linux/mempool.h>
 #include <linux/workqueue.h>
 #include <linux/cgroup.h>
+#include <linux/scatterlist.h>
 
 #include <trace/events/block.h>
 
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 6c8b1d63e90b..34e53951a0d1 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -43,7 +43,7 @@ static const char *bi_unsupported_name = "unsupported";
  */
 int blk_rq_count_integrity_sg(struct request_queue *q, struct bio *bio)
 {
-	struct bio_vec iv, ivprv = { NULL };
+	struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
 	unsigned int segments = 0;
 	unsigned int seg_size = 0;
 	struct bvec_iter iter;
@@ -89,7 +89,7 @@ EXPORT_SYMBOL(blk_rq_count_integrity_sg);
 int blk_rq_map_integrity_sg(struct request_queue *q, struct bio *bio,
 			    struct scatterlist *sglist)
 {
-	struct bio_vec iv, ivprv = { NULL };
+	struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
 	struct scatterlist *sg = NULL;
 	unsigned int segments = 0;
 	struct bvec_iter iter;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 39bd9925c057..8420d553b8ef 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -13,7 +13,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 					     struct bio *bio,
 					     bool no_sg_merge)
 {
-	struct bio_vec bv, bvprv = { NULL };
+	struct bio_vec bv, bvprv = BIO_VEC_INIT(bvprv);
 	int cluster, high, highprv = 1;
 	unsigned int seg_size, nr_phys_segs;
 	struct bio *fbio, *bbio;
@@ -123,7 +123,7 @@ EXPORT_SYMBOL(blk_recount_segments);
 static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
 				   struct bio *nxt)
 {
-	struct bio_vec end_bv = { NULL }, nxt_bv;
+	struct bio_vec end_bv = BIO_VEC_INIT(end_bv), nxt_bv;
 	struct bvec_iter iter;
 
 	if (!blk_queue_cluster(q))
@@ -202,7 +202,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 			     struct scatterlist *sglist,
 			     struct scatterlist **sg)
 {
-	struct bio_vec bvec, bvprv = { NULL };
+	struct bio_vec bvec, bvprv = BIO_VEC_INIT(bvprv);
 	struct bvec_iter iter;
 	int nsegs, cluster;
 
diff --git a/block/bounce.c b/block/bounce.c
index 0390e44d6e1b..4a3098067c81 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -64,7 +64,7 @@ static void bounce_copy_vec(struct bio_vec *to, unsigned char *vfrom)
 #else /* CONFIG_HIGHMEM */
 
 #define bounce_copy_vec(to, vfrom)	\
-	memcpy(page_address((to)->bv_page) + (to)->bv_offset, vfrom, (to)->bv_len)
+	memcpy(page_address(bvec_page(to)) + (to)->bv_offset, vfrom, (to)->bv_len)
 
 #endif /* CONFIG_HIGHMEM */
 
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2e76e8b62902..36bbe29a806b 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -426,7 +426,7 @@ static void do_btree_node_write(struct btree *b)
 		void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
 
 		bio_for_each_segment_all(bv, b->bio, j)
-			memcpy(page_address(bv->bv_page),
+			memcpy(page_address(bvec_page(bv)),
 			       base + j * PAGE_SIZE, PAGE_SIZE);
 
 		bch_submit_bbio(b->bio, b->c, &k.key, 0);
diff --git a/include/asm-generic/memory_model.h b/include/asm-generic/memory_model.h
index 14909b0b9cae..e6c2fda25820 100644
--- a/include/asm-generic/memory_model.h
+++ b/include/asm-generic/memory_model.h
@@ -72,6 +72,10 @@
 #define page_to_pfn __page_to_pfn
 #define pfn_to_page __pfn_to_page
 
+typedef struct {
+	unsigned long pfn;
+} __pfn_t;
+
 #endif /* __ASSEMBLY__ */
 
 #endif
diff --git a/include/linux/bio.h b/include/linux/bio.h
index f6a2427980f3..f35c90d5fd4d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -63,8 +63,8 @@
  */
 #define __bvec_iter_bvec(bvec, iter)	(&(bvec)[(iter).bi_idx])
 
-#define bvec_iter_page(bvec, iter)				\
-	(__bvec_iter_bvec((bvec), (iter))->bv_page)
+#define bvec_iter_pfn(bvec, iter)				\
+	(__bvec_iter_bvec((bvec), (iter))->bv_pfn)
 
 #define bvec_iter_len(bvec, iter)				\
 	min((iter).bi_size,					\
@@ -75,7 +75,7 @@
 
 #define bvec_iter_bvec(bvec, iter)				\
 ((struct bio_vec) {						\
-	.bv_page	= bvec_iter_page((bvec), (iter)),	\
+	.bv_pfn		= bvec_iter_pfn((bvec), (iter)),	\
 	.bv_len		= bvec_iter_len((bvec), (iter)),	\
 	.bv_offset	= bvec_iter_offset((bvec), (iter)),	\
 })
@@ -83,14 +83,16 @@
 #define bio_iter_iovec(bio, iter)				\
 	bvec_iter_bvec((bio)->bi_io_vec, (iter))
 
-#define bio_iter_page(bio, iter)				\
-	bvec_iter_page((bio)->bi_io_vec, (iter))
+#define bio_iter_pfn(bio, iter)				\
+	bvec_iter_pfn((bio)->bi_io_vec, (iter))
 #define bio_iter_len(bio, iter)					\
 	bvec_iter_len((bio)->bi_io_vec, (iter))
 #define bio_iter_offset(bio, iter)				\
 	bvec_iter_offset((bio)->bi_io_vec, (iter))
 
-#define bio_page(bio)		bio_iter_page((bio), (bio)->bi_iter)
+#define bio_page(bio)	\
+		pfn_to_page((bio_iter_pfn((bio), (bio)->bi_iter)).pfn)
+#define bio_pfn(bio)		bio_iter_pfn((bio), (bio)->bi_iter)
 #define bio_offset(bio)		bio_iter_offset((bio), (bio)->bi_iter)
 #define bio_iovec(bio)		bio_iter_iovec((bio), (bio)->bi_iter)
 
@@ -150,8 +152,8 @@ static inline void *bio_data(struct bio *bio)
 /*
  * will die
  */
-#define bio_to_phys(bio)	(page_to_phys(bio_page((bio))) + (unsigned long) bio_offset((bio)))
-#define bvec_to_phys(bv)	(page_to_phys((bv)->bv_page) + (unsigned long) (bv)->bv_offset)
+#define bio_to_phys(bio)	(pfn_to_phys(bio_pfn((bio))) + (unsigned long) bio_offset((bio)))
+#define bvec_to_phys(bv)	(pfn_to_phys((bv)->bv_pfn) + (unsigned long) (bv)->bv_offset)
 
 /*
  * queues that have highmem support enabled may still need to revert to
@@ -160,7 +162,7 @@ static inline void *bio_data(struct bio *bio)
  * I/O completely on that queue (see ide-dma for example)
  */
 #define __bio_kmap_atomic(bio, iter)				\
-	(kmap_atomic(bio_iter_iovec((bio), (iter)).bv_page) +	\
+	(kmap_atomic(bio_iter_iovec((bio), bvec_page(iter)) +	\
 		bio_iter_iovec((bio), (iter)).bv_offset)
 
 #define __bio_kunmap_atomic(addr)	kunmap_atomic(addr)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 3193a0b7051f..7f63fa3e4fda 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -5,7 +5,9 @@
 #ifndef __LINUX_BLK_TYPES_H
 #define __LINUX_BLK_TYPES_H
 
+#include <linux/scatterlist.h>
 #include <linux/types.h>
+#include <asm/pgtable.h>
 
 struct bio_set;
 struct bio;
@@ -21,19 +23,25 @@ typedef void (bio_destructor_t) (struct bio *);
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
  */
 struct bio_vec {
-	struct page	*bv_page;
+	__pfn_t		bv_pfn;
 	unsigned int	bv_len;
 	unsigned int	bv_offset;
 };
 
+#define BIO_VEC_INIT(name) { .bv_pfn = { .pfn = 0 }, .bv_len = 0, \
+	.bv_offset = 0 }
+
+#define BIO_VEC(name) \
+	struct bio_vec name = BIO_VEC_INIT(name)
+
 static inline struct page *bvec_page(const struct bio_vec *bvec)
 {
-	return bvec->bv_page;
+	return pfn_to_page(bvec->bv_pfn.pfn);
 }
 
 static inline void bvec_set_page(struct bio_vec *bvec, struct page *page)
 {
-	bvec->bv_page = page;
+	bvec->bv_pfn = page_to_pfn_typed(page);
 }
 
 #ifdef CONFIG_BLOCK
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index ed8f9e70df9b..5a15b1ce3c9e 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -9,6 +9,22 @@
 #include <asm/scatterlist.h>
 #include <asm/io.h>
 
+#ifndef __pfn_to_phys
+#define __pfn_to_phys(pfn)      ((dma_addr_t)(pfn) << PAGE_SHIFT)
+#endif
+
+static inline dma_addr_t pfn_to_phys(__pfn_t pfn)
+{
+	return __pfn_to_phys(pfn.pfn);
+}
+
+static inline __pfn_t page_to_pfn_typed(struct page *page)
+{
+	__pfn_t pfn = { .pfn = page_to_pfn(page) };
+
+	return pfn;
+}
+
 struct sg_table {
 	struct scatterlist *sgl;	/* the list */
 	unsigned int nents;		/* number of mapped entries */
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index e7a018eaf3a2..dc3a94ce3b45 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -1,6 +1,7 @@
 #ifndef __LINUX_SWIOTLB_H
 #define __LINUX_SWIOTLB_H
 
+#include <linux/dma-direction.h>
 #include <linux/types.h>
 
 struct device;
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 827732047da1..be9a7c5b8703 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -61,7 +61,7 @@
 	__p = i->bvec;					\
 	__v.bv_len = min_t(size_t, n, __p->bv_len - skip);	\
 	if (likely(__v.bv_len)) {			\
-		__v.bv_page = __p->bv_page;		\
+		__v.bv_pfn = __p->bv_pfn;		\
 		__v.bv_offset = __p->bv_offset + skip; 	\
 		(void)(STEP);				\
 		skip += __v.bv_len;			\
@@ -72,7 +72,7 @@
 		__v.bv_len = min_t(size_t, n, __p->bv_len);	\
 		if (unlikely(!__v.bv_len))		\
 			continue;			\
-		__v.bv_page = __p->bv_page;		\
+		__v.bv_pfn = __p->bv_pfn;		\
 		__v.bv_offset = __p->bv_offset;		\
 		(void)(STEP);				\
 		skip = __v.bv_len;			\
@@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
 			       v.iov_len),
-		memcpy_to_page(v.bv_page, v.bv_offset,
+		memcpy_to_page(bvec_page(&v), v.bv_offset,
 			       (from += v.bv_len) - v.bv_len, v.bv_len),
 		memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len)
 	)
@@ -390,7 +390,7 @@ size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user((to += v.iov_len) - v.iov_len, v.iov_base,
 				 v.iov_len),
-		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+		memcpy_from_page((to += v.bv_len) - v.bv_len, bvec_page(&v),
 				 v.bv_offset, v.bv_len),
 		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
 	)
@@ -411,7 +411,7 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user_nocache((to += v.iov_len) - v.iov_len,
 					 v.iov_base, v.iov_len),
-		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+		memcpy_from_page((to += v.bv_len) - v.bv_len, bvec_page(&v),
 				 v.bv_offset, v.bv_len),
 		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
 	)
@@ -456,7 +456,7 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 
 	iterate_and_advance(i, bytes, v,
 		__clear_user(v.iov_base, v.iov_len),
-		memzero_page(v.bv_page, v.bv_offset, v.bv_len),
+		memzero_page(bvec_page(&v), v.bv_offset, v.bv_len),
 		memset(v.iov_base, 0, v.iov_len)
 	)
 
@@ -471,7 +471,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 	iterate_all_kinds(i, bytes, v,
 		__copy_from_user_inatomic((p += v.iov_len) - v.iov_len,
 					  v.iov_base, v.iov_len),
-		memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
+		memcpy_from_page((p += v.bv_len) - v.bv_len, bvec_page(&v),
 				 v.bv_offset, v.bv_len),
 		memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
 	)
@@ -570,7 +570,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 	0;}),({
 		/* can't be more than PAGE_SIZE */
 		*start = v.bv_offset;
-		get_page(*pages = v.bv_page);
+		get_page(*pages = bvec_page(&v));
 		return v.bv_len;
 	}),({
 		return -EFAULT;
@@ -624,7 +624,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		*pages = p = get_pages_array(1);
 		if (!p)
 			return -ENOMEM;
-		get_page(*p = v.bv_page);
+		get_page(*p = bvec_page(&v));
 		return v.bv_len;
 	}),({
 		return -EFAULT;
@@ -658,7 +658,7 @@ size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 		}
 		err ? v.iov_len : 0;
 	}), ({
-		char *p = kmap_atomic(v.bv_page);
+		char *p = kmap_atomic(bvec_page(&v));
 		next = csum_partial_copy_nocheck(p + v.bv_offset,
 						 (to += v.bv_len) - v.bv_len,
 						 v.bv_len, 0);
@@ -702,7 +702,7 @@ size_t csum_and_copy_to_iter(void *addr, size_t bytes, __wsum *csum,
 		}
 		err ? v.iov_len : 0;
 	}), ({
-		char *p = kmap_atomic(v.bv_page);
+		char *p = kmap_atomic(bvec_page(&v));
 		next = csum_partial_copy_nocheck((from += v.bv_len) - v.bv_len,
 						 p + v.bv_offset,
 						 v.bv_len, 0);
diff --git a/mm/page_io.c b/mm/page_io.c
index c540dbc6a9e5..b7c8d2c3f8f9 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -265,7 +265,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
 		struct bio_vec bv = {
-			.bv_page = page,
+			.bv_pfn = page_to_pfn_typed(page),
 			.bv_len  = PAGE_SIZE,
 			.bv_offset = 0
 		};


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
  2015-03-16 20:25 ` [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn Dan Williams
@ 2015-03-16 20:25   ` Dan Williams
  2015-03-16 23:05   ` Al Viro
  1 sibling, 0 replies; 59+ messages in thread
From: Dan Williams @ 2015-03-16 20:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, axboe, riel, linux-nvdimm, Dave Hansen, linux-raid,
	mgorman, hch, linux-fsdevel, Matthew Wilcox

Carry a pfn in a bio_vec rather than a page in support of allowing
bio(s) to reference unmapped (not struct page backed) persistent memory.

As Dave Hansen points out, it would be unfortunate if we ended up with
less type safety after this conversion, so introduce __pfn_t.

Cc: Matthew Wilcox <willy@linux.intel.com>
[willy: use pfn_t]
[kvm: "no, use __pfn_t, we already stole pfn_t"]
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/bio.c                        |    1 +
 block/blk-integrity.c              |    4 ++--
 block/blk-merge.c                  |    6 +++---
 block/bounce.c                     |    2 +-
 drivers/md/bcache/btree.c          |    2 +-
 include/asm-generic/memory_model.h |    4 ++++
 include/linux/bio.h                |   20 +++++++++++---------
 include/linux/blk_types.h          |   14 +++++++++++---
 include/linux/scatterlist.h        |   16 ++++++++++++++++
 include/linux/swiotlb.h            |    1 +
 mm/iov_iter.c                      |   22 +++++++++++-----------
 mm/page_io.c                       |    2 +-
 12 files changed, 63 insertions(+), 31 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 7100fd6d5898..3d494e85e16d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -28,6 +28,7 @@
 #include <linux/mempool.h>
 #include <linux/workqueue.h>
 #include <linux/cgroup.h>
+#include <linux/scatterlist.h>
 
 #include <trace/events/block.h>
 
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 6c8b1d63e90b..34e53951a0d1 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -43,7 +43,7 @@ static const char *bi_unsupported_name = "unsupported";
  */
 int blk_rq_count_integrity_sg(struct request_queue *q, struct bio *bio)
 {
-	struct bio_vec iv, ivprv = { NULL };
+	struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
 	unsigned int segments = 0;
 	unsigned int seg_size = 0;
 	struct bvec_iter iter;
@@ -89,7 +89,7 @@ EXPORT_SYMBOL(blk_rq_count_integrity_sg);
 int blk_rq_map_integrity_sg(struct request_queue *q, struct bio *bio,
 			    struct scatterlist *sglist)
 {
-	struct bio_vec iv, ivprv = { NULL };
+	struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
 	struct scatterlist *sg = NULL;
 	unsigned int segments = 0;
 	struct bvec_iter iter;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 39bd9925c057..8420d553b8ef 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -13,7 +13,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 					     struct bio *bio,
 					     bool no_sg_merge)
 {
-	struct bio_vec bv, bvprv = { NULL };
+	struct bio_vec bv, bvprv = BIO_VEC_INIT(bvprv);
 	int cluster, high, highprv = 1;
 	unsigned int seg_size, nr_phys_segs;
 	struct bio *fbio, *bbio;
@@ -123,7 +123,7 @@ EXPORT_SYMBOL(blk_recount_segments);
 static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
 				   struct bio *nxt)
 {
-	struct bio_vec end_bv = { NULL }, nxt_bv;
+	struct bio_vec end_bv = BIO_VEC_INIT(end_bv), nxt_bv;
 	struct bvec_iter iter;
 
 	if (!blk_queue_cluster(q))
@@ -202,7 +202,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 			     struct scatterlist *sglist,
 			     struct scatterlist **sg)
 {
-	struct bio_vec bvec, bvprv = { NULL };
+	struct bio_vec bvec, bvprv = BIO_VEC_INIT(bvprv);
 	struct bvec_iter iter;
 	int nsegs, cluster;
 
diff --git a/block/bounce.c b/block/bounce.c
index 0390e44d6e1b..4a3098067c81 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -64,7 +64,7 @@ static void bounce_copy_vec(struct bio_vec *to, unsigned char *vfrom)
 #else /* CONFIG_HIGHMEM */
 
 #define bounce_copy_vec(to, vfrom)	\
-	memcpy(page_address((to)->bv_page) + (to)->bv_offset, vfrom, (to)->bv_len)
+	memcpy(page_address(bvec_page(to)) + (to)->bv_offset, vfrom, (to)->bv_len)
 
 #endif /* CONFIG_HIGHMEM */
 
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2e76e8b62902..36bbe29a806b 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -426,7 +426,7 @@ static void do_btree_node_write(struct btree *b)
 		void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
 
 		bio_for_each_segment_all(bv, b->bio, j)
-			memcpy(page_address(bv->bv_page),
+			memcpy(page_address(bvec_page(bv)),
 			       base + j * PAGE_SIZE, PAGE_SIZE);
 
 		bch_submit_bbio(b->bio, b->c, &k.key, 0);
diff --git a/include/asm-generic/memory_model.h b/include/asm-generic/memory_model.h
index 14909b0b9cae..e6c2fda25820 100644
--- a/include/asm-generic/memory_model.h
+++ b/include/asm-generic/memory_model.h
@@ -72,6 +72,10 @@
 #define page_to_pfn __page_to_pfn
 #define pfn_to_page __pfn_to_page
 
+typedef struct {
+	unsigned long pfn;
+} __pfn_t;
+
 #endif /* __ASSEMBLY__ */
 
 #endif
diff --git a/include/linux/bio.h b/include/linux/bio.h
index f6a2427980f3..f35c90d5fd4d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -63,8 +63,8 @@
  */
 #define __bvec_iter_bvec(bvec, iter)	(&(bvec)[(iter).bi_idx])
 
-#define bvec_iter_page(bvec, iter)				\
-	(__bvec_iter_bvec((bvec), (iter))->bv_page)
+#define bvec_iter_pfn(bvec, iter)				\
+	(__bvec_iter_bvec((bvec), (iter))->bv_pfn)
 
 #define bvec_iter_len(bvec, iter)				\
 	min((iter).bi_size,					\
@@ -75,7 +75,7 @@
 
 #define bvec_iter_bvec(bvec, iter)				\
 ((struct bio_vec) {						\
-	.bv_page	= bvec_iter_page((bvec), (iter)),	\
+	.bv_pfn		= bvec_iter_pfn((bvec), (iter)),	\
 	.bv_len		= bvec_iter_len((bvec), (iter)),	\
 	.bv_offset	= bvec_iter_offset((bvec), (iter)),	\
 })
@@ -83,14 +83,16 @@
 #define bio_iter_iovec(bio, iter)				\
 	bvec_iter_bvec((bio)->bi_io_vec, (iter))
 
-#define bio_iter_page(bio, iter)				\
-	bvec_iter_page((bio)->bi_io_vec, (iter))
+#define bio_iter_pfn(bio, iter)				\
+	bvec_iter_pfn((bio)->bi_io_vec, (iter))
 #define bio_iter_len(bio, iter)					\
 	bvec_iter_len((bio)->bi_io_vec, (iter))
 #define bio_iter_offset(bio, iter)				\
 	bvec_iter_offset((bio)->bi_io_vec, (iter))
 
-#define bio_page(bio)		bio_iter_page((bio), (bio)->bi_iter)
+#define bio_page(bio)	\
+		pfn_to_page((bio_iter_pfn((bio), (bio)->bi_iter)).pfn)
+#define bio_pfn(bio)		bio_iter_pfn((bio), (bio)->bi_iter)
 #define bio_offset(bio)		bio_iter_offset((bio), (bio)->bi_iter)
 #define bio_iovec(bio)		bio_iter_iovec((bio), (bio)->bi_iter)
 
@@ -150,8 +152,8 @@ static inline void *bio_data(struct bio *bio)
 /*
  * will die
  */
-#define bio_to_phys(bio)	(page_to_phys(bio_page((bio))) + (unsigned long) bio_offset((bio)))
-#define bvec_to_phys(bv)	(page_to_phys((bv)->bv_page) + (unsigned long) (bv)->bv_offset)
+#define bio_to_phys(bio)	(pfn_to_phys(bio_pfn((bio))) + (unsigned long) bio_offset((bio)))
+#define bvec_to_phys(bv)	(pfn_to_phys((bv)->bv_pfn) + (unsigned long) (bv)->bv_offset)
 
 /*
  * queues that have highmem support enabled may still need to revert to
@@ -160,7 +162,7 @@ static inline void *bio_data(struct bio *bio)
  * I/O completely on that queue (see ide-dma for example)
  */
 #define __bio_kmap_atomic(bio, iter)				\
-	(kmap_atomic(bio_iter_iovec((bio), (iter)).bv_page) +	\
+	(kmap_atomic(bio_iter_iovec((bio), bvec_page(iter)) +	\
 		bio_iter_iovec((bio), (iter)).bv_offset)
 
 #define __bio_kunmap_atomic(addr)	kunmap_atomic(addr)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 3193a0b7051f..7f63fa3e4fda 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -5,7 +5,9 @@
 #ifndef __LINUX_BLK_TYPES_H
 #define __LINUX_BLK_TYPES_H
 
+#include <linux/scatterlist.h>
 #include <linux/types.h>
+#include <asm/pgtable.h>
 
 struct bio_set;
 struct bio;
@@ -21,19 +23,25 @@ typedef void (bio_destructor_t) (struct bio *);
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
  */
 struct bio_vec {
-	struct page	*bv_page;
+	__pfn_t		bv_pfn;
 	unsigned int	bv_len;
 	unsigned int	bv_offset;
 };
 
+#define BIO_VEC_INIT(name) { .bv_pfn = { .pfn = 0 }, .bv_len = 0, \
+	.bv_offset = 0 }
+
+#define BIO_VEC(name) \
+	struct bio_vec name = BIO_VEC_INIT(name)
+
 static inline struct page *bvec_page(const struct bio_vec *bvec)
 {
-	return bvec->bv_page;
+	return pfn_to_page(bvec->bv_pfn.pfn);
 }
 
 static inline void bvec_set_page(struct bio_vec *bvec, struct page *page)
 {
-	bvec->bv_page = page;
+	bvec->bv_pfn = page_to_pfn_typed(page);
 }
 
 #ifdef CONFIG_BLOCK
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index ed8f9e70df9b..5a15b1ce3c9e 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -9,6 +9,22 @@
 #include <asm/scatterlist.h>
 #include <asm/io.h>
 
+#ifndef __pfn_to_phys
+#define __pfn_to_phys(pfn)      ((dma_addr_t)(pfn) << PAGE_SHIFT)
+#endif
+
+static inline dma_addr_t pfn_to_phys(__pfn_t pfn)
+{
+	return __pfn_to_phys(pfn.pfn);
+}
+
+static inline __pfn_t page_to_pfn_typed(struct page *page)
+{
+	__pfn_t pfn = { .pfn = page_to_pfn(page) };
+
+	return pfn;
+}
+
 struct sg_table {
 	struct scatterlist *sgl;	/* the list */
 	unsigned int nents;		/* number of mapped entries */
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index e7a018eaf3a2..dc3a94ce3b45 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -1,6 +1,7 @@
 #ifndef __LINUX_SWIOTLB_H
 #define __LINUX_SWIOTLB_H
 
+#include <linux/dma-direction.h>
 #include <linux/types.h>
 
 struct device;
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 827732047da1..be9a7c5b8703 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -61,7 +61,7 @@
 	__p = i->bvec;					\
 	__v.bv_len = min_t(size_t, n, __p->bv_len - skip);	\
 	if (likely(__v.bv_len)) {			\
-		__v.bv_page = __p->bv_page;		\
+		__v.bv_pfn = __p->bv_pfn;		\
 		__v.bv_offset = __p->bv_offset + skip; 	\
 		(void)(STEP);				\
 		skip += __v.bv_len;			\
@@ -72,7 +72,7 @@
 		__v.bv_len = min_t(size_t, n, __p->bv_len);	\
 		if (unlikely(!__v.bv_len))		\
 			continue;			\
-		__v.bv_page = __p->bv_page;		\
+		__v.bv_pfn = __p->bv_pfn;		\
 		__v.bv_offset = __p->bv_offset;		\
 		(void)(STEP);				\
 		skip = __v.bv_len;			\
@@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
 			       v.iov_len),
-		memcpy_to_page(v.bv_page, v.bv_offset,
+		memcpy_to_page(bvec_page(&v), v.bv_offset,
 			       (from += v.bv_len) - v.bv_len, v.bv_len),
 		memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len)
 	)
@@ -390,7 +390,7 @@ size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user((to += v.iov_len) - v.iov_len, v.iov_base,
 				 v.iov_len),
-		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+		memcpy_from_page((to += v.bv_len) - v.bv_len, bvec_page(&v),
 				 v.bv_offset, v.bv_len),
 		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
 	)
@@ -411,7 +411,7 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user_nocache((to += v.iov_len) - v.iov_len,
 					 v.iov_base, v.iov_len),
-		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+		memcpy_from_page((to += v.bv_len) - v.bv_len, bvec_page(&v),
 				 v.bv_offset, v.bv_len),
 		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
 	)
@@ -456,7 +456,7 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 
 	iterate_and_advance(i, bytes, v,
 		__clear_user(v.iov_base, v.iov_len),
-		memzero_page(v.bv_page, v.bv_offset, v.bv_len),
+		memzero_page(bvec_page(&v), v.bv_offset, v.bv_len),
 		memset(v.iov_base, 0, v.iov_len)
 	)
 
@@ -471,7 +471,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 	iterate_all_kinds(i, bytes, v,
 		__copy_from_user_inatomic((p += v.iov_len) - v.iov_len,
 					  v.iov_base, v.iov_len),
-		memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
+		memcpy_from_page((p += v.bv_len) - v.bv_len, bvec_page(&v),
 				 v.bv_offset, v.bv_len),
 		memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
 	)
@@ -570,7 +570,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 	0;}),({
 		/* can't be more than PAGE_SIZE */
 		*start = v.bv_offset;
-		get_page(*pages = v.bv_page);
+		get_page(*pages = bvec_page(&v));
 		return v.bv_len;
 	}),({
 		return -EFAULT;
@@ -624,7 +624,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		*pages = p = get_pages_array(1);
 		if (!p)
 			return -ENOMEM;
-		get_page(*p = v.bv_page);
+		get_page(*p = bvec_page(&v));
 		return v.bv_len;
 	}),({
 		return -EFAULT;
@@ -658,7 +658,7 @@ size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 		}
 		err ? v.iov_len : 0;
 	}), ({
-		char *p = kmap_atomic(v.bv_page);
+		char *p = kmap_atomic(bvec_page(&v));
 		next = csum_partial_copy_nocheck(p + v.bv_offset,
 						 (to += v.bv_len) - v.bv_len,
 						 v.bv_len, 0);
@@ -702,7 +702,7 @@ size_t csum_and_copy_to_iter(void *addr, size_t bytes, __wsum *csum,
 		}
 		err ? v.iov_len : 0;
 	}), ({
-		char *p = kmap_atomic(v.bv_page);
+		char *p = kmap_atomic(bvec_page(&v));
 		next = csum_partial_copy_nocheck((from += v.bv_len) - v.bv_len,
 						 p + v.bv_offset,
 						 v.bv_len, 0);
diff --git a/mm/page_io.c b/mm/page_io.c
index c540dbc6a9e5..b7c8d2c3f8f9 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -265,7 +265,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
 		struct bio_vec bv = {
-			.bv_page = page,
+			.bv_pfn = page_to_pfn_typed(page),
 			.bv_len  = PAGE_SIZE,
 			.bv_offset = 0
 		};


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
  2015-03-16 20:25 ` [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn Dan Williams
  2015-03-16 20:25   ` Dan Williams
@ 2015-03-16 23:05   ` Al Viro
  2015-03-16 23:05     ` Al Viro
  2015-03-17 13:02     ` Matthew Wilcox
  1 sibling, 2 replies; 59+ messages in thread
From: Al Viro @ 2015-03-16 23:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, linux-arch, axboe, riel, linux-nvdimm, Dave Hansen,
	linux-raid, mgorman, hch, linux-fsdevel, Matthew Wilcox

> diff --git a/mm/iov_iter.c b/mm/iov_iter.c
> index 827732047da1..be9a7c5b8703 100644
> --- a/mm/iov_iter.c
> +++ b/mm/iov_iter.c
> @@ -61,7 +61,7 @@
>  	__p = i->bvec;					\
>  	__v.bv_len = min_t(size_t, n, __p->bv_len - skip);	\
>  	if (likely(__v.bv_len)) {			\
> -		__v.bv_page = __p->bv_page;		\
> +		__v.bv_pfn = __p->bv_pfn;		\
>  		__v.bv_offset = __p->bv_offset + skip; 	\
>  		(void)(STEP);				\
>  		skip += __v.bv_len;			\
> @@ -72,7 +72,7 @@
>  		__v.bv_len = min_t(size_t, n, __p->bv_len);	\
>  		if (unlikely(!__v.bv_len))		\
>  			continue;			\
> -		__v.bv_page = __p->bv_page;		\
> +		__v.bv_pfn = __p->bv_pfn;		\
>  		__v.bv_offset = __p->bv_offset;		\
>  		(void)(STEP);				\
>  		skip = __v.bv_len;			\
> @@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
>  	iterate_and_advance(i, bytes, v,
>  		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
>  			       v.iov_len),
> -		memcpy_to_page(v.bv_page, v.bv_offset,
> +		memcpy_to_page(bvec_page(&v), v.bv_offset,

How had memcpy_to_page(NULL, ...) worked for you?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
  2015-03-16 23:05   ` Al Viro
@ 2015-03-16 23:05     ` Al Viro
  2015-03-17 13:02     ` Matthew Wilcox
  1 sibling, 0 replies; 59+ messages in thread
From: Al Viro @ 2015-03-16 23:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, linux-arch, axboe, riel, linux-nvdimm, Dave Hansen,
	linux-raid, mgorman, hch, linux-fsdevel, Matthew Wilcox

> diff --git a/mm/iov_iter.c b/mm/iov_iter.c
> index 827732047da1..be9a7c5b8703 100644
> --- a/mm/iov_iter.c
> +++ b/mm/iov_iter.c
> @@ -61,7 +61,7 @@
>  	__p = i->bvec;					\
>  	__v.bv_len = min_t(size_t, n, __p->bv_len - skip);	\
>  	if (likely(__v.bv_len)) {			\
> -		__v.bv_page = __p->bv_page;		\
> +		__v.bv_pfn = __p->bv_pfn;		\
>  		__v.bv_offset = __p->bv_offset + skip; 	\
>  		(void)(STEP);				\
>  		skip += __v.bv_len;			\
> @@ -72,7 +72,7 @@
>  		__v.bv_len = min_t(size_t, n, __p->bv_len);	\
>  		if (unlikely(!__v.bv_len))		\
>  			continue;			\
> -		__v.bv_page = __p->bv_page;		\
> +		__v.bv_pfn = __p->bv_pfn;		\
>  		__v.bv_offset = __p->bv_offset;		\
>  		(void)(STEP);				\
>  		skip = __v.bv_len;			\
> @@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
>  	iterate_and_advance(i, bytes, v,
>  		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
>  			       v.iov_len),
> -		memcpy_to_page(v.bv_page, v.bv_offset,
> +		memcpy_to_page(bvec_page(&v), v.bv_offset,

How had memcpy_to_page(NULL, ...) worked for you?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
  2015-03-16 23:05   ` Al Viro
  2015-03-16 23:05     ` Al Viro
@ 2015-03-17 13:02     ` Matthew Wilcox
  2015-03-17 13:02       ` Matthew Wilcox
  2015-03-17 15:53       ` Dan Williams
  1 sibling, 2 replies; 59+ messages in thread
From: Matthew Wilcox @ 2015-03-17 13:02 UTC (permalink / raw)
  To: Al Viro
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, riel, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel

On Mon, Mar 16, 2015 at 11:05:33PM +0000, Al Viro wrote:
> > diff --git a/mm/iov_iter.c b/mm/iov_iter.c
> > index 827732047da1..be9a7c5b8703 100644
> > --- a/mm/iov_iter.c
> > +++ b/mm/iov_iter.c
> > @@ -61,7 +61,7 @@
> >  	__p = i->bvec;					\
> >  	__v.bv_len = min_t(size_t, n, __p->bv_len - skip);	\
> >  	if (likely(__v.bv_len)) {			\
> > -		__v.bv_page = __p->bv_page;		\
> > +		__v.bv_pfn = __p->bv_pfn;		\
> >  		__v.bv_offset = __p->bv_offset + skip; 	\
> >  		(void)(STEP);				\
> >  		skip += __v.bv_len;			\
> > @@ -72,7 +72,7 @@
> >  		__v.bv_len = min_t(size_t, n, __p->bv_len);	\
> >  		if (unlikely(!__v.bv_len))		\
> >  			continue;			\
> > -		__v.bv_page = __p->bv_page;		\
> > +		__v.bv_pfn = __p->bv_pfn;		\
> >  		__v.bv_offset = __p->bv_offset;		\
> >  		(void)(STEP);				\
> >  		skip = __v.bv_len;			\
> > @@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
> >  	iterate_and_advance(i, bytes, v,
> >  		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
> >  			       v.iov_len),
> > -		memcpy_to_page(v.bv_page, v.bv_offset,
> > +		memcpy_to_page(bvec_page(&v), v.bv_offset,
> 
> How had memcpy_to_page(NULL, ...) worked for you?

 static inline struct page *bvec_page(const struct bio_vec *bvec)
 {
-       return bvec->bv_page;
+       return pfn_to_page(bvec->bv_pfn.pfn);
 }

(yes, more work to be done here to make copy_to_iter work to a bvec that
is actually targetting a page-less address, but these are RFC patches
showing the direction we're heading in while keeping current code working)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
  2015-03-17 13:02     ` Matthew Wilcox
@ 2015-03-17 13:02       ` Matthew Wilcox
  2015-03-17 15:53       ` Dan Williams
  1 sibling, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2015-03-17 13:02 UTC (permalink / raw)
  To: Al Viro
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, riel, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel

On Mon, Mar 16, 2015 at 11:05:33PM +0000, Al Viro wrote:
> > diff --git a/mm/iov_iter.c b/mm/iov_iter.c
> > index 827732047da1..be9a7c5b8703 100644
> > --- a/mm/iov_iter.c
> > +++ b/mm/iov_iter.c
> > @@ -61,7 +61,7 @@
> >  	__p = i->bvec;					\
> >  	__v.bv_len = min_t(size_t, n, __p->bv_len - skip);	\
> >  	if (likely(__v.bv_len)) {			\
> > -		__v.bv_page = __p->bv_page;		\
> > +		__v.bv_pfn = __p->bv_pfn;		\
> >  		__v.bv_offset = __p->bv_offset + skip; 	\
> >  		(void)(STEP);				\
> >  		skip += __v.bv_len;			\
> > @@ -72,7 +72,7 @@
> >  		__v.bv_len = min_t(size_t, n, __p->bv_len);	\
> >  		if (unlikely(!__v.bv_len))		\
> >  			continue;			\
> > -		__v.bv_page = __p->bv_page;		\
> > +		__v.bv_pfn = __p->bv_pfn;		\
> >  		__v.bv_offset = __p->bv_offset;		\
> >  		(void)(STEP);				\
> >  		skip = __v.bv_len;			\
> > @@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
> >  	iterate_and_advance(i, bytes, v,
> >  		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
> >  			       v.iov_len),
> > -		memcpy_to_page(v.bv_page, v.bv_offset,
> > +		memcpy_to_page(bvec_page(&v), v.bv_offset,
> 
> How had memcpy_to_page(NULL, ...) worked for you?

 static inline struct page *bvec_page(const struct bio_vec *bvec)
 {
-       return bvec->bv_page;
+       return pfn_to_page(bvec->bv_pfn.pfn);
 }

(yes, more work to be done here to make copy_to_iter work to a bvec that
is actually targetting a page-less address, but these are RFC patches
showing the direction we're heading in while keeping current code working)


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
  2015-03-17 13:02     ` Matthew Wilcox
  2015-03-17 13:02       ` Matthew Wilcox
@ 2015-03-17 15:53       ` Dan Williams
  2015-03-17 15:53         ` Dan Williams
  1 sibling, 1 reply; 59+ messages in thread
From: Dan Williams @ 2015-03-17 15:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Al Viro, linux-kernel@vger.kernel.org, linux-arch, Jens Axboe,
	riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman,
	Christoph Hellwig, linux-fsdevel

On Tue, Mar 17, 2015 at 6:02 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Mon, Mar 16, 2015 at 11:05:33PM +0000, Al Viro wrote:
>> > diff --git a/mm/iov_iter.c b/mm/iov_iter.c
>> > index 827732047da1..be9a7c5b8703 100644
>> > --- a/mm/iov_iter.c
>> > +++ b/mm/iov_iter.c
>> > @@ -61,7 +61,7 @@
>> >     __p = i->bvec;                                  \
>> >     __v.bv_len = min_t(size_t, n, __p->bv_len - skip);      \
>> >     if (likely(__v.bv_len)) {                       \
>> > -           __v.bv_page = __p->bv_page;             \
>> > +           __v.bv_pfn = __p->bv_pfn;               \
>> >             __v.bv_offset = __p->bv_offset + skip;  \
>> >             (void)(STEP);                           \
>> >             skip += __v.bv_len;                     \
>> > @@ -72,7 +72,7 @@
>> >             __v.bv_len = min_t(size_t, n, __p->bv_len);     \
>> >             if (unlikely(!__v.bv_len))              \
>> >                     continue;                       \
>> > -           __v.bv_page = __p->bv_page;             \
>> > +           __v.bv_pfn = __p->bv_pfn;               \
>> >             __v.bv_offset = __p->bv_offset;         \
>> >             (void)(STEP);                           \
>> >             skip = __v.bv_len;                      \
>> > @@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
>> >     iterate_and_advance(i, bytes, v,
>> >             __copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
>> >                            v.iov_len),
>> > -           memcpy_to_page(v.bv_page, v.bv_offset,
>> > +           memcpy_to_page(bvec_page(&v), v.bv_offset,
>>
>> How had memcpy_to_page(NULL, ...) worked for you?
>
>  static inline struct page *bvec_page(const struct bio_vec *bvec)
>  {
> -       return bvec->bv_page;
> +       return pfn_to_page(bvec->bv_pfn.pfn);
>  }
>
> (yes, more work to be done here to make copy_to_iter work to a bvec that
> is actually targetting a page-less address, but these are RFC patches
> showing the direction we're heading in while keeping current code working)
>

Right, the next item to tackle is kmap() and kmap_atomic() before we
can start converting paths to be "native" pfn-only.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn
  2015-03-17 15:53       ` Dan Williams
@ 2015-03-17 15:53         ` Dan Williams
  0 siblings, 0 replies; 59+ messages in thread
From: Dan Williams @ 2015-03-17 15:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Al Viro, linux-kernel@vger.kernel.org, linux-arch, Jens Axboe,
	riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman,
	Christoph Hellwig, linux-fsdevel

On Tue, Mar 17, 2015 at 6:02 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Mon, Mar 16, 2015 at 11:05:33PM +0000, Al Viro wrote:
>> > diff --git a/mm/iov_iter.c b/mm/iov_iter.c
>> > index 827732047da1..be9a7c5b8703 100644
>> > --- a/mm/iov_iter.c
>> > +++ b/mm/iov_iter.c
>> > @@ -61,7 +61,7 @@
>> >     __p = i->bvec;                                  \
>> >     __v.bv_len = min_t(size_t, n, __p->bv_len - skip);      \
>> >     if (likely(__v.bv_len)) {                       \
>> > -           __v.bv_page = __p->bv_page;             \
>> > +           __v.bv_pfn = __p->bv_pfn;               \
>> >             __v.bv_offset = __p->bv_offset + skip;  \
>> >             (void)(STEP);                           \
>> >             skip += __v.bv_len;                     \
>> > @@ -72,7 +72,7 @@
>> >             __v.bv_len = min_t(size_t, n, __p->bv_len);     \
>> >             if (unlikely(!__v.bv_len))              \
>> >                     continue;                       \
>> > -           __v.bv_page = __p->bv_page;             \
>> > +           __v.bv_pfn = __p->bv_pfn;               \
>> >             __v.bv_offset = __p->bv_offset;         \
>> >             (void)(STEP);                           \
>> >             skip = __v.bv_len;                      \
>> > @@ -369,7 +369,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
>> >     iterate_and_advance(i, bytes, v,
>> >             __copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
>> >                            v.iov_len),
>> > -           memcpy_to_page(v.bv_page, v.bv_offset,
>> > +           memcpy_to_page(bvec_page(&v), v.bv_offset,
>>
>> How had memcpy_to_page(NULL, ...) worked for you?
>
>  static inline struct page *bvec_page(const struct bio_vec *bvec)
>  {
> -       return bvec->bv_page;
> +       return pfn_to_page(bvec->bv_pfn.pfn);
>  }
>
> (yes, more work to be done here to make copy_to_iter work to a bvec that
> is actually targetting a page-less address, but these are RFC patches
> showing the direction we're heading in while keeping current code working)
>

Right, the next item to tackle is kmap() and kmap_atomic() before we
can start converting paths to be "native" pfn-only.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
  2015-03-16 20:25 ` Dan Williams
  2015-03-16 20:25 ` [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn Dan Williams
@ 2015-03-18 10:47 ` Boaz Harrosh
  2015-03-18 10:47   ` Boaz Harrosh
                     ` (2 more replies)
  2015-03-18 20:26 ` Andrew Morton
  3 siblings, 3 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-18 10:47 UTC (permalink / raw)
  To: Dan Williams, linux-kernel, axboe, hch, Al Viro, Andrew Morton,
	Linus Torvalds
  Cc: linux-arch, riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman,
	linux-fsdevel, Matthew Wilcox

On 03/16/2015 10:25 PM, Dan Williams wrote:
> Avoid the impending disaster of requiring struct page coverage for what
> is expected to be ever increasing capacities of persistent memory.  

If you are saying "disaster", than we need to believe you. Or is there
a scientific proof for this.

Actually what you are proposing below, is the "real disaster".
(I do hope it is not impending)

> In conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the
> recently concluded Linux Storage Summit it became clear that struct page
> is not required in many places, it was simply convenient to re-use.
> 
> Introduce helpers and infrastructure to remove struct page usage where
> it is not necessary.  One use case for these changes is to implement a
> write-back-cache in persistent memory for software-RAID.  Another use
> case for the scatterlist changes is RDMA to a pfn-range.
> 
> This compiles and boots, but 0day-kbuild-robot coverage is needed before
> this set exits "RFC".  Obviously, the coccinelle script needs to be
> re-run on the block updates for kernel.next.  As is, this only includes
> the resulting auto-generated-patch against 4.0-rc3.
> 
> ---
> 
> Dan Williams (6):
>       block: add helpers for accessing a bio_vec page
>       block: convert bio_vec.bv_page to bv_pfn
>       dma-mapping: allow archs to optionally specify a ->map_pfn() operation
>       scatterlist: use sg_phys()
>       x86: support dma_map_pfn()
>       block: base support for pfn i/o
> 
> Matthew Wilcox (1):
>       scatterlist: support "page-less" (__pfn_t only) entries
> 
> 
>  arch/Kconfig                                 |    3 +
>  arch/arm/mm/dma-mapping.c                    |    2 -
>  arch/microblaze/kernel/dma.c                 |    2 -
>  arch/powerpc/sysdev/axonram.c                |    2 -
>  arch/x86/Kconfig                             |   12 +++
>  arch/x86/kernel/amd_gart_64.c                |   22 ++++--
>  arch/x86/kernel/pci-nommu.c                  |   22 ++++--
>  arch/x86/kernel/pci-swiotlb.c                |    4 +
>  arch/x86/pci/sta2x11-fixup.c                 |    4 +
>  arch/x86/xen/pci-swiotlb-xen.c               |    4 +
>  block/bio-integrity.c                        |    8 +-
>  block/bio.c                                  |   83 +++++++++++++++------
>  block/blk-core.c                             |    9 ++
>  block/blk-integrity.c                        |    7 +-
>  block/blk-lib.c                              |    2 -
>  block/blk-merge.c                            |   15 ++--
>  block/bounce.c                               |   26 +++----
>  drivers/block/aoe/aoecmd.c                   |    8 +-
>  drivers/block/brd.c                          |    2 -
>  drivers/block/drbd/drbd_bitmap.c             |    5 +
>  drivers/block/drbd/drbd_main.c               |    4 +
>  drivers/block/drbd/drbd_receiver.c           |    4 +
>  drivers/block/drbd/drbd_worker.c             |    3 +
>  drivers/block/floppy.c                       |    6 +-
>  drivers/block/loop.c                         |    8 +-
>  drivers/block/nbd.c                          |    8 +-
>  drivers/block/nvme-core.c                    |    2 -
>  drivers/block/pktcdvd.c                      |   11 ++-
>  drivers/block/ps3disk.c                      |    2 -
>  drivers/block/ps3vram.c                      |    2 -
>  drivers/block/rbd.c                          |    2 -
>  drivers/block/rsxx/dma.c                     |    3 +
>  drivers/block/umem.c                         |    2 -
>  drivers/block/zram/zram_drv.c                |   10 +--
>  drivers/dma/ste_dma40.c                      |    5 -
>  drivers/iommu/amd_iommu.c                    |   21 ++++-
>  drivers/iommu/intel-iommu.c                  |   26 +++++--
>  drivers/iommu/iommu.c                        |    2 -
>  drivers/md/bcache/btree.c                    |    4 +
>  drivers/md/bcache/debug.c                    |    6 +-
>  drivers/md/bcache/movinggc.c                 |    2 -
>  drivers/md/bcache/request.c                  |    6 +-
>  drivers/md/bcache/super.c                    |   10 +--
>  drivers/md/bcache/util.c                     |    5 +
>  drivers/md/bcache/writeback.c                |    2 -
>  drivers/md/dm-crypt.c                        |   12 ++-
>  drivers/md/dm-io.c                           |    2 -
>  drivers/md/dm-verity.c                       |    2 -
>  drivers/md/raid1.c                           |   50 +++++++------
>  drivers/md/raid10.c                          |   38 +++++-----
>  drivers/md/raid5.c                           |    6 +-
>  drivers/mmc/card/queue.c                     |    4 +
>  drivers/s390/block/dasd_diag.c               |    2 -
>  drivers/s390/block/dasd_eckd.c               |   14 ++--
>  drivers/s390/block/dasd_fba.c                |    6 +-
>  drivers/s390/block/dcssblk.c                 |    2 -
>  drivers/s390/block/scm_blk.c                 |    2 -
>  drivers/s390/block/scm_blk_cluster.c         |    2 -
>  drivers/s390/block/xpram.c                   |    2 -
>  drivers/scsi/mpt2sas/mpt2sas_transport.c     |    6 +-
>  drivers/scsi/mpt3sas/mpt3sas_transport.c     |    6 +-
>  drivers/scsi/sd_dif.c                        |    4 +
>  drivers/staging/android/ion/ion_chunk_heap.c |    4 +
>  drivers/staging/lustre/lustre/llite/lloop.c  |    2 -
>  drivers/xen/biomerge.c                       |    4 +
>  drivers/xen/swiotlb-xen.c                    |   29 +++++--
>  fs/btrfs/check-integrity.c                   |    6 +-
>  fs/btrfs/compression.c                       |   12 ++-
>  fs/btrfs/disk-io.c                           |    4 +
>  fs/btrfs/extent_io.c                         |    8 +-
>  fs/btrfs/file-item.c                         |    8 +-
>  fs/btrfs/inode.c                             |   18 +++--
>  fs/btrfs/raid56.c                            |    4 +
>  fs/btrfs/volumes.c                           |    2 -
>  fs/buffer.c                                  |    4 +
>  fs/direct-io.c                               |    2 -
>  fs/exofs/ore.c                               |    4 +
>  fs/exofs/ore_raid.c                          |    2 -
>  fs/ext4/page-io.c                            |    2 -
>  fs/f2fs/data.c                               |    4 +
>  fs/f2fs/segment.c                            |    2 -
>  fs/gfs2/lops.c                               |    4 +
>  fs/jfs/jfs_logmgr.c                          |    4 +
>  fs/logfs/dev_bdev.c                          |   10 +--
>  fs/mpage.c                                   |    2 -
>  fs/splice.c                                  |    2 -
>  include/asm-generic/dma-mapping-common.h     |   30 ++++++++
>  include/asm-generic/memory_model.h           |    4 +
>  include/asm-generic/scatterlist.h            |    6 ++
>  include/crypto/scatterwalk.h                 |   10 +++
>  include/linux/bio.h                          |   24 +++---
>  include/linux/blk_types.h                    |   21 +++++
>  include/linux/blkdev.h                       |    2 +
>  include/linux/dma-debug.h                    |   23 +++++-
>  include/linux/dma-mapping.h                  |    8 ++
>  include/linux/scatterlist.h                  |  101 ++++++++++++++++++++++++--
>  include/linux/swiotlb.h                      |    5 +
>  kernel/power/block_io.c                      |    2 -
>  lib/dma-debug.c                              |    4 +
>  lib/swiotlb.c                                |   20 ++++-
>  mm/iov_iter.c                                |   22 +++---
>  mm/page_io.c                                 |    8 +-
>  net/ceph/messenger.c                         |    2 -

God! Look at this endless list of files and it is only the very beginning.
It does not even work and touches only 10% of what will need to be touched
for this to work, and very very marginally at that. There will always be
"another subsystem" that will not work. For example NUMA how will you do
NUMA aware pmem? and this is just a simple example. (I'm saying NUMA
because our tests show a huge drop in performance if you do not do
NUMA aware allocation)

Al, Jens, Christoph Andrew. Think of the immediate stability nightmare and
the long term torture to maintain two code paths. Two set of tests, and
the combinatorial explosions of tests.

I'm not the one afraid of hard work, if it was for a good cause, but for what?
really for what? The block layer, and RDMA, and networking, and spline, and what
ever the heck any one wants to imagine to do with pmem, already works perfectly
stable. right now!

We have set up RDMA pmem target without a single line of extra code,
and the RDMA client was trivial to write. We are sending down block layer
BIOs from pmem from day one, and even iscsi NFS and any kind of networking
directly from pmem, for almost a year now.

All it takes is two simple patches to mm that creates a pages-section
for pmem. The Kernel DOCs do says that a page is a construct that keeps track
of the sate of a physical page in memory. A memory mapped pmem is perfectly
that, and it has state that needs tracking just the same, Say that converted
block layers of yours now happens to be an iscsi and goes through the network
stack, it starts to need ref-counting, flags ... It has state.

Matthew Dan. I don't get it. Don't you guys at Intel have nothing to do? why
change half the Kernel? for what? to achieve what? all your wildest dreams
about pmem are right here already. What is it that you guys want to do with
this code that we cannot already do? And I can show you two tons of things
you cannot do with this code that we can already do. With two simple patches.

If it is stability that you are concerned with, "what if a pmem-page gets
to the wrong mm subsystem?" There are a couple small hardening patches and
and extra page-flag allocated, that can make the all thing foolproof. Though
up until now I have not encountered any problem.

>  103 files changed, 658 insertions(+), 335 deletions(-)

Please look, this is only the beginning. And does not even work. Let us come
back to our senses. As true hackers lets do the minimum effort to achieve new
heights. All it really takes to do all this is 2 little patches.

Cheers
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh
@ 2015-03-18 10:47   ` Boaz Harrosh
  2015-03-18 13:06   ` Matthew Wilcox
  2015-03-18 15:35   ` Dan Williams
  2 siblings, 0 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-18 10:47 UTC (permalink / raw)
  To: Dan Williams, linux-kernel, axboe, hch, Al Viro, Andrew Morton,
	Linus Torvalds
  Cc: linux-arch, riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman,
	linux-fsdevel, Matthew Wilcox

On 03/16/2015 10:25 PM, Dan Williams wrote:
> Avoid the impending disaster of requiring struct page coverage for what
> is expected to be ever increasing capacities of persistent memory.  

If you are saying "disaster", than we need to believe you. Or is there
a scientific proof for this.

Actually what you are proposing below, is the "real disaster".
(I do hope it is not impending)

> In conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the
> recently concluded Linux Storage Summit it became clear that struct page
> is not required in many places, it was simply convenient to re-use.
> 
> Introduce helpers and infrastructure to remove struct page usage where
> it is not necessary.  One use case for these changes is to implement a
> write-back-cache in persistent memory for software-RAID.  Another use
> case for the scatterlist changes is RDMA to a pfn-range.
> 
> This compiles and boots, but 0day-kbuild-robot coverage is needed before
> this set exits "RFC".  Obviously, the coccinelle script needs to be
> re-run on the block updates for kernel.next.  As is, this only includes
> the resulting auto-generated-patch against 4.0-rc3.
> 
> ---
> 
> Dan Williams (6):
>       block: add helpers for accessing a bio_vec page
>       block: convert bio_vec.bv_page to bv_pfn
>       dma-mapping: allow archs to optionally specify a ->map_pfn() operation
>       scatterlist: use sg_phys()
>       x86: support dma_map_pfn()
>       block: base support for pfn i/o
> 
> Matthew Wilcox (1):
>       scatterlist: support "page-less" (__pfn_t only) entries
> 
> 
>  arch/Kconfig                                 |    3 +
>  arch/arm/mm/dma-mapping.c                    |    2 -
>  arch/microblaze/kernel/dma.c                 |    2 -
>  arch/powerpc/sysdev/axonram.c                |    2 -
>  arch/x86/Kconfig                             |   12 +++
>  arch/x86/kernel/amd_gart_64.c                |   22 ++++--
>  arch/x86/kernel/pci-nommu.c                  |   22 ++++--
>  arch/x86/kernel/pci-swiotlb.c                |    4 +
>  arch/x86/pci/sta2x11-fixup.c                 |    4 +
>  arch/x86/xen/pci-swiotlb-xen.c               |    4 +
>  block/bio-integrity.c                        |    8 +-
>  block/bio.c                                  |   83 +++++++++++++++------
>  block/blk-core.c                             |    9 ++
>  block/blk-integrity.c                        |    7 +-
>  block/blk-lib.c                              |    2 -
>  block/blk-merge.c                            |   15 ++--
>  block/bounce.c                               |   26 +++----
>  drivers/block/aoe/aoecmd.c                   |    8 +-
>  drivers/block/brd.c                          |    2 -
>  drivers/block/drbd/drbd_bitmap.c             |    5 +
>  drivers/block/drbd/drbd_main.c               |    4 +
>  drivers/block/drbd/drbd_receiver.c           |    4 +
>  drivers/block/drbd/drbd_worker.c             |    3 +
>  drivers/block/floppy.c                       |    6 +-
>  drivers/block/loop.c                         |    8 +-
>  drivers/block/nbd.c                          |    8 +-
>  drivers/block/nvme-core.c                    |    2 -
>  drivers/block/pktcdvd.c                      |   11 ++-
>  drivers/block/ps3disk.c                      |    2 -
>  drivers/block/ps3vram.c                      |    2 -
>  drivers/block/rbd.c                          |    2 -
>  drivers/block/rsxx/dma.c                     |    3 +
>  drivers/block/umem.c                         |    2 -
>  drivers/block/zram/zram_drv.c                |   10 +--
>  drivers/dma/ste_dma40.c                      |    5 -
>  drivers/iommu/amd_iommu.c                    |   21 ++++-
>  drivers/iommu/intel-iommu.c                  |   26 +++++--
>  drivers/iommu/iommu.c                        |    2 -
>  drivers/md/bcache/btree.c                    |    4 +
>  drivers/md/bcache/debug.c                    |    6 +-
>  drivers/md/bcache/movinggc.c                 |    2 -
>  drivers/md/bcache/request.c                  |    6 +-
>  drivers/md/bcache/super.c                    |   10 +--
>  drivers/md/bcache/util.c                     |    5 +
>  drivers/md/bcache/writeback.c                |    2 -
>  drivers/md/dm-crypt.c                        |   12 ++-
>  drivers/md/dm-io.c                           |    2 -
>  drivers/md/dm-verity.c                       |    2 -
>  drivers/md/raid1.c                           |   50 +++++++------
>  drivers/md/raid10.c                          |   38 +++++-----
>  drivers/md/raid5.c                           |    6 +-
>  drivers/mmc/card/queue.c                     |    4 +
>  drivers/s390/block/dasd_diag.c               |    2 -
>  drivers/s390/block/dasd_eckd.c               |   14 ++--
>  drivers/s390/block/dasd_fba.c                |    6 +-
>  drivers/s390/block/dcssblk.c                 |    2 -
>  drivers/s390/block/scm_blk.c                 |    2 -
>  drivers/s390/block/scm_blk_cluster.c         |    2 -
>  drivers/s390/block/xpram.c                   |    2 -
>  drivers/scsi/mpt2sas/mpt2sas_transport.c     |    6 +-
>  drivers/scsi/mpt3sas/mpt3sas_transport.c     |    6 +-
>  drivers/scsi/sd_dif.c                        |    4 +
>  drivers/staging/android/ion/ion_chunk_heap.c |    4 +
>  drivers/staging/lustre/lustre/llite/lloop.c  |    2 -
>  drivers/xen/biomerge.c                       |    4 +
>  drivers/xen/swiotlb-xen.c                    |   29 +++++--
>  fs/btrfs/check-integrity.c                   |    6 +-
>  fs/btrfs/compression.c                       |   12 ++-
>  fs/btrfs/disk-io.c                           |    4 +
>  fs/btrfs/extent_io.c                         |    8 +-
>  fs/btrfs/file-item.c                         |    8 +-
>  fs/btrfs/inode.c                             |   18 +++--
>  fs/btrfs/raid56.c                            |    4 +
>  fs/btrfs/volumes.c                           |    2 -
>  fs/buffer.c                                  |    4 +
>  fs/direct-io.c                               |    2 -
>  fs/exofs/ore.c                               |    4 +
>  fs/exofs/ore_raid.c                          |    2 -
>  fs/ext4/page-io.c                            |    2 -
>  fs/f2fs/data.c                               |    4 +
>  fs/f2fs/segment.c                            |    2 -
>  fs/gfs2/lops.c                               |    4 +
>  fs/jfs/jfs_logmgr.c                          |    4 +
>  fs/logfs/dev_bdev.c                          |   10 +--
>  fs/mpage.c                                   |    2 -
>  fs/splice.c                                  |    2 -
>  include/asm-generic/dma-mapping-common.h     |   30 ++++++++
>  include/asm-generic/memory_model.h           |    4 +
>  include/asm-generic/scatterlist.h            |    6 ++
>  include/crypto/scatterwalk.h                 |   10 +++
>  include/linux/bio.h                          |   24 +++---
>  include/linux/blk_types.h                    |   21 +++++
>  include/linux/blkdev.h                       |    2 +
>  include/linux/dma-debug.h                    |   23 +++++-
>  include/linux/dma-mapping.h                  |    8 ++
>  include/linux/scatterlist.h                  |  101 ++++++++++++++++++++++++--
>  include/linux/swiotlb.h                      |    5 +
>  kernel/power/block_io.c                      |    2 -
>  lib/dma-debug.c                              |    4 +
>  lib/swiotlb.c                                |   20 ++++-
>  mm/iov_iter.c                                |   22 +++---
>  mm/page_io.c                                 |    8 +-
>  net/ceph/messenger.c                         |    2 -

God! Look at this endless list of files and it is only the very beginning.
It does not even work and touches only 10% of what will need to be touched
for this to work, and very very marginally at that. There will always be
"another subsystem" that will not work. For example NUMA how will you do
NUMA aware pmem? and this is just a simple example. (I'm saying NUMA
because our tests show a huge drop in performance if you do not do
NUMA aware allocation)

Al, Jens, Christoph Andrew. Think of the immediate stability nightmare and
the long term torture to maintain two code paths. Two set of tests, and
the combinatorial explosions of tests.

I'm not the one afraid of hard work, if it was for a good cause, but for what?
really for what? The block layer, and RDMA, and networking, and spline, and what
ever the heck any one wants to imagine to do with pmem, already works perfectly
stable. right now!

We have set up RDMA pmem target without a single line of extra code,
and the RDMA client was trivial to write. We are sending down block layer
BIOs from pmem from day one, and even iscsi NFS and any kind of networking
directly from pmem, for almost a year now.

All it takes is two simple patches to mm that creates a pages-section
for pmem. The Kernel DOCs do says that a page is a construct that keeps track
of the sate of a physical page in memory. A memory mapped pmem is perfectly
that, and it has state that needs tracking just the same, Say that converted
block layers of yours now happens to be an iscsi and goes through the network
stack, it starts to need ref-counting, flags ... It has state.

Matthew Dan. I don't get it. Don't you guys at Intel have nothing to do? why
change half the Kernel? for what? to achieve what? all your wildest dreams
about pmem are right here already. What is it that you guys want to do with
this code that we cannot already do? And I can show you two tons of things
you cannot do with this code that we can already do. With two simple patches.

If it is stability that you are concerned with, "what if a pmem-page gets
to the wrong mm subsystem?" There are a couple small hardening patches and
and extra page-flag allocated, that can make the all thing foolproof. Though
up until now I have not encountered any problem.

>  103 files changed, 658 insertions(+), 335 deletions(-)

Please look, this is only the beginning. And does not even work. Let us come
back to our senses. As true hackers lets do the minimum effort to achieve new
heights. All it really takes to do all this is 2 little patches.

Cheers
Boaz


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh
  2015-03-18 10:47   ` Boaz Harrosh
@ 2015-03-18 13:06   ` Matthew Wilcox
  2015-03-18 13:06     ` Matthew Wilcox
  2015-03-18 14:38     ` [Linux-nvdimm] " Boaz Harrosh
  2015-03-18 15:35   ` Dan Williams
  2 siblings, 2 replies; 59+ messages in thread
From: Matthew Wilcox @ 2015-03-18 13:06 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dan Williams, linux-kernel, axboe, hch, Al Viro, Andrew Morton,
	Linus Torvalds, linux-arch, riel, linux-nvdimm, Dave Hansen,
	linux-raid, mgorman, linux-fsdevel

On Wed, Mar 18, 2015 at 12:47:21PM +0200, Boaz Harrosh wrote:
> God! Look at this endless list of files and it is only the very beginning.
> It does not even work and touches only 10% of what will need to be touched
> for this to work, and very very marginally at that. There will always be
> "another subsystem" that will not work. For example NUMA how will you do
> NUMA aware pmem? and this is just a simple example. (I'm saying NUMA
> because our tests show a huge drop in performance if you do not do
> NUMA aware allocation)

You're very entertaining, but please, tone down your emails and stick
to facts.  The BIOS presents the persistent memory as one table entry
per NUMA node, so you get one block device per NUMA node.  There's no
mixing of memory from different NUMA nodes within a single filesystem,
unless you have a filesystem that uses multiple block devices.

> I'm not the one afraid of hard work, if it was for a good cause, but for what?
> really for what? The block layer, and RDMA, and networking, and spline, and what
> ever the heck any one wants to imagine to do with pmem, already works perfectly
> stable. right now!

The overhead.  Allocating a struct page for every 4k page in a 400GB DIMM
(the current capacity available from one NV-DIMM vendor) occupies 6.4GB.
That's an unacceptable amount of overhead.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-18 13:06   ` Matthew Wilcox
@ 2015-03-18 13:06     ` Matthew Wilcox
  2015-03-18 14:38     ` [Linux-nvdimm] " Boaz Harrosh
  1 sibling, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2015-03-18 13:06 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dan Williams, linux-kernel, axboe, hch, Al Viro, Andrew Morton,
	Linus Torvalds, linux-arch, riel, linux-nvdimm, Dave Hansen,
	linux-raid, mgorman, linux-fsdevel

On Wed, Mar 18, 2015 at 12:47:21PM +0200, Boaz Harrosh wrote:
> God! Look at this endless list of files and it is only the very beginning.
> It does not even work and touches only 10% of what will need to be touched
> for this to work, and very very marginally at that. There will always be
> "another subsystem" that will not work. For example NUMA how will you do
> NUMA aware pmem? and this is just a simple example. (I'm saying NUMA
> because our tests show a huge drop in performance if you do not do
> NUMA aware allocation)

You're very entertaining, but please, tone down your emails and stick
to facts.  The BIOS presents the persistent memory as one table entry
per NUMA node, so you get one block device per NUMA node.  There's no
mixing of memory from different NUMA nodes within a single filesystem,
unless you have a filesystem that uses multiple block devices.

> I'm not the one afraid of hard work, if it was for a good cause, but for what?
> really for what? The block layer, and RDMA, and networking, and spline, and what
> ever the heck any one wants to imagine to do with pmem, already works perfectly
> stable. right now!

The overhead.  Allocating a struct page for every 4k page in a 400GB DIMM
(the current capacity available from one NV-DIMM vendor) occupies 6.4GB.
That's an unacceptable amount of overhead.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-18 13:06   ` Matthew Wilcox
  2015-03-18 13:06     ` Matthew Wilcox
@ 2015-03-18 14:38     ` Boaz Harrosh
  2015-03-18 14:38       ` Boaz Harrosh
  2015-03-20 15:56       ` Rik van Riel
  1 sibling, 2 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-18 14:38 UTC (permalink / raw)
  To: Matthew Wilcox, Boaz Harrosh
  Cc: axboe, linux-arch, riel, linux-raid, linux-nvdimm, Dave Hansen,
	linux-kernel, hch, Linus Torvalds, Al Viro, linux-fsdevel,
	Andrew Morton, mgorman

On 03/18/2015 03:06 PM, Matthew Wilcox wrote:
> On Wed, Mar 18, 2015 at 12:47:21PM +0200, Boaz Harrosh wrote:
>> God! Look at this endless list of files and it is only the very beginning.
>> It does not even work and touches only 10% of what will need to be touched
>> for this to work, and very very marginally at that. There will always be
>> "another subsystem" that will not work. For example NUMA how will you do
>> NUMA aware pmem? and this is just a simple example. (I'm saying NUMA
>> because our tests show a huge drop in performance if you do not do
>> NUMA aware allocation)
> 
> You're very entertaining, but please, tone down your emails and stick
> to facts.  The BIOS presents the persistent memory as one table entry
> per NUMA node, so you get one block device per NUMA node.  There's no
> mixing of memory from different NUMA nodes within a single filesystem,
> unless you have a filesystem that uses multiple block devices.
> 

Not current BIOS, if we have them contiguous then they are presented as
one range. (DDR3 BIOS). But I agree it is a bug and in our configuration
we separate them to different pmem devices.

Yes I meant a "filesystem that uses multiple block devices"

>> I'm not the one afraid of hard work, if it was for a good cause, but for what?
>> really for what? The block layer, and RDMA, and networking, and spline, and what
>> ever the heck any one wants to imagine to do with pmem, already works perfectly
>> stable. right now!
> 
> The overhead.  Allocating a struct page for every 4k page in a 400GB DIMM
> (the current capacity available from one NV-DIMM vendor) occupies 6.4GB.
> That's an unacceptable amount of overhead.
> 

So lets fix the stacks to work nice with 2M pages. That said we can
allocate the struct page also from pmem if we need to. The fact remains
that we need state down the different stacks and this is the current
design over all.

I hate it that you introduce a double design a pfn-or-page and the
combinations of them. It is ugliness to much for my guts. I would
like a unified design. that runs all over the stack. Already we have
too much duplication to my taste, and would love to see more
unification and not more splitting.

But the most important for me is do we have to sacrifice the short
term to the long term. Such a massive change as you are proposing
it will take years. for a theoretical 400GB DIMM. What about the
4G DIMM now in peoples hands, need they wait?
(Though I still do not agree with your design)

I love the SPARSE model of the "section" and the page being it's
own identity relative to virtual & PFN of the section. We could
think of a much smaller page-struct that only takes a ref-count
and flags and have bigger page type for regular use, separate the
low common part of the page, lay down clear rules about its use,
and an high part that's per user. But let us think of a unified
design through out. (most members of page are accessed through
wrappers it would be relatively easy to split)

And let us not sacrifice the now for the "far tomorrow", we should
be able to do this incrementally, wasting more space now and saving
later.

[We can even invent a sizeless page you know how we encode
 the section ID directly into the 64 bit address of the page,
 So we can have a flag at the section that says this is a
 zero-size page section and the needed info is stored at
 the section object. But I still think you will need state
 per page and that we do need a minimal size.
]

[BTW: The only 400GB DIMM I know of is a real flash, and not directly
 mapped to CPU, OK maybe read only, but the erase/write makes it
 logical-to-physical managed and not directly accessed
]

And a personal note. I mean only to entertain. If any one feels
I "toned-up", please forgive me. I meant no such thing. As a rule
if I come across strong then please just laugh and don't take me
seriously. I only mean scientific soundness.

Thanks
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-18 14:38     ` [Linux-nvdimm] " Boaz Harrosh
@ 2015-03-18 14:38       ` Boaz Harrosh
  2015-03-20 15:56       ` Rik van Riel
  1 sibling, 0 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-18 14:38 UTC (permalink / raw)
  To: Matthew Wilcox, Boaz Harrosh
  Cc: axboe, linux-arch, riel, linux-raid, linux-nvdimm, Dave Hansen,
	linux-kernel, hch, Linus Torvalds, Al Viro, linux-fsdevel,
	Andrew Morton, mgorman

On 03/18/2015 03:06 PM, Matthew Wilcox wrote:
> On Wed, Mar 18, 2015 at 12:47:21PM +0200, Boaz Harrosh wrote:
>> God! Look at this endless list of files and it is only the very beginning.
>> It does not even work and touches only 10% of what will need to be touched
>> for this to work, and very very marginally at that. There will always be
>> "another subsystem" that will not work. For example NUMA how will you do
>> NUMA aware pmem? and this is just a simple example. (I'm saying NUMA
>> because our tests show a huge drop in performance if you do not do
>> NUMA aware allocation)
> 
> You're very entertaining, but please, tone down your emails and stick
> to facts.  The BIOS presents the persistent memory as one table entry
> per NUMA node, so you get one block device per NUMA node.  There's no
> mixing of memory from different NUMA nodes within a single filesystem,
> unless you have a filesystem that uses multiple block devices.
> 

Not current BIOS, if we have them contiguous then they are presented as
one range. (DDR3 BIOS). But I agree it is a bug and in our configuration
we separate them to different pmem devices.

Yes I meant a "filesystem that uses multiple block devices"

>> I'm not the one afraid of hard work, if it was for a good cause, but for what?
>> really for what? The block layer, and RDMA, and networking, and spline, and what
>> ever the heck any one wants to imagine to do with pmem, already works perfectly
>> stable. right now!
> 
> The overhead.  Allocating a struct page for every 4k page in a 400GB DIMM
> (the current capacity available from one NV-DIMM vendor) occupies 6.4GB.
> That's an unacceptable amount of overhead.
> 

So lets fix the stacks to work nice with 2M pages. That said we can
allocate the struct page also from pmem if we need to. The fact remains
that we need state down the different stacks and this is the current
design over all.

I hate it that you introduce a double design a pfn-or-page and the
combinations of them. It is ugliness to much for my guts. I would
like a unified design. that runs all over the stack. Already we have
too much duplication to my taste, and would love to see more
unification and not more splitting.

But the most important for me is do we have to sacrifice the short
term to the long term. Such a massive change as you are proposing
it will take years. for a theoretical 400GB DIMM. What about the
4G DIMM now in peoples hands, need they wait?
(Though I still do not agree with your design)

I love the SPARSE model of the "section" and the page being it's
own identity relative to virtual & PFN of the section. We could
think of a much smaller page-struct that only takes a ref-count
and flags and have bigger page type for regular use, separate the
low common part of the page, lay down clear rules about its use,
and an high part that's per user. But let us think of a unified
design through out. (most members of page are accessed through
wrappers it would be relatively easy to split)

And let us not sacrifice the now for the "far tomorrow", we should
be able to do this incrementally, wasting more space now and saving
later.

[We can even invent a sizeless page you know how we encode
 the section ID directly into the 64 bit address of the page,
 So we can have a flag at the section that says this is a
 zero-size page section and the needed info is stored at
 the section object. But I still think you will need state
 per page and that we do need a minimal size.
]

[BTW: The only 400GB DIMM I know of is a real flash, and not directly
 mapped to CPU, OK maybe read only, but the erase/write makes it
 logical-to-physical managed and not directly accessed
]

And a personal note. I mean only to entertain. If any one feels
I "toned-up", please forgive me. I meant no such thing. As a rule
if I come across strong then please just laugh and don't take me
seriously. I only mean scientific soundness.

Thanks
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh
  2015-03-18 10:47   ` Boaz Harrosh
  2015-03-18 13:06   ` Matthew Wilcox
@ 2015-03-18 15:35   ` Dan Williams
  2 siblings, 0 replies; 59+ messages in thread
From: Dan Williams @ 2015-03-18 15:35 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-kernel@vger.kernel.org, Jens Axboe, Christoph Hellwig,
	Al Viro, Andrew Morton, Linus Torvalds, linux-arch, riel,
	linux-nvdimm@lists.01.org, Dave Hansen, linux-raid, mgorman,
	linux-fsdevel, Matthew Wilcox

On Wed, Mar 18, 2015 at 3:47 AM, Boaz Harrosh <openosd@gmail.com> wrote:
> On 03/16/2015 10:25 PM, Dan Williams wrote:
>> Avoid the impending disaster of requiring struct page coverage for what
>> is expected to be ever increasing capacities of persistent memory.
>
> If you are saying "disaster", than we need to believe you. Or is there
> a scientific proof for this.

The same Moore's Law based extrapolation that Dave Chinner did to
determine that major feature development on XFS may cease in 5 - 7
years.  In Dave's words we're looking ahead to "lots and fast".  Given
the time scale of getting kernel changes out to end users in an
enterprise kernel update the "dynamic page struct allocation" approach
is already insufficient.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
                   ` (2 preceding siblings ...)
  2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh
@ 2015-03-18 20:26 ` Andrew Morton
  2015-03-19 13:43   ` Matthew Wilcox
  3 siblings, 1 reply; 59+ messages in thread
From: Andrew Morton @ 2015-03-18 20:26 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, linux-arch, axboe, riel, linux-nvdimm, Dave Hansen,
	linux-raid, mgorman, hch, linux-fsdevel, Matthew Wilcox

On Mon, 16 Mar 2015 16:25:25 -0400 Dan Williams <dan.j.williams@intel.com> wrote:

> Avoid the impending disaster of requiring struct page coverage for what
> is expected to be ever increasing capacities of persistent memory.  In
> conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the
> recently concluded Linux Storage Summit it became clear that struct page
> is not required in many places, it was simply convenient to re-use.
> 
> Introduce helpers and infrastructure to remove struct page usage where
> it is not necessary.  One use case for these changes is to implement a
> write-back-cache in persistent memory for software-RAID.  Another use
> case for the scatterlist changes is RDMA to a pfn-range.

Those use-cases sound very thin.  If that's all we have then I'd say
"find another way of implementing those things without creating
pageframes for persistent memory".

IOW, please tell us much much much more about the value of this change.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-18 20:26 ` Andrew Morton
@ 2015-03-19 13:43   ` Matthew Wilcox
  2015-03-19 13:43     ` Matthew Wilcox
                       ` (3 more replies)
  0 siblings, 4 replies; 59+ messages in thread
From: Matthew Wilcox @ 2015-03-19 13:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, riel, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel

On Wed, Mar 18, 2015 at 01:26:50PM -0700, Andrew Morton wrote:
> On Mon, 16 Mar 2015 16:25:25 -0400 Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Avoid the impending disaster of requiring struct page coverage for what
> > is expected to be ever increasing capacities of persistent memory.  In
> > conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the
> > recently concluded Linux Storage Summit it became clear that struct page
> > is not required in many places, it was simply convenient to re-use.
> > 
> > Introduce helpers and infrastructure to remove struct page usage where
> > it is not necessary.  One use case for these changes is to implement a
> > write-back-cache in persistent memory for software-RAID.  Another use
> > case for the scatterlist changes is RDMA to a pfn-range.
> 
> Those use-cases sound very thin.  If that's all we have then I'd say
> "find another way of implementing those things without creating
> pageframes for persistent memory".
> 
> IOW, please tell us much much much more about the value of this change.

Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
want to be able to do any kind of I/O directly to persistent memory,
and I think we do, we need to do one of:

1. Construct struct pages for persistent memory
1a. Permanently
1b. While the pages are under I/O
2. Teach the I/O layers to deal in PFNs instead of struct pages
3. Replace struct page with some other structure that can represent both
   DRAM and PMEM

I'm personally a fan of #3, and I was looking at the scatterlist as
my preferred data structure.  I now believe the scatterlist as it is
currently defined isn't sufficient, so we probably end up needing a new
data structure.  I think Dan's preferred method of replacing struct
pages with PFNs is actually less instrusive, but doesn't give us as
much advantage (an entirely new data structure would let us move to an
extent based system at the same time, instead of sticking with an array
of pages).  Clearly Boaz prefers 1a, which works well enough for the
8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.

What's your preference?  I guess option 0 is "force all I/O to go
through the page cache and then get copied", but that feels like a nasty
performance hit.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 13:43   ` Matthew Wilcox
@ 2015-03-19 13:43     ` Matthew Wilcox
  2015-03-19 15:54     ` [Linux-nvdimm] " Boaz Harrosh
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2015-03-19 13:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, riel, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel

On Wed, Mar 18, 2015 at 01:26:50PM -0700, Andrew Morton wrote:
> On Mon, 16 Mar 2015 16:25:25 -0400 Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Avoid the impending disaster of requiring struct page coverage for what
> > is expected to be ever increasing capacities of persistent memory.  In
> > conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the
> > recently concluded Linux Storage Summit it became clear that struct page
> > is not required in many places, it was simply convenient to re-use.
> > 
> > Introduce helpers and infrastructure to remove struct page usage where
> > it is not necessary.  One use case for these changes is to implement a
> > write-back-cache in persistent memory for software-RAID.  Another use
> > case for the scatterlist changes is RDMA to a pfn-range.
> 
> Those use-cases sound very thin.  If that's all we have then I'd say
> "find another way of implementing those things without creating
> pageframes for persistent memory".
> 
> IOW, please tell us much much much more about the value of this change.

Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
want to be able to do any kind of I/O directly to persistent memory,
and I think we do, we need to do one of:

1. Construct struct pages for persistent memory
1a. Permanently
1b. While the pages are under I/O
2. Teach the I/O layers to deal in PFNs instead of struct pages
3. Replace struct page with some other structure that can represent both
   DRAM and PMEM

I'm personally a fan of #3, and I was looking at the scatterlist as
my preferred data structure.  I now believe the scatterlist as it is
currently defined isn't sufficient, so we probably end up needing a new
data structure.  I think Dan's preferred method of replacing struct
pages with PFNs is actually less instrusive, but doesn't give us as
much advantage (an entirely new data structure would let us move to an
extent based system at the same time, instead of sticking with an array
of pages).  Clearly Boaz prefers 1a, which works well enough for the
8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.

What's your preference?  I guess option 0 is "force all I/O to go
through the page cache and then get copied", but that feels like a nasty
performance hit.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 13:43   ` Matthew Wilcox
  2015-03-19 13:43     ` Matthew Wilcox
@ 2015-03-19 15:54     ` Boaz Harrosh
  2015-03-19 15:54       ` Boaz Harrosh
  2015-03-19 19:59       ` Andrew Morton
  2015-03-19 18:17     ` Christoph Hellwig
  2015-03-20 16:21     ` Rik van Riel
  3 siblings, 2 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-19 15:54 UTC (permalink / raw)
  To: Matthew Wilcox, Andrew Morton
  Cc: linux-arch, axboe, riel, hch, linux-nvdimm, Dave Hansen,
	linux-kernel, linux-raid, mgorman, linux-fsdevel

On 03/19/2015 03:43 PM, Matthew Wilcox wrote:
<>
> 
> Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
> want to be able to do any kind of I/O directly to persistent memory,
> and I think we do, we need to do one of:
> 
> 1. Construct struct pages for persistent memory
> 1a. Permanently
> 1b. While the pages are under I/O
> 2. Teach the I/O layers to deal in PFNs instead of struct pages
> 3. Replace struct page with some other structure that can represent both
>    DRAM and PMEM
> 
> I'm personally a fan of #3, and I was looking at the scatterlist as
> my preferred data structure.  I now believe the scatterlist as it is
> currently defined isn't sufficient, so we probably end up needing a new
> data structure.  I think Dan's preferred method of replacing struct
> pages with PFNs is actually less instrusive, but doesn't give us as
> much advantage (an entirely new data structure would let us move to an
> extent based system at the same time, instead of sticking with an array
> of pages).  Clearly Boaz prefers 1a, which works well enough for the
> 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
> 
> What's your preference?  I guess option 0 is "force all I/O to go
> through the page cache and then get copied", but that feels like a nasty
> performance hit.

Thanks Matthew, you have summarized it perfectly.

I think #1b might have merit, as well. I have a very surgical small
"hack" that we can do with allocating on demand pages before IO.
It involves adding a new MEMORY_MODEL policy that is derived from
SPARSEMEM but lets you allocate individual pages on demand. And a new
type of page say call it GP_emulated_page.
(Tell me if you find this interesting. It is 1/117 in size of both
 #2 or #3)

In anyway please reconsider a configurable #1a for people that do
not mind sacrificing 1.2% of their pmem for real pages.

Even at 6G page-structs with 400G pmem, people would love some of the stuff
this gives them today. just few examples: direct_access from within a VM to
an host defined pmem, is trivial with no extra code with my two simple #1a
patches. RDMA memory brick targets, network shared memory FS and so on, the
list will always be bigger then any of #1b #2 or #3. Yes for people that
want to sacrifice the extra cost.

In the Kernel it was always about choice and diversity. And what does it
costs us. Nothing. Two small simple patches and a Kconfig option.
Note that I made it in such a way that if pmem is configured without
use of pages, then the mm code is *not* configured-in automatically.
We can even add a runtime option that even if #1a is enabled, for certain
pmem device may not want pages allocated. And so choose at runtime rather
than compile time.

I think this will only farther our cause and let people advance with
their research and development with great new ideas about use of pmem.
Then once there is a great demand for #1a and those large 512G devices
come out, we can go the #1b or #3 route and save them the extra 1.2%
memory, but once they have the appetite for it. (And Andrews question
becomes clear)

Our two ways need not be "either-or". They can be "have both". I think
choice is a good thing for us here. Even with #3 available #1a still has
merit in some configurations and they can co exist perfectly.

Please think about it?

Thanks
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 15:54     ` [Linux-nvdimm] " Boaz Harrosh
@ 2015-03-19 15:54       ` Boaz Harrosh
  2015-03-19 19:59       ` Andrew Morton
  1 sibling, 0 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-19 15:54 UTC (permalink / raw)
  To: Matthew Wilcox, Andrew Morton
  Cc: linux-arch, axboe, riel, hch, linux-nvdimm, Dave Hansen,
	linux-kernel, linux-raid, mgorman, linux-fsdevel

On 03/19/2015 03:43 PM, Matthew Wilcox wrote:
<>
> 
> Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
> want to be able to do any kind of I/O directly to persistent memory,
> and I think we do, we need to do one of:
> 
> 1. Construct struct pages for persistent memory
> 1a. Permanently
> 1b. While the pages are under I/O
> 2. Teach the I/O layers to deal in PFNs instead of struct pages
> 3. Replace struct page with some other structure that can represent both
>    DRAM and PMEM
> 
> I'm personally a fan of #3, and I was looking at the scatterlist as
> my preferred data structure.  I now believe the scatterlist as it is
> currently defined isn't sufficient, so we probably end up needing a new
> data structure.  I think Dan's preferred method of replacing struct
> pages with PFNs is actually less instrusive, but doesn't give us as
> much advantage (an entirely new data structure would let us move to an
> extent based system at the same time, instead of sticking with an array
> of pages).  Clearly Boaz prefers 1a, which works well enough for the
> 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
> 
> What's your preference?  I guess option 0 is "force all I/O to go
> through the page cache and then get copied", but that feels like a nasty
> performance hit.

Thanks Matthew, you have summarized it perfectly.

I think #1b might have merit, as well. I have a very surgical small
"hack" that we can do with allocating on demand pages before IO.
It involves adding a new MEMORY_MODEL policy that is derived from
SPARSEMEM but lets you allocate individual pages on demand. And a new
type of page say call it GP_emulated_page.
(Tell me if you find this interesting. It is 1/117 in size of both
 #2 or #3)

In anyway please reconsider a configurable #1a for people that do
not mind sacrificing 1.2% of their pmem for real pages.

Even at 6G page-structs with 400G pmem, people would love some of the stuff
this gives them today. just few examples: direct_access from within a VM to
an host defined pmem, is trivial with no extra code with my two simple #1a
patches. RDMA memory brick targets, network shared memory FS and so on, the
list will always be bigger then any of #1b #2 or #3. Yes for people that
want to sacrifice the extra cost.

In the Kernel it was always about choice and diversity. And what does it
costs us. Nothing. Two small simple patches and a Kconfig option.
Note that I made it in such a way that if pmem is configured without
use of pages, then the mm code is *not* configured-in automatically.
We can even add a runtime option that even if #1a is enabled, for certain
pmem device may not want pages allocated. And so choose at runtime rather
than compile time.

I think this will only farther our cause and let people advance with
their research and development with great new ideas about use of pmem.
Then once there is a great demand for #1a and those large 512G devices
come out, we can go the #1b or #3 route and save them the extra 1.2%
memory, but once they have the appetite for it. (And Andrews question
becomes clear)

Our two ways need not be "either-or". They can be "have both". I think
choice is a good thing for us here. Even with #3 available #1a still has
merit in some configurations and they can co exist perfectly.

Please think about it?

Thanks
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 13:43   ` Matthew Wilcox
  2015-03-19 13:43     ` Matthew Wilcox
  2015-03-19 15:54     ` [Linux-nvdimm] " Boaz Harrosh
@ 2015-03-19 18:17     ` Christoph Hellwig
  2015-03-19 19:31       ` Matthew Wilcox
  2015-03-22 16:46       ` Boaz Harrosh
  2015-03-20 16:21     ` Rik van Riel
  3 siblings, 2 replies; 59+ messages in thread
From: Christoph Hellwig @ 2015-03-19 18:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel

On Thu, Mar 19, 2015 at 09:43:13AM -0400, Matthew Wilcox wrote:
> Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
> want to be able to do any kind of I/O directly to persistent memory,
> and I think we do, we need to do one of:
> 
> 1. Construct struct pages for persistent memory
> 1a. Permanently
> 1b. While the pages are under I/O
> 2. Teach the I/O layers to deal in PFNs instead of struct pages
> 3. Replace struct page with some other structure that can represent both
>    DRAM and PMEM
> 
> I'm personally a fan of #3, and I was looking at the scatterlist as
> my preferred data structure.  I now believe the scatterlist as it is
> currently defined isn't sufficient, so we probably end up needing a new
> data structure.  I think Dan's preferred method of replacing struct
> pages with PFNs is actually less instrusive, but doesn't give us as
> much advantage (an entirely new data structure would let us move to an
> extent based system at the same time, instead of sticking with an array
> of pages).  Clearly Boaz prefers 1a, which works well enough for the
> 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
> 
> What's your preference?  I guess option 0 is "force all I/O to go
> through the page cache and then get copied", but that feels like a nasty
> performance hit.

In addition to the options there's also a time line.  At least for the
short term where we want to get something going 1a seems like the
absolutely be option.  It works perfectly fine for the lots of small
capacity dram-like nvdimms, and it works funtionally fine for the
special huge ones, although the resource use for it is highly annoying.
If it turns out to be too annoying we can also offer a no I/O possible
option for them in the short run.

In the long run option 2) sounds like a good plan to me, but not as a
parallel I/O path, but as the main one.  Doing so will in fact give us
options to experiment with 3).  Given that we're moving towards an
increasinly huge page using world replacing the good old struct page
with something extent-like and/or temporary might be needed for dram
as well in the future.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 18:17     ` Christoph Hellwig
@ 2015-03-19 19:31       ` Matthew Wilcox
  2015-03-22 16:46       ` Boaz Harrosh
  1 sibling, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2015-03-19 19:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman,
	linux-fsdevel

On Thu, Mar 19, 2015 at 11:17:25AM -0700, Christoph Hellwig wrote:
> On Thu, Mar 19, 2015 at 09:43:13AM -0400, Matthew Wilcox wrote:
> > 1. Construct struct pages for persistent memory
> > 1a. Permanently
> > 1b. While the pages are under I/O
> > 2. Teach the I/O layers to deal in PFNs instead of struct pages
> > 3. Replace struct page with some other structure that can represent both
> >    DRAM and PMEM
> 
> In addition to the options there's also a time line.  At least for the
> short term where we want to get something going 1a seems like the
> absolutely be option.  It works perfectly fine for the lots of small
(assuming "best option")
> capacity dram-like nvdimms, and it works funtionally fine for the
> special huge ones, although the resource use for it is highly annoying.
> If it turns out to be too annoying we can also offer a no I/O possible
> option for them in the short run.
> 
> In the long run option 2) sounds like a good plan to me, but not as a
> parallel I/O path, but as the main one.  Doing so will in fact give us
> options to experiment with 3).  Given that we're moving towards an
> increasinly huge page using world replacing the good old struct page
> with something extent-like and/or temporary might be needed for dram
> as well in the future.

Dan's patches don't actually make it a "parallel I/O path", that was
Boaz's mischaracterisation.  They move all scatterlists and bios over
to using PFNs, at least on architectures which have been converted.
Speaking of architectures not being converted, it is really past time for
architectures to be switched to supporting SG chaining.  It was introduced
in 2007, and not having it generically available causes problems for
the crypto layer, as well as making further enhancements more tricky.

Assuming 'select ARCH_HAS_SG_CHAIN' is sufficient to tell, the following
architectures do support it:

arm arm64 ia64 powerpc s390 sparc x86

which means the following architectures are 8 years delinquent in
adding support:

alpha arc avr32 blackfin c6x cris frv hexagon m32r m68k metag microblaze
mips mn10300 nios2 openrisc parisc score sh tile um unicore32 xtensa

Perhaps we could deliberately make asm-generic/scatterlist.h not build
for architectures that don't select it in order to make them convert ...

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 15:54     ` [Linux-nvdimm] " Boaz Harrosh
  2015-03-19 15:54       ` Boaz Harrosh
@ 2015-03-19 19:59       ` Andrew Morton
  2015-03-19 20:59         ` Dan Williams
                           ` (2 more replies)
  1 sibling, 3 replies; 59+ messages in thread
From: Andrew Morton @ 2015-03-19 19:59 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, linux-arch, axboe, riel, hch, linux-nvdimm,
	Dave Hansen, linux-kernel, linux-raid, mgorman, linux-fsdevel

On Thu, 19 Mar 2015 17:54:15 +0200 Boaz Harrosh <boaz@plexistor.com> wrote:

> On 03/19/2015 03:43 PM, Matthew Wilcox wrote:
> <>
> > 
> > Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
> > want to be able to do any kind of I/O directly to persistent memory,
> > and I think we do, we need to do one of:
> > 
> > 1. Construct struct pages for persistent memory
> > 1a. Permanently
> > 1b. While the pages are under I/O
> > 2. Teach the I/O layers to deal in PFNs instead of struct pages
> > 3. Replace struct page with some other structure that can represent both
> >    DRAM and PMEM
> > 
> > I'm personally a fan of #3, and I was looking at the scatterlist as
> > my preferred data structure.  I now believe the scatterlist as it is
> > currently defined isn't sufficient, so we probably end up needing a new
> > data structure.  I think Dan's preferred method of replacing struct
> > pages with PFNs is actually less instrusive, but doesn't give us as
> > much advantage (an entirely new data structure would let us move to an
> > extent based system at the same time, instead of sticking with an array
> > of pages).  Clearly Boaz prefers 1a, which works well enough for the
> > 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
> > 
> > What's your preference?  I guess option 0 is "force all I/O to go
> > through the page cache and then get copied", but that feels like a nasty
> > performance hit.
> 
> Thanks Matthew, you have summarized it perfectly.
> 
> I think #1b might have merit, as well.

It would be interesting to see what a 1b implementation looks like and
how it performs.  We already allocate a bunch of temporary things to
support in-flight IO (bio, request) and allocating pageframes on the
same basis seems a fairly logical fit.

It is all a bit of a stopgap, designed to shoehorn
direct-io-to-dax-mapped-memory into the existing world.  Longer term
I'd expect us to move to something more powerful, but it's unclear what
that will be at this time, so a stopgap isn't too bad?


This is all contingent upon the prevalence of machines which have vast
amounts of nv memory and relatively small amounts of regular memory. 
How confident are we that this really is the future?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 19:59       ` Andrew Morton
@ 2015-03-19 20:59         ` Dan Williams
  2015-03-22 17:22           ` Boaz Harrosh
  2015-03-20 17:32         ` Wols Lists
  2015-03-22 10:30         ` Boaz Harrosh
  2 siblings, 1 reply; 59+ messages in thread
From: Dan Williams @ 2015-03-19 20:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Boaz Harrosh, linux-arch, Jens Axboe, riel, linux-raid,
	linux-nvdimm, Dave Hansen, linux-kernel@vger.kernel.org,
	Christoph Hellwig, Mel Gorman, linux-fsdevel

On Thu, Mar 19, 2015 at 12:59 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Thu, 19 Mar 2015 17:54:15 +0200 Boaz Harrosh <boaz@plexistor.com> wrote:
>
>> On 03/19/2015 03:43 PM, Matthew Wilcox wrote:
>> <>
>> >
>> > Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
>> > want to be able to do any kind of I/O directly to persistent memory,
>> > and I think we do, we need to do one of:
>> >
>> > 1. Construct struct pages for persistent memory
>> > 1a. Permanently
>> > 1b. While the pages are under I/O
>> > 2. Teach the I/O layers to deal in PFNs instead of struct pages
>> > 3. Replace struct page with some other structure that can represent both
>> >    DRAM and PMEM
>> >
>> > I'm personally a fan of #3, and I was looking at the scatterlist as
>> > my preferred data structure.  I now believe the scatterlist as it is
>> > currently defined isn't sufficient, so we probably end up needing a new
>> > data structure.  I think Dan's preferred method of replacing struct
>> > pages with PFNs is actually less instrusive, but doesn't give us as
>> > much advantage (an entirely new data structure would let us move to an
>> > extent based system at the same time, instead of sticking with an array
>> > of pages).  Clearly Boaz prefers 1a, which works well enough for the
>> > 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
>> >
>> > What's your preference?  I guess option 0 is "force all I/O to go
>> > through the page cache and then get copied", but that feels like a nasty
>> > performance hit.
>>
>> Thanks Matthew, you have summarized it perfectly.
>>
>> I think #1b might have merit, as well.
>
> It would be interesting to see what a 1b implementation looks like and
> how it performs.  We already allocate a bunch of temporary things to
> support in-flight IO (bio, request) and allocating pageframes on the
> same basis seems a fairly logical fit.

At least for block-i/o it seems the only place we really need struct
page infrastructure is for kmap().  Given we already need a kmap_pfn()
solution for option 2 a "dynamic allocation" stop along that
development path may just naturally fall out.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-18 14:38     ` [Linux-nvdimm] " Boaz Harrosh
  2015-03-18 14:38       ` Boaz Harrosh
@ 2015-03-20 15:56       ` Rik van Riel
  2015-03-20 15:56         ` Rik van Riel
  2015-03-22 11:53         ` Boaz Harrosh
  1 sibling, 2 replies; 59+ messages in thread
From: Rik van Riel @ 2015-03-20 15:56 UTC (permalink / raw)
  To: Boaz Harrosh, Matthew Wilcox, Boaz Harrosh
  Cc: axboe, linux-arch, linux-raid, linux-nvdimm, Dave Hansen,
	linux-kernel, hch, Linus Torvalds, Al Viro, linux-fsdevel,
	Andrew Morton, mgorman

On 03/18/2015 10:38 AM, Boaz Harrosh wrote:
> On 03/18/2015 03:06 PM, Matthew Wilcox wrote:

>>> I'm not the one afraid of hard work, if it was for a good cause, but for what?
>>> really for what? The block layer, and RDMA, and networking, and spline, and what
>>> ever the heck any one wants to imagine to do with pmem, already works perfectly
>>> stable. right now!
>>
>> The overhead.  Allocating a struct page for every 4k page in a 400GB DIMM
>> (the current capacity available from one NV-DIMM vendor) occupies 6.4GB.
>> That's an unacceptable amount of overhead.
>>
> 
> So lets fix the stacks to work nice with 2M pages. That said we can
> allocate the struct page also from pmem if we need to. The fact remains
> that we need state down the different stacks and this is the current
> design over all.

Fixing the stack to work with 2M pages will be just as invasive,
and just as much work as making it work without a struct page.

What state do you need, exactly?

The struct page in the VM is mostly used for two things:
1) to get a memory address of the data
2) refcounting, to make sure the page does not go away
   during an IO operation, copy, etc...

Persistent memory cannot be paged out so (2) is not a concern, as
long as we ensure the object the page belongs to does not go away.
There are no seek times, so moving it around may not be necessary
either, making (1) not a concern.

The only case where (1) would be a concern is if we wanted to move
data in persistent memory around for better NUMA locality. However,
persistent memory DIMMs are on their way to being too large to move
the memory, anyway - all we can usefully do is detect where programs
are accessing memory, and move the programs there.

What state do you need that is not already represented?

1.5% overhead isn't a whole lot, but it appears to be unnecessary.

If you have a convincing argument as to why we need a struct page,
you might want to articulate it in order to convince us.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 15:56       ` Rik van Riel
@ 2015-03-20 15:56         ` Rik van Riel
  2015-03-22 11:53         ` Boaz Harrosh
  1 sibling, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2015-03-20 15:56 UTC (permalink / raw)
  To: Boaz Harrosh, Matthew Wilcox, Boaz Harrosh
  Cc: axboe, linux-arch, linux-raid, linux-nvdimm, Dave Hansen,
	linux-kernel, hch, Linus Torvalds, Al Viro, linux-fsdevel,
	Andrew Morton, mgorman

On 03/18/2015 10:38 AM, Boaz Harrosh wrote:
> On 03/18/2015 03:06 PM, Matthew Wilcox wrote:

>>> I'm not the one afraid of hard work, if it was for a good cause, but for what?
>>> really for what? The block layer, and RDMA, and networking, and spline, and what
>>> ever the heck any one wants to imagine to do with pmem, already works perfectly
>>> stable. right now!
>>
>> The overhead.  Allocating a struct page for every 4k page in a 400GB DIMM
>> (the current capacity available from one NV-DIMM vendor) occupies 6.4GB.
>> That's an unacceptable amount of overhead.
>>
> 
> So lets fix the stacks to work nice with 2M pages. That said we can
> allocate the struct page also from pmem if we need to. The fact remains
> that we need state down the different stacks and this is the current
> design over all.

Fixing the stack to work with 2M pages will be just as invasive,
and just as much work as making it work without a struct page.

What state do you need, exactly?

The struct page in the VM is mostly used for two things:
1) to get a memory address of the data
2) refcounting, to make sure the page does not go away
   during an IO operation, copy, etc...

Persistent memory cannot be paged out so (2) is not a concern, as
long as we ensure the object the page belongs to does not go away.
There are no seek times, so moving it around may not be necessary
either, making (1) not a concern.

The only case where (1) would be a concern is if we wanted to move
data in persistent memory around for better NUMA locality. However,
persistent memory DIMMs are on their way to being too large to move
the memory, anyway - all we can usefully do is detect where programs
are accessing memory, and move the programs there.

What state do you need that is not already represented?

1.5% overhead isn't a whole lot, but it appears to be unnecessary.

If you have a convincing argument as to why we need a struct page,
you might want to articulate it in order to convince us.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 13:43   ` Matthew Wilcox
                       ` (2 preceding siblings ...)
  2015-03-19 18:17     ` Christoph Hellwig
@ 2015-03-20 16:21     ` Rik van Riel
  2015-03-20 20:31       ` Matthew Wilcox
  2015-03-22 15:51       ` Boaz Harrosh
  3 siblings, 2 replies; 59+ messages in thread
From: Rik van Riel @ 2015-03-20 16:21 UTC (permalink / raw)
  To: Matthew Wilcox, Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel,
	Michael S. Tsirkin

On 03/19/2015 09:43 AM, Matthew Wilcox wrote:

> 1. Construct struct pages for persistent memory
> 1a. Permanently
> 1b. While the pages are under I/O

Michael Tsirkin and I have been doing some thinking about what
it would take to allocate struct pages per 2MB area permanently,
and allocate additional struct pages for 4kB pages on demand,
when a 2MB area is broken up into 4kB pages.

This should work for both DRAM and persistent memory.

I am still not convinced it is worthwhile to have struct pages
for persistent memory though, but I am willing to change my mind.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 19:59       ` Andrew Morton
  2015-03-19 20:59         ` Dan Williams
@ 2015-03-20 17:32         ` Wols Lists
  2015-03-20 17:32           ` Wols Lists
  2015-03-22 10:30         ` Boaz Harrosh
  2 siblings, 1 reply; 59+ messages in thread
From: Wols Lists @ 2015-03-20 17:32 UTC (permalink / raw)
  To: Andrew Morton, Boaz Harrosh
  Cc: Matthew Wilcox, linux-arch, axboe, riel, hch, linux-nvdimm,
	Dave Hansen, linux-kernel, linux-raid, mgorman, linux-fsdevel

On 19/03/15 19:59, Andrew Morton wrote:
> This is all contingent upon the prevalence of machines which have vast
> amounts of nv memory and relatively small amounts of regular memory. 
> How confident are we that this really is the future?

Somewhat off-topic, but it's also the past. I can't help thinking of the
early Pick machines, which treated backing store as one giant permanent
virtual memory. Back when 300Mb hard drives were HUGE.

Cheers,
Wol



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 17:32         ` Wols Lists
@ 2015-03-20 17:32           ` Wols Lists
  0 siblings, 0 replies; 59+ messages in thread
From: Wols Lists @ 2015-03-20 17:32 UTC (permalink / raw)
  To: Andrew Morton, Boaz Harrosh
  Cc: Matthew Wilcox, linux-arch, axboe, riel, hch, linux-nvdimm,
	Dave Hansen, linux-kernel, linux-raid, mgorman, linux-fsdevel

On 19/03/15 19:59, Andrew Morton wrote:
> This is all contingent upon the prevalence of machines which have vast
> amounts of nv memory and relatively small amounts of regular memory. 
> How confident are we that this really is the future?

Somewhat off-topic, but it's also the past. I can't help thinking of the
early Pick machines, which treated backing store as one giant permanent
virtual memory. Back when 300Mb hard drives were HUGE.

Cheers,
Wol



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 16:21     ` Rik van Riel
@ 2015-03-20 20:31       ` Matthew Wilcox
  2015-03-20 21:08         ` Rik van Riel
                           ` (2 more replies)
  2015-03-22 15:51       ` Boaz Harrosh
  1 sibling, 3 replies; 59+ messages in thread
From: Matthew Wilcox @ 2015-03-20 20:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel, Michael S. Tsirkin

On Fri, Mar 20, 2015 at 12:21:34PM -0400, Rik van Riel wrote:
> On 03/19/2015 09:43 AM, Matthew Wilcox wrote:
> 
> > 1. Construct struct pages for persistent memory
> > 1a. Permanently
> > 1b. While the pages are under I/O
> 
> Michael Tsirkin and I have been doing some thinking about what
> it would take to allocate struct pages per 2MB area permanently,
> and allocate additional struct pages for 4kB pages on demand,
> when a 2MB area is broken up into 4kB pages.

Ah!  I've looked at that a couple of times as well.  I asked our database
performance team what impact freeing up the memmap would have on their
performance.  They told me that doubling the amount of memory generally
resulted in approximately a 40% performance improvement.  So freeing up
1.5% additional memory would result in about 0.6% performance improvement,
which I thought was probably too small a return on investment to justify
turning memmap into a two-level data structure.

Persistent memory might change that calculation somewhat ... but I'm
not convinced.  Certainly, if we already had the ability to allocate
'struct superpage', I wouldn't be pushing for page-less I/Os, I'd just
allocate these data structures for PM.  Even if they were 128 bytes in
size, that's only a 25MB overhead per 400GB NV-DIMM, which feels quite
reasonable to me.

> This should work for both DRAM and persistent memory.
> 
> I am still not convinced it is worthwhile to have struct pages
> for persistent memory though, but I am willing to change my mind.

There's a lot of code out there that relies on struct page being PAGE_SIZE
bytes.  I'm cool with replacing 'struct page' with 'struct superpage'
[1] in the biovec and auditing all of the code which touches it ... but
that's going to be a lot of code!  I'm not sure it's less code than
going directly to 'just do I/O on PFNs'.

[1] Please, somebody come up with a better name!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 20:31       ` Matthew Wilcox
@ 2015-03-20 21:08         ` Rik van Riel
  2015-03-20 21:08           ` Rik van Riel
  2015-03-22 17:06           ` Boaz Harrosh
  2015-03-20 21:17         ` Wols Lists
  2015-03-22 16:24         ` Boaz Harrosh
  2 siblings, 2 replies; 59+ messages in thread
From: Rik van Riel @ 2015-03-20 21:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel, Michael S. Tsirkin

On 03/20/2015 04:31 PM, Matthew Wilcox wrote:
> On Fri, Mar 20, 2015 at 12:21:34PM -0400, Rik van Riel wrote:
>> On 03/19/2015 09:43 AM, Matthew Wilcox wrote:
>>
>>> 1. Construct struct pages for persistent memory
>>> 1a. Permanently
>>> 1b. While the pages are under I/O
>>
>> Michael Tsirkin and I have been doing some thinking about what
>> it would take to allocate struct pages per 2MB area permanently,
>> and allocate additional struct pages for 4kB pages on demand,
>> when a 2MB area is broken up into 4kB pages.
> 
> Ah!  I've looked at that a couple of times as well.  I asked our database
> performance team what impact freeing up the memmap would have on their
> performance.  They told me that doubling the amount of memory generally
> resulted in approximately a 40% performance improvement.  So freeing up
> 1.5% additional memory would result in about 0.6% performance improvement,
> which I thought was probably too small a return on investment to justify
> turning memmap into a two-level data structure.

Agreed, it should not be done for memory savings alone, but only
if it helps improve all kinds of other things.

>> This should work for both DRAM and persistent memory.
>>
>> I am still not convinced it is worthwhile to have struct pages
>> for persistent memory though, but I am willing to change my mind.
> 
> There's a lot of code out there that relies on struct page being PAGE_SIZE
> bytes.  I'm cool with replacing 'struct page' with 'struct superpage'
> [1] in the biovec and auditing all of the code which touches it ... but
> that's going to be a lot of code!  I'm not sure it's less code than
> going directly to 'just do I/O on PFNs'.

Totally agreed here. I see absolutely no advantage to teaching the
IO layer about a "struct superpage" when it could operate on PFNs
just as easily.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 21:08         ` Rik van Riel
@ 2015-03-20 21:08           ` Rik van Riel
  2015-03-22 17:06           ` Boaz Harrosh
  1 sibling, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2015-03-20 21:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel, Michael S. Tsirkin

On 03/20/2015 04:31 PM, Matthew Wilcox wrote:
> On Fri, Mar 20, 2015 at 12:21:34PM -0400, Rik van Riel wrote:
>> On 03/19/2015 09:43 AM, Matthew Wilcox wrote:
>>
>>> 1. Construct struct pages for persistent memory
>>> 1a. Permanently
>>> 1b. While the pages are under I/O
>>
>> Michael Tsirkin and I have been doing some thinking about what
>> it would take to allocate struct pages per 2MB area permanently,
>> and allocate additional struct pages for 4kB pages on demand,
>> when a 2MB area is broken up into 4kB pages.
> 
> Ah!  I've looked at that a couple of times as well.  I asked our database
> performance team what impact freeing up the memmap would have on their
> performance.  They told me that doubling the amount of memory generally
> resulted in approximately a 40% performance improvement.  So freeing up
> 1.5% additional memory would result in about 0.6% performance improvement,
> which I thought was probably too small a return on investment to justify
> turning memmap into a two-level data structure.

Agreed, it should not be done for memory savings alone, but only
if it helps improve all kinds of other things.

>> This should work for both DRAM and persistent memory.
>>
>> I am still not convinced it is worthwhile to have struct pages
>> for persistent memory though, but I am willing to change my mind.
> 
> There's a lot of code out there that relies on struct page being PAGE_SIZE
> bytes.  I'm cool with replacing 'struct page' with 'struct superpage'
> [1] in the biovec and auditing all of the code which touches it ... but
> that's going to be a lot of code!  I'm not sure it's less code than
> going directly to 'just do I/O on PFNs'.

Totally agreed here. I see absolutely no advantage to teaching the
IO layer about a "struct superpage" when it could operate on PFNs
just as easily.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 20:31       ` Matthew Wilcox
  2015-03-20 21:08         ` Rik van Riel
@ 2015-03-20 21:17         ` Wols Lists
  2015-03-20 21:17           ` Wols Lists
  2015-03-22 16:24         ` Boaz Harrosh
  2 siblings, 1 reply; 59+ messages in thread
From: Wols Lists @ 2015-03-20 21:17 UTC (permalink / raw)
  To: Matthew Wilcox, Rik van Riel
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel, Michael S. Tsirkin

On 20/03/15 20:31, Matthew Wilcox wrote:
> Ah!  I've looked at that a couple of times as well.  I asked our database
> performance team what impact freeing up the memmap would have on their
> performance.  They told me that doubling the amount of memory generally
> resulted in approximately a 40% performance improvement.  So freeing up
> 1.5% additional memory would result in about 0.6% performance improvement,
> which I thought was probably too small a return on investment to justify
> turning memmap into a two-level data structure.

Don't get me started on databases! This is very much a relational
problem, other databases don't suffer like this.

(imho relational theory is totally inappropriate for an engineering
problem, like designing a database engine ...)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 21:17         ` Wols Lists
@ 2015-03-20 21:17           ` Wols Lists
  0 siblings, 0 replies; 59+ messages in thread
From: Wols Lists @ 2015-03-20 21:17 UTC (permalink / raw)
  To: Matthew Wilcox, Rik van Riel
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel, Michael S. Tsirkin

On 20/03/15 20:31, Matthew Wilcox wrote:
> Ah!  I've looked at that a couple of times as well.  I asked our database
> performance team what impact freeing up the memmap would have on their
> performance.  They told me that doubling the amount of memory generally
> resulted in approximately a 40% performance improvement.  So freeing up
> 1.5% additional memory would result in about 0.6% performance improvement,
> which I thought was probably too small a return on investment to justify
> turning memmap into a two-level data structure.

Don't get me started on databases! This is very much a relational
problem, other databases don't suffer like this.

(imho relational theory is totally inappropriate for an engineering
problem, like designing a database engine ...)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 19:59       ` Andrew Morton
  2015-03-19 20:59         ` Dan Williams
  2015-03-20 17:32         ` Wols Lists
@ 2015-03-22 10:30         ` Boaz Harrosh
  2015-03-22 10:30           ` Boaz Harrosh
  2 siblings, 1 reply; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-22 10:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, linux-arch, axboe, riel, hch, linux-nvdimm,
	Dave Hansen, linux-kernel, linux-raid, mgorman, linux-fsdevel

On 03/19/2015 09:59 PM, Andrew Morton wrote:
> On Thu, 19 Mar 2015 17:54:15 +0200 Boaz Harrosh <boaz@plexistor.com> wrote:
> 
>> On 03/19/2015 03:43 PM, Matthew Wilcox wrote:
>> <>
>>>
>>> Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
>>> want to be able to do any kind of I/O directly to persistent memory,
>>> and I think we do, we need to do one of:
>>>
>>> 1. Construct struct pages for persistent memory
>>> 1a. Permanently
>>> 1b. While the pages are under I/O
>>> 2. Teach the I/O layers to deal in PFNs instead of struct pages
>>> 3. Replace struct page with some other structure that can represent both
>>>    DRAM and PMEM
>>>
>>> I'm personally a fan of #3, and I was looking at the scatterlist as
>>> my preferred data structure.  I now believe the scatterlist as it is
>>> currently defined isn't sufficient, so we probably end up needing a new
>>> data structure.  I think Dan's preferred method of replacing struct
>>> pages with PFNs is actually less instrusive, but doesn't give us as
>>> much advantage (an entirely new data structure would let us move to an
>>> extent based system at the same time, instead of sticking with an array
>>> of pages).  Clearly Boaz prefers 1a, which works well enough for the
>>> 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
>>>
>>> What's your preference?  I guess option 0 is "force all I/O to go
>>> through the page cache and then get copied", but that feels like a nasty
>>> performance hit.
>>
>> Thanks Matthew, you have summarized it perfectly.
>>
>> I think #1b might have merit, as well.
> 
> It would be interesting to see what a 1b implementation looks like and
> how it performs.  We already allocate a bunch of temporary things to
> support in-flight IO (bio, request) and allocating pageframes on the
> same basis seems a fairly logical fit.

There is a couple of ways we can do this, they are all kind of 
"hacks" to me, along the line of how transparent huge pages is an
hack, a very nice one at that, and every one that knows me knows
I love hacks, be so it is never the less.

So it is all about designating the page to mean something else
at a set of a flag.

And actually the transparent-huge-pages is the core of this.
because there is already a switch on core page operations when
it is present. (for example get/put_page )

And because we do not want to allocate pages inline, as part of a
section, we also need a bit of a memory_model.h new define.
(May this can avoided I need to stare harder on this)

> 
> It is all a bit of a stopgap, designed to shoehorn
> direct-io-to-dax-mapped-memory into the existing world.  Longer term
> I'd expect us to move to something more powerful, but it's unclear what
> that will be at this time, so a stopgap isn't too bad?
> 

I'd bet real huge-pages is the long term. The one stop gap for
huge-pages is that no one wants to dirty a full 2M for two changed
bytes. 4k is kind of the IO performance granularity we all calculate
for. This can be solved in couple of ways, all very invasive to lots
of Kernel areas. 

Lots of times the problem is "where do you start?"

> 
> This is all contingent upon the prevalence of machines which have vast
> amounts of nv memory and relatively small amounts of regular memory. 
> How confident are we that this really is the future?
> 

One thing you guys are ignoring is that the 1.5% "waste" can come
from nv-memory. If real ram is scarce and nv-ram is hips cheep,
just allocate the pages from nvram then.

Do not forget that very soon after the availability of real
nvram, I mean not the backed up one, but the real like mram
or reram. Lots of machines will be 100% nv-ram + sram caches.
This is nothing to do with storage speed, it is to do with
power consumption. The machine shuts-off and picks up exactly
where it was. (Even at power on they consume much less, no refreshes)
In those machine a partition of storage say, the swap partition, will
be the volatile memory section of the machine, zeroed out on boot and
used as RAM.

So this future above does not exist. Pages can just be allocated
from the cheapest memory you have and be done with it.

(BTW all this can already be done now, I have demonstrated it
 in the lab, a reserved NvDIMM memory region is memory_hot_plugged
 and is there after used as regular RAM)

Thanks
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-22 10:30         ` Boaz Harrosh
@ 2015-03-22 10:30           ` Boaz Harrosh
  0 siblings, 0 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-22 10:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, linux-arch, axboe, riel, hch, linux-nvdimm,
	Dave Hansen, linux-kernel, linux-raid, mgorman, linux-fsdevel

On 03/19/2015 09:59 PM, Andrew Morton wrote:
> On Thu, 19 Mar 2015 17:54:15 +0200 Boaz Harrosh <boaz@plexistor.com> wrote:
> 
>> On 03/19/2015 03:43 PM, Matthew Wilcox wrote:
>> <>
>>>
>>> Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
>>> want to be able to do any kind of I/O directly to persistent memory,
>>> and I think we do, we need to do one of:
>>>
>>> 1. Construct struct pages for persistent memory
>>> 1a. Permanently
>>> 1b. While the pages are under I/O
>>> 2. Teach the I/O layers to deal in PFNs instead of struct pages
>>> 3. Replace struct page with some other structure that can represent both
>>>    DRAM and PMEM
>>>
>>> I'm personally a fan of #3, and I was looking at the scatterlist as
>>> my preferred data structure.  I now believe the scatterlist as it is
>>> currently defined isn't sufficient, so we probably end up needing a new
>>> data structure.  I think Dan's preferred method of replacing struct
>>> pages with PFNs is actually less instrusive, but doesn't give us as
>>> much advantage (an entirely new data structure would let us move to an
>>> extent based system at the same time, instead of sticking with an array
>>> of pages).  Clearly Boaz prefers 1a, which works well enough for the
>>> 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
>>>
>>> What's your preference?  I guess option 0 is "force all I/O to go
>>> through the page cache and then get copied", but that feels like a nasty
>>> performance hit.
>>
>> Thanks Matthew, you have summarized it perfectly.
>>
>> I think #1b might have merit, as well.
> 
> It would be interesting to see what a 1b implementation looks like and
> how it performs.  We already allocate a bunch of temporary things to
> support in-flight IO (bio, request) and allocating pageframes on the
> same basis seems a fairly logical fit.

There is a couple of ways we can do this, they are all kind of 
"hacks" to me, along the line of how transparent huge pages is an
hack, a very nice one at that, and every one that knows me knows
I love hacks, be so it is never the less.

So it is all about designating the page to mean something else
at a set of a flag.

And actually the transparent-huge-pages is the core of this.
because there is already a switch on core page operations when
it is present. (for example get/put_page )

And because we do not want to allocate pages inline, as part of a
section, we also need a bit of a memory_model.h new define.
(May this can avoided I need to stare harder on this)

> 
> It is all a bit of a stopgap, designed to shoehorn
> direct-io-to-dax-mapped-memory into the existing world.  Longer term
> I'd expect us to move to something more powerful, but it's unclear what
> that will be at this time, so a stopgap isn't too bad?
> 

I'd bet real huge-pages is the long term. The one stop gap for
huge-pages is that no one wants to dirty a full 2M for two changed
bytes. 4k is kind of the IO performance granularity we all calculate
for. This can be solved in couple of ways, all very invasive to lots
of Kernel areas. 

Lots of times the problem is "where do you start?"

> 
> This is all contingent upon the prevalence of machines which have vast
> amounts of nv memory and relatively small amounts of regular memory. 
> How confident are we that this really is the future?
> 

One thing you guys are ignoring is that the 1.5% "waste" can come
from nv-memory. If real ram is scarce and nv-ram is hips cheep,
just allocate the pages from nvram then.

Do not forget that very soon after the availability of real
nvram, I mean not the backed up one, but the real like mram
or reram. Lots of machines will be 100% nv-ram + sram caches.
This is nothing to do with storage speed, it is to do with
power consumption. The machine shuts-off and picks up exactly
where it was. (Even at power on they consume much less, no refreshes)
In those machine a partition of storage say, the swap partition, will
be the volatile memory section of the machine, zeroed out on boot and
used as RAM.

So this future above does not exist. Pages can just be allocated
from the cheapest memory you have and be done with it.

(BTW all this can already be done now, I have demonstrated it
 in the lab, a reserved NvDIMM memory region is memory_hot_plugged
 and is there after used as regular RAM)

Thanks
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 15:56       ` Rik van Riel
  2015-03-20 15:56         ` Rik van Riel
@ 2015-03-22 11:53         ` Boaz Harrosh
  2015-03-22 11:53           ` Boaz Harrosh
  1 sibling, 1 reply; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-22 11:53 UTC (permalink / raw)
  To: Rik van Riel, Matthew Wilcox, Boaz Harrosh
  Cc: axboe, linux-arch, linux-raid, linux-nvdimm, Dave Hansen,
	linux-kernel, hch, Linus Torvalds, Al Viro, linux-fsdevel,
	Andrew Morton, mgorman

On 03/20/2015 05:56 PM, Rik van Riel wrote:
> On 03/18/2015 10:38 AM, Boaz Harrosh wrote:
>> On 03/18/2015 03:06 PM, Matthew Wilcox wrote:
> 
>>>> I'm not the one afraid of hard work, if it was for a good cause, but for what?
>>>> really for what? The block layer, and RDMA, and networking, and spline, and what
>>>> ever the heck any one wants to imagine to do with pmem, already works perfectly
>>>> stable. right now!
>>>
>>> The overhead.  Allocating a struct page for every 4k page in a 400GB DIMM
>>> (the current capacity available from one NV-DIMM vendor) occupies 6.4GB.
>>> That's an unacceptable amount of overhead.
>>>
>>
>> So lets fix the stacks to work nice with 2M pages. That said we can
>> allocate the struct page also from pmem if we need to. The fact remains
>> that we need state down the different stacks and this is the current
>> design over all.
> 
> Fixing the stack to work with 2M pages will be just as invasive,
> and just as much work as making it work without a struct page.
> 
> What state do you need, exactly?
> 

It is not me that needs state it is the Kernel. Let me show you
what I can do now that uses state (and pages).

block layer sends a bio via iscsi, in turn it goes around and
sends it via networking stack. Here page-ref is used as well
as all kind of page based management. (This is half the Kernel
converted right here)
Same thing but iser & RDMA. Same thing to a null-target, via
the target stack, maybe via path-threw.

Another big example:
  At user-mode application I mmap a portion of pmem, I then
use the libvirt API to designate a named shared-memory object.
At vm I use the same API to retrieve a pointer to that pmem
region and boom, I'm persistent. (Same can be done between
two VMs)

mmap(pmem) send it to network, to encryption, direct_io
RDMA, anything copyless.

So many subsystem use page_lock page->lru page-ref and are
written to receive and manage pages. I do not like to be
excluded from these systems, and I would very much hate
to re-write them. block layer is an example.

> The struct page in the VM is mostly used for two things:
> 1) to get a memory address of the data
> 2) refcounting, to make sure the page does not go away
>    during an IO operation, copy, etc...
> 
> Persistent memory cannot be paged out so (2) is not a concern, as
> long as we ensure the object the page belongs to does not go away.
> There are no seek times, so moving it around may not be necessary
> either, making (1) not a concern.
> 

I lost you sorry. I'm not sure what you meant here?
Yes kmap/kunmap is mute. I do not see any use for highmem and
any 32bitness with this thing.

refcounting is used sure, even with pmem see above. Actually
relaying on refcounting existence can solve us some stuff at
the pmem management level, which exist today. (RDMA while truncate)

> The only case where (1) would be a concern is if we wanted to move
> data in persistent memory around for better NUMA locality. However,
> persistent memory DIMMs are on their way to being too large to move
> the memory, anyway - all we can usefully do is detect where programs
> are accessing memory, and move the programs there.
> 

So actually I have hands on experience with this very problem.
We have observed that NUMA kills us. Now going through memory_add_physaddr_to_nid()
loop for every 4k operation was a pain, but caching it on page_to_nid()
(As part of flags in 64bit) is very nice optimization, we do NUMA aware block
allocation and it preforms much better. (Never like a single node but magnitude
better then without)

> What state do you need that is not already represented?
> 

Most of these subsystem you guys are focused on it is mostly read-only
state. Except page-ref. But never the less the page has added information
describing the pfn. Like nid mapping->ops flags etc ...

And it is also a stop gap of translation.
give me a page I now the pfn and vaddr, give me a pfn I know page
give me a vaddr I know the page. So I can move between all these domains.

Now I am sure that in hindsight we might have devised better structures
and abstractions that could carry all this information in a more abstract
and convenient way, throughout the Kernel. But for now this basic object
is a page and is passed around like in a relay-race. Each subsystem with
its own page based meta-structure. The only real global token is
page-struct.

You are saying: "not already represented" ? I'm saying exactly, sir
it is already represented as a page-struct. Anything else is in the
far far future. (if at all)

> 1.5% overhead isn't a whole lot, but it appears to be unnecessary.
> 

unnecessary, in a theoretical future with every single Kernel
subsystem changed (maybe for the better I'm not saying). And this
future is not even at all clear what it is.

But for current code structure it is very much necessary. For the
very long present days, it is not 1.5% with or without. It is
need-to-copy or direct(-1.5%)

[For me it is not even the performance of a memcpy which exacly halves
 my pmem performance, it is the latency and the extra nightmare locking
 and management to keep in sync two copies of the same thing]

> If you have a convincing argument as to why we need a struct page,
> you might want to articulate it in order to convince us.
> 

The must simple convincing argument there is. "Existing code". Apparently
page was needed, maybe we can all think of much better constructs. But
for now this is what the Kernel is based on. Until such time that we
better it it is there.

Since when we refrain from new technologies and new fixtures because
"A major cleanup is needed". I'm all for all the great 
"change-every-file in Kernel" ideas some guys have, but while at it
also change the small patch I added to support pmem.

For me pmem is now, at clients systems. and I chose direct(-1.5%)
over need-to-copy. Because it gives me the performance, and most
important, latency that sales my products. What is your timetable?

Cheers
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-22 11:53         ` Boaz Harrosh
@ 2015-03-22 11:53           ` Boaz Harrosh
  0 siblings, 0 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-22 11:53 UTC (permalink / raw)
  To: Rik van Riel, Matthew Wilcox, Boaz Harrosh
  Cc: axboe, linux-arch, linux-raid, linux-nvdimm, Dave Hansen,
	linux-kernel, hch, Linus Torvalds, Al Viro, linux-fsdevel,
	Andrew Morton, mgorman

On 03/20/2015 05:56 PM, Rik van Riel wrote:
> On 03/18/2015 10:38 AM, Boaz Harrosh wrote:
>> On 03/18/2015 03:06 PM, Matthew Wilcox wrote:
> 
>>>> I'm not the one afraid of hard work, if it was for a good cause, but for what?
>>>> really for what? The block layer, and RDMA, and networking, and spline, and what
>>>> ever the heck any one wants to imagine to do with pmem, already works perfectly
>>>> stable. right now!
>>>
>>> The overhead.  Allocating a struct page for every 4k page in a 400GB DIMM
>>> (the current capacity available from one NV-DIMM vendor) occupies 6.4GB.
>>> That's an unacceptable amount of overhead.
>>>
>>
>> So lets fix the stacks to work nice with 2M pages. That said we can
>> allocate the struct page also from pmem if we need to. The fact remains
>> that we need state down the different stacks and this is the current
>> design over all.
> 
> Fixing the stack to work with 2M pages will be just as invasive,
> and just as much work as making it work without a struct page.
> 
> What state do you need, exactly?
> 

It is not me that needs state it is the Kernel. Let me show you
what I can do now that uses state (and pages).

block layer sends a bio via iscsi, in turn it goes around and
sends it via networking stack. Here page-ref is used as well
as all kind of page based management. (This is half the Kernel
converted right here)
Same thing but iser & RDMA. Same thing to a null-target, via
the target stack, maybe via path-threw.

Another big example:
  At user-mode application I mmap a portion of pmem, I then
use the libvirt API to designate a named shared-memory object.
At vm I use the same API to retrieve a pointer to that pmem
region and boom, I'm persistent. (Same can be done between
two VMs)

mmap(pmem) send it to network, to encryption, direct_io
RDMA, anything copyless.

So many subsystem use page_lock page->lru page-ref and are
written to receive and manage pages. I do not like to be
excluded from these systems, and I would very much hate
to re-write them. block layer is an example.

> The struct page in the VM is mostly used for two things:
> 1) to get a memory address of the data
> 2) refcounting, to make sure the page does not go away
>    during an IO operation, copy, etc...
> 
> Persistent memory cannot be paged out so (2) is not a concern, as
> long as we ensure the object the page belongs to does not go away.
> There are no seek times, so moving it around may not be necessary
> either, making (1) not a concern.
> 

I lost you sorry. I'm not sure what you meant here?
Yes kmap/kunmap is mute. I do not see any use for highmem and
any 32bitness with this thing.

refcounting is used sure, even with pmem see above. Actually
relaying on refcounting existence can solve us some stuff at
the pmem management level, which exist today. (RDMA while truncate)

> The only case where (1) would be a concern is if we wanted to move
> data in persistent memory around for better NUMA locality. However,
> persistent memory DIMMs are on their way to being too large to move
> the memory, anyway - all we can usefully do is detect where programs
> are accessing memory, and move the programs there.
> 

So actually I have hands on experience with this very problem.
We have observed that NUMA kills us. Now going through memory_add_physaddr_to_nid()
loop for every 4k operation was a pain, but caching it on page_to_nid()
(As part of flags in 64bit) is very nice optimization, we do NUMA aware block
allocation and it preforms much better. (Never like a single node but magnitude
better then without)

> What state do you need that is not already represented?
> 

Most of these subsystem you guys are focused on it is mostly read-only
state. Except page-ref. But never the less the page has added information
describing the pfn. Like nid mapping->ops flags etc ...

And it is also a stop gap of translation.
give me a page I now the pfn and vaddr, give me a pfn I know page
give me a vaddr I know the page. So I can move between all these domains.

Now I am sure that in hindsight we might have devised better structures
and abstractions that could carry all this information in a more abstract
and convenient way, throughout the Kernel. But for now this basic object
is a page and is passed around like in a relay-race. Each subsystem with
its own page based meta-structure. The only real global token is
page-struct.

You are saying: "not already represented" ? I'm saying exactly, sir
it is already represented as a page-struct. Anything else is in the
far far future. (if at all)

> 1.5% overhead isn't a whole lot, but it appears to be unnecessary.
> 

unnecessary, in a theoretical future with every single Kernel
subsystem changed (maybe for the better I'm not saying). And this
future is not even at all clear what it is.

But for current code structure it is very much necessary. For the
very long present days, it is not 1.5% with or without. It is
need-to-copy or direct(-1.5%)

[For me it is not even the performance of a memcpy which exacly halves
 my pmem performance, it is the latency and the extra nightmare locking
 and management to keep in sync two copies of the same thing]

> If you have a convincing argument as to why we need a struct page,
> you might want to articulate it in order to convince us.
> 

The must simple convincing argument there is. "Existing code". Apparently
page was needed, maybe we can all think of much better constructs. But
for now this is what the Kernel is based on. Until such time that we
better it it is there.

Since when we refrain from new technologies and new fixtures because
"A major cleanup is needed". I'm all for all the great 
"change-every-file in Kernel" ideas some guys have, but while at it
also change the small patch I added to support pmem.

For me pmem is now, at clients systems. and I chose direct(-1.5%)
over need-to-copy. Because it gives me the performance, and most
important, latency that sales my products. What is your timetable?

Cheers
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 16:21     ` Rik van Riel
  2015-03-20 20:31       ` Matthew Wilcox
@ 2015-03-22 15:51       ` Boaz Harrosh
  2015-03-23 15:19         ` Rik van Riel
  1 sibling, 1 reply; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-22 15:51 UTC (permalink / raw)
  To: Rik van Riel, Matthew Wilcox, Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel,
	Michael S. Tsirkin

On 03/20/2015 06:21 PM, Rik van Riel wrote:
> On 03/19/2015 09:43 AM, Matthew Wilcox wrote:
> 
>> 1. Construct struct pages for persistent memory
>> 1a. Permanently
>> 1b. While the pages are under I/O
> 
> Michael Tsirkin and I have been doing some thinking about what
> it would take to allocate struct pages per 2MB area permanently,
> and allocate additional struct pages for 4kB pages on demand,
> when a 2MB area is broken up into 4kB pages.
> 
> This should work for both DRAM and persistent memory.
> 

My thoughts as well, this need *not* be a huge evasive change. Is however
a careful surgery in very core code. And lots of sleepless scary nights
and testing to make sure all the side effects are wrinkled out.

BTW: Basic core block code may very well work with:
	bv_page, bv_len > PAGE_SIZE bv_offset > PAGE_SIZE.

  Meaning bv_page-pfn is contiguous in physical space (and virtual
  of course). So much so that there are already rumors that this suppose
  to be supported, and there are already out-of-tree drivers that use
  this today by kmalloc a page-order and feeding BIOs with bv_len=64K

  But going out of block-layer and say to networking say via iscsi and
  this breaks pretty fast. Lets fix that then lets introduce a:
	page_size(page)
  page already knows its size (ie belonging to a 2M THP)

> I am still not convinced it is worthwhile to have struct pages
> for persistent memory though, but I am willing to change my mind.
> 

If we want copy-less, we need a common memory descriptor career. Today this
is page-struct. So for me your above statement means:
	"still not convinced I care about copy-less pmem"

Otherwise you either enhance what you have today or devise a new
system, which means change the all Kernel.

Lastly: Why does pmem need to wait out-of-tree. Even you say above that
machines with lots of DRAM can enjoy the HUGE-to-4k split. So why
not let pmem waist 4k pages like everyone else and fix it as above
down the line, both for pmem and ram. And save both ways.
Why do we need to first change the all Kernel, then have pmem. Why not
use current infra structure, for good or for worth, and incrementally
do better.

May I call you on the phone to try and work things out. I believe the
huge page thing + 4k on demand is not a very big change, as long as
	struct page *page is left as is, everywhere.

But may *now* carry a different physical/virtual contiguous payload
bigger then 4k. Is not the PAGE_SIZE the real bug? lets fix that problem.

Thanks
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 20:31       ` Matthew Wilcox
  2015-03-20 21:08         ` Rik van Riel
  2015-03-20 21:17         ` Wols Lists
@ 2015-03-22 16:24         ` Boaz Harrosh
  2015-03-22 16:24           ` Boaz Harrosh
  2 siblings, 1 reply; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-22 16:24 UTC (permalink / raw)
  To: Matthew Wilcox, Rik van Riel
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel, Michael S. Tsirkin

On 03/20/2015 10:31 PM, Matthew Wilcox wrote:
<>
> 
> There's a lot of code out there that relies on struct page being PAGE_SIZE
> bytes.  

Not so much really. Not at the lower end of the stack. You can actually feed
a 
	vp = kmalloc(64K);
	bv_page = virt_to_page(vp)
	bv_len = 64k

And feed that to an hard drive. It works.

The only last stronghold of PAGE_SIZE is at the page-cache and page-fault
granularity where the minimum is the better. But it should not be hard
to clean up the lower end of the stack. Even introduce a:
	page_size(page)

You will find that every subsystem that can work with a sub-page size
similar to above bv_len. Will also work well with bigger than PAGE_SIZE
bv_len equivalent.

Only the BUG_ONs need to convert to page_size(page) instead of PAGE_SIZE

> I'm cool with replacing 'struct page' with 'struct superpage'
> [1] in the biovec and auditing all of the code which touches it ... but
> that's going to be a lot of code!  I'm not sure it's less code than
> going directly to 'just do I/O on PFNs'.
> 

struct page already knows how to be a super-page. with the THP mechanics.
All a page_size(page) needs is a call to its section, we do not need any
added storage at page-struct. (And we can cache this as a flag we actually
already have a flag)

It looks like you are very trigger happy to change
	"biovec and auditing all of the code which touches it"

I believe long long term your #1b is the correct "full audit" path:

	Page Is the virtual-2-page-2-physical descriptor + state.
	It is variable size
 
> [1] Please, somebody come up with a better name!

sure struct page *page.

The one to kill is PAGE_SIZE. In most current code it can just be MIN_PAGE_SIZE
and CACHE_PAGE_SIZE == MIN_PAGE_SIZE. Only novelty is enhance of the split_huge_page
in the case of "page-fault-granularity".

Thanks
Boaz


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-22 16:24         ` Boaz Harrosh
@ 2015-03-22 16:24           ` Boaz Harrosh
  0 siblings, 0 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-22 16:24 UTC (permalink / raw)
  To: Matthew Wilcox, Rik van Riel
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel, Michael S. Tsirkin

On 03/20/2015 10:31 PM, Matthew Wilcox wrote:
<>
> 
> There's a lot of code out there that relies on struct page being PAGE_SIZE
> bytes.  

Not so much really. Not at the lower end of the stack. You can actually feed
a 
	vp = kmalloc(64K);
	bv_page = virt_to_page(vp)
	bv_len = 64k

And feed that to an hard drive. It works.

The only last stronghold of PAGE_SIZE is at the page-cache and page-fault
granularity where the minimum is the better. But it should not be hard
to clean up the lower end of the stack. Even introduce a:
	page_size(page)

You will find that every subsystem that can work with a sub-page size
similar to above bv_len. Will also work well with bigger than PAGE_SIZE
bv_len equivalent.

Only the BUG_ONs need to convert to page_size(page) instead of PAGE_SIZE

> I'm cool with replacing 'struct page' with 'struct superpage'
> [1] in the biovec and auditing all of the code which touches it ... but
> that's going to be a lot of code!  I'm not sure it's less code than
> going directly to 'just do I/O on PFNs'.
> 

struct page already knows how to be a super-page. with the THP mechanics.
All a page_size(page) needs is a call to its section, we do not need any
added storage at page-struct. (And we can cache this as a flag we actually
already have a flag)

It looks like you are very trigger happy to change
	"biovec and auditing all of the code which touches it"

I believe long long term your #1b is the correct "full audit" path:

	Page Is the virtual-2-page-2-physical descriptor + state.
	It is variable size

> [1] Please, somebody come up with a better name!

sure struct page *page.

The one to kill is PAGE_SIZE. In most current code it can just be MIN_PAGE_SIZE
and CACHE_PAGE_SIZE == MIN_PAGE_SIZE. Only novelty is enhance of the split_huge_page
in the case of "page-fault-granularity".

Thanks
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 18:17     ` Christoph Hellwig
  2015-03-19 19:31       ` Matthew Wilcox
@ 2015-03-22 16:46       ` Boaz Harrosh
  1 sibling, 0 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-22 16:46 UTC (permalink / raw)
  To: Christoph Hellwig, Matthew Wilcox
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	riel, linux-nvdimm, Dave Hansen, linux-raid, mgorman,
	linux-fsdevel

On 03/19/2015 08:17 PM, Christoph Hellwig wrote:
<>
> 
> In addition to the options there's also a time line.  At least for the
> short term where we want to get something going 1a seems like the
> absolutely be option.  It works perfectly fine for the lots of small
> capacity dram-like nvdimms, and it works funtionally fine for the
> special huge ones, although the resource use for it is highly annoying.
> If it turns out to be too annoying we can also offer a no I/O possible
> option for them in the short run.
> 

Finally some voice in the dessert.

> In the long run option 2) sounds like a good plan to me, but not as a
> parallel I/O path, but as the main one.  Doing so will in fact give us
> options to experiment with 3).  Given that we're moving towards an
> increasinly huge page using world replacing the good old struct page
> with something extent-like and/or temporary might be needed for dram
> as well in the future.

Why ? why not just make page mean page_size(page) and mostly even that
is not needed.

Any changes to bio will only solve bio. And will push the problem to
the next subsystem.

Fix the PAGE_SIZE problem and you fixed it for all subsystems, not only
bio. And I believe it is the smaller change by far.

Because in most places PAGE_SIZE just means MIN_PAGE_SIZE when we try
calculate some array sizes for storage of a given "io-length", this
is surly 4k, but then when the actual run time is preformed we usually
have a length specifier like bv_len. (And the few places that do not are
easy to fix I believe)

Thanks
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-20 21:08         ` Rik van Riel
  2015-03-20 21:08           ` Rik van Riel
@ 2015-03-22 17:06           ` Boaz Harrosh
  2015-03-22 17:22             ` Dan Williams
  1 sibling, 1 reply; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-22 17:06 UTC (permalink / raw)
  To: Rik van Riel, Matthew Wilcox
  Cc: Andrew Morton, Dan Williams, linux-kernel, linux-arch, axboe,
	linux-nvdimm, Dave Hansen, linux-raid, mgorman, hch,
	linux-fsdevel, Michael S. Tsirkin

On 03/20/2015 11:08 PM, Rik van Riel wrote:
> On 03/20/2015 04:31 PM, Matthew Wilcox wrote:
<>
>> There's a lot of code out there that relies on struct page being PAGE_SIZE
>> bytes.  I'm cool with replacing 'struct page' with 'struct superpage'
>> [1] in the biovec and auditing all of the code which touches it ... but
>> that's going to be a lot of code!  I'm not sure it's less code than
>> going directly to 'just do I/O on PFNs'.
> 
> Totally agreed here. I see absolutely no advantage to teaching the
> IO layer about a "struct superpage" when it could operate on PFNs
> just as easily.
> 

Or teaching 'struct page' to be variable length, This is already so at
bio and sg level so you fixed nothing.

Moving to pfn's only means that all this unnamed code above that
"relies on struct page being PAGE_SIZE" is now not allowed to
interfaced with bio and sg list. Which in current code and in Dan's patches
means two tons of BUG_ONS and return -ENOTSUPP . For all these
subsystems below the bio and sglist that operate on page_structs

Say the "relies on struct page being PAGE_SIZE" is such an hard
work, which is not at all at the bio and sg-list level, will
it not be worth while fixing this instead of alienating the all
Kernel from the IO subsystem.

But I believe it is the much much smaller change? Specially considering
Networking, RDMA shared memory ...

Cheers
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-19 20:59         ` Dan Williams
@ 2015-03-22 17:22           ` Boaz Harrosh
  2015-03-22 17:22             ` Boaz Harrosh
  0 siblings, 1 reply; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-22 17:22 UTC (permalink / raw)
  To: Dan Williams, Andrew Morton
  Cc: Boaz Harrosh, linux-arch, Jens Axboe, riel, linux-raid,
	linux-nvdimm, Dave Hansen, linux-kernel@vger.kernel.org,
	Christoph Hellwig, Mel Gorman, linux-fsdevel

On 03/19/2015 10:59 PM, Dan Williams wrote:

> 
> At least for block-i/o it seems the only place we really need struct
> page infrastructure is for kmap().  Given we already need a kmap_pfn()
> solution for option 2 a "dynamic allocation" stop along that
> development path may just naturally fall out.

Really? what about networked block-io, RDMA, FcOE emulated targets,
mmaped pointers. virtual-machine bdev drivers

Block layer sits in the middle of the stack not at the low end as you
make it appear. There are lots of below the bio subsystems that tie into
a page struct, which will now stop to operate, unless you do:

pfn_to_page() which means a page-less pfn will now crash or will need
to be rejected so any where you have a
	if (page_less_pfn())
		... /* Fail or do some other code like copy */
	else
		page = pfn_to_page()

Is a double code path in the Kernel and is a nightmare to maintain.
(I'm here for you believe me ;-) )

Thanks
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-22 17:22           ` Boaz Harrosh
@ 2015-03-22 17:22             ` Boaz Harrosh
  0 siblings, 0 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-22 17:22 UTC (permalink / raw)
  To: Dan Williams, Andrew Morton
  Cc: Boaz Harrosh, linux-arch, Jens Axboe, riel, linux-raid,
	linux-nvdimm, Dave Hansen, linux-kernel@vger.kernel.org,
	Christoph Hellwig, Mel Gorman, linux-fsdevel

On 03/19/2015 10:59 PM, Dan Williams wrote:

> 
> At least for block-i/o it seems the only place we really need struct
> page infrastructure is for kmap().  Given we already need a kmap_pfn()
> solution for option 2 a "dynamic allocation" stop along that
> development path may just naturally fall out.

Really? what about networked block-io, RDMA, FcOE emulated targets,
mmaped pointers. virtual-machine bdev drivers

Block layer sits in the middle of the stack not at the low end as you
make it appear. There are lots of below the bio subsystems that tie into
a page struct, which will now stop to operate, unless you do:

pfn_to_page() which means a page-less pfn will now crash or will need
to be rejected so any where you have a
	if (page_less_pfn())
		... /* Fail or do some other code like copy */
	else
		page = pfn_to_page()

Is a double code path in the Kernel and is a nightmare to maintain.
(I'm here for you believe me ;-) )

Thanks
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-22 17:06           ` Boaz Harrosh
@ 2015-03-22 17:22             ` Dan Williams
  2015-03-22 17:22               ` Dan Williams
  2015-03-22 17:39               ` Boaz Harrosh
  0 siblings, 2 replies; 59+ messages in thread
From: Dan Williams @ 2015-03-22 17:22 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Rik van Riel, Matthew Wilcox, Andrew Morton,
	linux-kernel@vger.kernel.org, linux-arch, Jens Axboe,
	linux-nvdimm, Dave Hansen, linux-raid, Mel Gorman,
	Christoph Hellwig, linux-fsdevel, Michael S. Tsirkin

On Sun, Mar 22, 2015 at 10:06 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 03/20/2015 11:08 PM, Rik van Riel wrote:
>> On 03/20/2015 04:31 PM, Matthew Wilcox wrote:
> <>
>>> There's a lot of code out there that relies on struct page being PAGE_SIZE
>>> bytes.  I'm cool with replacing 'struct page' with 'struct superpage'
>>> [1] in the biovec and auditing all of the code which touches it ... but
>>> that's going to be a lot of code!  I'm not sure it's less code than
>>> going directly to 'just do I/O on PFNs'.
>>
>> Totally agreed here. I see absolutely no advantage to teaching the
>> IO layer about a "struct superpage" when it could operate on PFNs
>> just as easily.
>>
>
> Or teaching 'struct page' to be variable length, This is already so at
> bio and sg level so you fixed nothing.
>
> Moving to pfn's only means that all this unnamed code above that
> "relies on struct page being PAGE_SIZE" is now not allowed to
> interfaced with bio and sg list. Which in current code and in Dan's patches
> means two tons of BUG_ONS and return -ENOTSUPP . For all these
> subsystems below the bio and sglist that operate on page_structs

I'm not convinced it will be that bad.  In hyperbolic terms,
continuing to overload struct page means we get to let floppy.c do i/o
from pmem, who needs that level of compatibility?

Similar to sg_chain support I think it's fine to let sub-systems /
archs add pmem i/o support over time.  It's a scaling problem our
development model is good at.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-22 17:22             ` Dan Williams
@ 2015-03-22 17:22               ` Dan Williams
  2015-03-22 17:39               ` Boaz Harrosh
  1 sibling, 0 replies; 59+ messages in thread
From: Dan Williams @ 2015-03-22 17:22 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Rik van Riel, Matthew Wilcox, Andrew Morton,
	linux-kernel@vger.kernel.org, linux-arch, Jens Axboe,
	linux-nvdimm, Dave Hansen, linux-raid, Mel Gorman,
	Christoph Hellwig, linux-fsdevel, Michael S. Tsirkin

On Sun, Mar 22, 2015 at 10:06 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 03/20/2015 11:08 PM, Rik van Riel wrote:
>> On 03/20/2015 04:31 PM, Matthew Wilcox wrote:
> <>
>>> There's a lot of code out there that relies on struct page being PAGE_SIZE
>>> bytes.  I'm cool with replacing 'struct page' with 'struct superpage'
>>> [1] in the biovec and auditing all of the code which touches it ... but
>>> that's going to be a lot of code!  I'm not sure it's less code than
>>> going directly to 'just do I/O on PFNs'.
>>
>> Totally agreed here. I see absolutely no advantage to teaching the
>> IO layer about a "struct superpage" when it could operate on PFNs
>> just as easily.
>>
>
> Or teaching 'struct page' to be variable length, This is already so at
> bio and sg level so you fixed nothing.
>
> Moving to pfn's only means that all this unnamed code above that
> "relies on struct page being PAGE_SIZE" is now not allowed to
> interfaced with bio and sg list. Which in current code and in Dan's patches
> means two tons of BUG_ONS and return -ENOTSUPP . For all these
> subsystems below the bio and sglist that operate on page_structs

I'm not convinced it will be that bad.  In hyperbolic terms,
continuing to overload struct page means we get to let floppy.c do i/o
from pmem, who needs that level of compatibility?

Similar to sg_chain support I think it's fine to let sub-systems /
archs add pmem i/o support over time.  It's a scaling problem our
development model is good at.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-22 17:22             ` Dan Williams
  2015-03-22 17:22               ` Dan Williams
@ 2015-03-22 17:39               ` Boaz Harrosh
  2015-03-22 17:39                 ` Boaz Harrosh
  1 sibling, 1 reply; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-22 17:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: Rik van Riel, Matthew Wilcox, Andrew Morton,
	linux-kernel@vger.kernel.org, linux-arch, Jens Axboe,
	linux-nvdimm, Dave Hansen, linux-raid, Mel Gorman,
	Christoph Hellwig, linux-fsdevel, Michael S. Tsirkin

On 03/22/2015 07:22 PM, Dan Williams wrote:
> On Sun, Mar 22, 2015 at 10:06 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
<>
>>
>> Moving to pfn's only means that all this unnamed code above that
>> "relies on struct page being PAGE_SIZE" is now not allowed to
>> interfaced with bio and sg list. Which in current code and in Dan's patches
>> means two tons of BUG_ONS and return -ENOTSUPP . For all these
>> subsystems below the bio and sglist that operate on page_structs
> 
> I'm not convinced it will be that bad.  In hyperbolic terms,
> continuing to overload struct page means we get to let floppy.c do i/o
> from pmem, who needs that level of compatibility?
> 

But you do need to make sure it does not crash. right?

> Similar to sg_chain support I think it's fine to let sub-systems /
> archs add pmem i/o support over time.  It's a scaling problem our
> development model is good at.
> 

You are so eager to do all this massive change, and willing to do it
over a decade (Judging by your own example of sg-chain)

But you completely ignore the fact that what I'm saying is that
nothing needs to fundamentally change at all. No support over time
and no "scaling problem" at all. All we want to fix is that page-struct
means NOT PAGE_SIZE but some other size.

The much smaller change and full cross Kernel compatibility. What's
not to like ?

Cheers
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-22 17:39               ` Boaz Harrosh
@ 2015-03-22 17:39                 ` Boaz Harrosh
  0 siblings, 0 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-22 17:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: Rik van Riel, Matthew Wilcox, Andrew Morton,
	linux-kernel@vger.kernel.org, linux-arch, Jens Axboe,
	linux-nvdimm, Dave Hansen, linux-raid, Mel Gorman,
	Christoph Hellwig, linux-fsdevel, Michael S. Tsirkin

On 03/22/2015 07:22 PM, Dan Williams wrote:
> On Sun, Mar 22, 2015 at 10:06 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
<>
>>
>> Moving to pfn's only means that all this unnamed code above that
>> "relies on struct page being PAGE_SIZE" is now not allowed to
>> interfaced with bio and sg list. Which in current code and in Dan's patches
>> means two tons of BUG_ONS and return -ENOTSUPP . For all these
>> subsystems below the bio and sglist that operate on page_structs
> 
> I'm not convinced it will be that bad.  In hyperbolic terms,
> continuing to overload struct page means we get to let floppy.c do i/o
> from pmem, who needs that level of compatibility?
> 

But you do need to make sure it does not crash. right?

> Similar to sg_chain support I think it's fine to let sub-systems /
> archs add pmem i/o support over time.  It's a scaling problem our
> development model is good at.
> 

You are so eager to do all this massive change, and willing to do it
over a decade (Judging by your own example of sg-chain)

But you completely ignore the fact that what I'm saying is that
nothing needs to fundamentally change at all. No support over time
and no "scaling problem" at all. All we want to fix is that page-struct
means NOT PAGE_SIZE but some other size.

The much smaller change and full cross Kernel compatibility. What's
not to like ?

Cheers
Boaz


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-22 15:51       ` Boaz Harrosh
@ 2015-03-23 15:19         ` Rik van Riel
  2015-03-23 15:19           ` Rik van Riel
                             ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Rik van Riel @ 2015-03-23 15:19 UTC (permalink / raw)
  To: Boaz Harrosh, Matthew Wilcox, Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel,
	Michael S. Tsirkin

On 03/22/2015 11:51 AM, Boaz Harrosh wrote:
> On 03/20/2015 06:21 PM, Rik van Riel wrote:
>> On 03/19/2015 09:43 AM, Matthew Wilcox wrote:
>>
>>> 1. Construct struct pages for persistent memory
>>> 1a. Permanently
>>> 1b. While the pages are under I/O
>>
>> Michael Tsirkin and I have been doing some thinking about what
>> it would take to allocate struct pages per 2MB area permanently,
>> and allocate additional struct pages for 4kB pages on demand,
>> when a 2MB area is broken up into 4kB pages.
>>
>> This should work for both DRAM and persistent memory.
> 
> My thoughts as well, this need *not* be a huge evasive change. Is however
> a careful surgery in very core code. And lots of sleepless scary nights
> and testing to make sure all the side effects are wrinkled out.

Even the above IS a huge invasive change, and I do not see it
as much better than the work Dan and Matthew are doing.

> If we want copy-less, we need a common memory descriptor career. Today this
> is page-struct. So for me your above statement means:
> 	"still not convinced I care about copy-less pmem"
> 
> Otherwise you either enhance what you have today or devise a new
> system, which means change the all Kernel.

We do not necessarily need a common descriptor, as much as
one that abstracts out what is happening. Something like a
struct bio could be a good I/O descriptor, and releasing the
backing memory after IO completion could be a function of the
bio freeing function itself.

> Lastly: Why does pmem need to wait out-of-tree. Even you say above that
> machines with lots of DRAM can enjoy the HUGE-to-4k split. So why
> not let pmem waist 4k pages like everyone else and fix it as above
> down the line, both for pmem and ram. And save both ways.
> Why do we need to first change the all Kernel, then have pmem. Why not
> use current infra structure, for good or for worth, and incrementally
> do better.

There are two things going on here:

1) You want to keep using struct page for now, while there are
   subsystems that require it. This is perfectly legitimate.

2) Matthew and Dan are changing over some subsystems to no longer
   require struct page. This is perfectly legitimate.

I do not understand why either of you would have to object to what
the other is doing. There is room to keep using struct page until
the rest of the kernel no longer requires it.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-23 15:19         ` Rik van Riel
@ 2015-03-23 15:19           ` Rik van Riel
  2015-03-23 19:30           ` Christoph Hellwig
  2015-03-24  9:41           ` Boaz Harrosh
  2 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2015-03-23 15:19 UTC (permalink / raw)
  To: Boaz Harrosh, Matthew Wilcox, Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel,
	Michael S. Tsirkin

On 03/22/2015 11:51 AM, Boaz Harrosh wrote:
> On 03/20/2015 06:21 PM, Rik van Riel wrote:
>> On 03/19/2015 09:43 AM, Matthew Wilcox wrote:
>>
>>> 1. Construct struct pages for persistent memory
>>> 1a. Permanently
>>> 1b. While the pages are under I/O
>>
>> Michael Tsirkin and I have been doing some thinking about what
>> it would take to allocate struct pages per 2MB area permanently,
>> and allocate additional struct pages for 4kB pages on demand,
>> when a 2MB area is broken up into 4kB pages.
>>
>> This should work for both DRAM and persistent memory.
> 
> My thoughts as well, this need *not* be a huge evasive change. Is however
> a careful surgery in very core code. And lots of sleepless scary nights
> and testing to make sure all the side effects are wrinkled out.

Even the above IS a huge invasive change, and I do not see it
as much better than the work Dan and Matthew are doing.

> If we want copy-less, we need a common memory descriptor career. Today this
> is page-struct. So for me your above statement means:
> 	"still not convinced I care about copy-less pmem"
> 
> Otherwise you either enhance what you have today or devise a new
> system, which means change the all Kernel.

We do not necessarily need a common descriptor, as much as
one that abstracts out what is happening. Something like a
struct bio could be a good I/O descriptor, and releasing the
backing memory after IO completion could be a function of the
bio freeing function itself.

> Lastly: Why does pmem need to wait out-of-tree. Even you say above that
> machines with lots of DRAM can enjoy the HUGE-to-4k split. So why
> not let pmem waist 4k pages like everyone else and fix it as above
> down the line, both for pmem and ram. And save both ways.
> Why do we need to first change the all Kernel, then have pmem. Why not
> use current infra structure, for good or for worth, and incrementally
> do better.

There are two things going on here:

1) You want to keep using struct page for now, while there are
   subsystems that require it. This is perfectly legitimate.

2) Matthew and Dan are changing over some subsystems to no longer
   require struct page. This is perfectly legitimate.

I do not understand why either of you would have to object to what
the other is doing. There is room to keep using struct page until
the rest of the kernel no longer requires it.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-23 15:19         ` Rik van Riel
  2015-03-23 15:19           ` Rik van Riel
@ 2015-03-23 19:30           ` Christoph Hellwig
  2015-03-23 19:30             ` Christoph Hellwig
  2015-03-24  9:41           ` Boaz Harrosh
  2 siblings, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2015-03-23 19:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Boaz Harrosh, Matthew Wilcox, Andrew Morton, Dan Williams,
	linux-kernel, linux-arch, axboe, linux-nvdimm, Dave Hansen,
	linux-raid, mgorman, hch, linux-fsdevel, Michael S. Tsirkin

On Mon, Mar 23, 2015 at 11:19:07AM -0400, Rik van Riel wrote:
> There are two things going on here:
> 
> 1) You want to keep using struct page for now, while there are
>    subsystems that require it. This is perfectly legitimate.
> 
> 2) Matthew and Dan are changing over some subsystems to no longer
>    require struct page. This is perfectly legitimate.
> 
> I do not understand why either of you would have to object to what
> the other is doing. There is room to keep using struct page until
> the rest of the kernel no longer requires it.

*nod*

I'd really like to merge the struct page based pmem driver ASAP.  We can
then look into work that avoid the need for struct page, and I think Dan
is doing some good work in that direction.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-23 19:30           ` Christoph Hellwig
@ 2015-03-23 19:30             ` Christoph Hellwig
  0 siblings, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2015-03-23 19:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Boaz Harrosh, Matthew Wilcox, Andrew Morton, Dan Williams,
	linux-kernel, linux-arch, axboe, linux-nvdimm, Dave Hansen,
	linux-raid, mgorman, hch, linux-fsdevel, Michael S. Tsirkin

On Mon, Mar 23, 2015 at 11:19:07AM -0400, Rik van Riel wrote:
> There are two things going on here:
> 
> 1) You want to keep using struct page for now, while there are
>    subsystems that require it. This is perfectly legitimate.
> 
> 2) Matthew and Dan are changing over some subsystems to no longer
>    require struct page. This is perfectly legitimate.
> 
> I do not understand why either of you would have to object to what
> the other is doing. There is room to keep using struct page until
> the rest of the kernel no longer requires it.

*nod*

I'd really like to merge the struct page based pmem driver ASAP.  We can
then look into work that avoid the need for struct page, and I think Dan
is doing some good work in that direction.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-23 15:19         ` Rik van Riel
  2015-03-23 15:19           ` Rik van Riel
  2015-03-23 19:30           ` Christoph Hellwig
@ 2015-03-24  9:41           ` Boaz Harrosh
  2015-03-24  9:41             ` Boaz Harrosh
  2015-03-24 16:57             ` Rik van Riel
  2 siblings, 2 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-24  9:41 UTC (permalink / raw)
  To: Rik van Riel, Boaz Harrosh, Matthew Wilcox, Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel,
	Michael S. Tsirkin

On 03/23/2015 05:19 PM, Rik van Riel wrote:
>>> Michael Tsirkin and I have been doing some thinking about what
>>> it would take to allocate struct pages per 2MB area permanently,
>>> and allocate additional struct pages for 4kB pages on demand,
>>> when a 2MB area is broken up into 4kB pages.
>>
>> My thoughts as well, this need *not* be a huge evasive change. Is however
>> a careful surgery in very core code. And lots of sleepless scary nights
>> and testing to make sure all the side effects are wrinkled out.
> 
> Even the above IS a huge invasive change, and I do not see it
> as much better than the work Dan and Matthew are doing.
> 

You lost me again. Sorry for my slowness. The code I envision is not
invasive at all. Nothing is touched at all, except a few core places
at the page level.

The contract with Kernel stays the same:
	page_to_pfn, pfn_to_page, page_address (which is kmap_atomic in 64bit)
	virt_to_page, page_get/put and so on...

So none of the Kernel code need change at all. You were saying that we
might have a 2M page and on demand we can allocate a 4k page shove it down the
stack, which does not change at all, and once back from io, the 4k pages can be
freed and recycled for reuse with other IO. This is what I thought you said.

This is doable, and not that much work and for the life of me I do not see any
"invasive". (Yes a few core headers that make everything compile ;-))

That said I do not even think we need that (2M split to 4k on demand) we can even
do better and make sure 2M pages just work as is. It is very possible today
(Tested) to push a 2M page into a bio and write to a bdev. Yes lots of side
code will break, but the core path is clean. Let us fix that then.

(Need I send code to show you how a 2M page is written with a single
 bvec?)

>> If we want copy-less, we need a common memory descriptor career. Today this
>> is page-struct. So for me your above statement means:
>> 	"still not convinced I care about copy-less pmem"
>>
>> Otherwise you either enhance what you have today or devise a new
>> system, which means change the all Kernel.
> 
> We do not necessarily need a common descriptor, as much as
> one that abstracts out what is happening. Something like a
> struct bio could be a good I/O descriptor, and releasing the
> backing memory after IO completion could be a function of the
> bio freeing function itself.
> 

Lost me again sorry. What backing memory. struct bio is already
an I/O descriptor which gets freed after use. How is that relevant
to pfn vs page ?

>> Lastly: Why does pmem need to wait out-of-tree. Even you say above that
>> machines with lots of DRAM can enjoy the HUGE-to-4k split. So why
>> not let pmem waist 4k pages like everyone else and fix it as above
>> down the line, both for pmem and ram. And save both ways.
>> Why do we need to first change the all Kernel, then have pmem. Why not
>> use current infra structure, for good or for worth, and incrementally
>> do better.
> 
> There are two things going on here:
> 
> 1) You want to keep using struct page for now, while there are
>    subsystems that require it. This is perfectly legitimate.
> 
> 2) Matthew and Dan are changing over some subsystems to no longer
>    require struct page. This is perfectly legitimate.
> 

How is this legitimate when you need to Interface the [1] subsystems
under the [2] subsystem? A subsystem that expects pages is now not
usable by [2].

Today *All* the Kernel subsystems are [1] Period. How does it become
legitimate to now start *two* competing, do the same differently, abstraction,
in our kernel. We have two much diversity not to little.

> I do not understand why either of you would have to object to what
> the other is doing. There is room to keep using struct page until
> the rest of the kernel no longer requires it.
> 

So this is your vision "until the rest of the kernel no longer requires
pages" Really? Sigh, coming from other Kernels I thought pages were
a breeze of fresh air. I thought it was very clever. And BTW good luck
with that.

BTW: you have not solved the basic problem yet. for one pfn_kmap() given a
pfn what is its virtual address. would you like to loop through the
Kernel's range tables to look for the registered ioremap ? its a long
annoying loop. The page was invented exactly for this reason, to go through
the section object. And actually it is not that easy because if it is an ioremap
pointer it is in one list and if a page it is another way, and on top of
all this, it is ARCH dependent. And you are trashing highmem, because the
state and locks of that are at the page level. Not that I care about highmem
but I hate double coding. For god sake what do you guys have with poor old
pages, they were invented to exacly do this, abstract away management of a single
pfn-to-virt.
All I see is complains about page being 4K well it need not be. page can be any
size, and hell it can be variable size. (And no we do not need to add an extra size
member, all we need is the one bit)

Cheers
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-24  9:41           ` Boaz Harrosh
@ 2015-03-24  9:41             ` Boaz Harrosh
  2015-03-24 16:57             ` Rik van Riel
  1 sibling, 0 replies; 59+ messages in thread
From: Boaz Harrosh @ 2015-03-24  9:41 UTC (permalink / raw)
  To: Rik van Riel, Boaz Harrosh, Matthew Wilcox, Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel,
	Michael S. Tsirkin

On 03/23/2015 05:19 PM, Rik van Riel wrote:
>>> Michael Tsirkin and I have been doing some thinking about what
>>> it would take to allocate struct pages per 2MB area permanently,
>>> and allocate additional struct pages for 4kB pages on demand,
>>> when a 2MB area is broken up into 4kB pages.
>>
>> My thoughts as well, this need *not* be a huge evasive change. Is however
>> a careful surgery in very core code. And lots of sleepless scary nights
>> and testing to make sure all the side effects are wrinkled out.
> 
> Even the above IS a huge invasive change, and I do not see it
> as much better than the work Dan and Matthew are doing.
> 

You lost me again. Sorry for my slowness. The code I envision is not
invasive at all. Nothing is touched at all, except a few core places
at the page level.

The contract with Kernel stays the same:
	page_to_pfn, pfn_to_page, page_address (which is kmap_atomic in 64bit)
	virt_to_page, page_get/put and so on...

So none of the Kernel code need change at all. You were saying that we
might have a 2M page and on demand we can allocate a 4k page shove it down the
stack, which does not change at all, and once back from io, the 4k pages can be
freed and recycled for reuse with other IO. This is what I thought you said.

This is doable, and not that much work and for the life of me I do not see any
"invasive". (Yes a few core headers that make everything compile ;-))

That said I do not even think we need that (2M split to 4k on demand) we can even
do better and make sure 2M pages just work as is. It is very possible today
(Tested) to push a 2M page into a bio and write to a bdev. Yes lots of side
code will break, but the core path is clean. Let us fix that then.

(Need I send code to show you how a 2M page is written with a single
 bvec?)

>> If we want copy-less, we need a common memory descriptor career. Today this
>> is page-struct. So for me your above statement means:
>> 	"still not convinced I care about copy-less pmem"
>>
>> Otherwise you either enhance what you have today or devise a new
>> system, which means change the all Kernel.
> 
> We do not necessarily need a common descriptor, as much as
> one that abstracts out what is happening. Something like a
> struct bio could be a good I/O descriptor, and releasing the
> backing memory after IO completion could be a function of the
> bio freeing function itself.
> 

Lost me again sorry. What backing memory. struct bio is already
an I/O descriptor which gets freed after use. How is that relevant
to pfn vs page ?

>> Lastly: Why does pmem need to wait out-of-tree. Even you say above that
>> machines with lots of DRAM can enjoy the HUGE-to-4k split. So why
>> not let pmem waist 4k pages like everyone else and fix it as above
>> down the line, both for pmem and ram. And save both ways.
>> Why do we need to first change the all Kernel, then have pmem. Why not
>> use current infra structure, for good or for worth, and incrementally
>> do better.
> 
> There are two things going on here:
> 
> 1) You want to keep using struct page for now, while there are
>    subsystems that require it. This is perfectly legitimate.
> 
> 2) Matthew and Dan are changing over some subsystems to no longer
>    require struct page. This is perfectly legitimate.
> 

How is this legitimate when you need to Interface the [1] subsystems
under the [2] subsystem? A subsystem that expects pages is now not
usable by [2].

Today *All* the Kernel subsystems are [1] Period. How does it become
legitimate to now start *two* competing, do the same differently, abstraction,
in our kernel. We have two much diversity not to little.

> I do not understand why either of you would have to object to what
> the other is doing. There is room to keep using struct page until
> the rest of the kernel no longer requires it.
> 

So this is your vision "until the rest of the kernel no longer requires
pages" Really? Sigh, coming from other Kernels I thought pages were
a breeze of fresh air. I thought it was very clever. And BTW good luck
with that.

BTW: you have not solved the basic problem yet. for one pfn_kmap() given a
pfn what is its virtual address. would you like to loop through the
Kernel's range tables to look for the registered ioremap ? its a long
annoying loop. The page was invented exactly for this reason, to go through
the section object. And actually it is not that easy because if it is an ioremap
pointer it is in one list and if a page it is another way, and on top of
all this, it is ARCH dependent. And you are trashing highmem, because the
state and locks of that are at the page level. Not that I care about highmem
but I hate double coding. For god sake what do you guys have with poor old
pages, they were invented to exacly do this, abstract away management of a single
pfn-to-virt.
All I see is complains about page being 4K well it need not be. page can be any
size, and hell it can be variable size. (And no we do not need to add an extra size
member, all we need is the one bit)

Cheers
Boaz

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-24  9:41           ` Boaz Harrosh
  2015-03-24  9:41             ` Boaz Harrosh
@ 2015-03-24 16:57             ` Rik van Riel
  2015-03-24 16:57               ` Rik van Riel
  1 sibling, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2015-03-24 16:57 UTC (permalink / raw)
  To: Boaz Harrosh, Matthew Wilcox, Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel,
	Michael S. Tsirkin

On 03/24/2015 05:41 AM, Boaz Harrosh wrote:
> On 03/23/2015 05:19 PM, Rik van Riel wrote:

>> There are two things going on here:
>>
>> 1) You want to keep using struct page for now, while there are
>>    subsystems that require it. This is perfectly legitimate.
>>
>> 2) Matthew and Dan are changing over some subsystems to no longer
>>    require struct page. This is perfectly legitimate.
>>
> 
> How is this legitimate when you need to Interface the [1] subsystems
> under the [2] subsystem? A subsystem that expects pages is now not
> usable by [2].
> 
> Today *All* the Kernel subsystems are [1] Period.

That's not true. In the graphics subsystem it is very normal to
mmap graphics memory without ever using a struct page. There are
other callers of remap_pfn_range() too.

In these cases, refcounting is done by keeping a refcount on the
entire object, not on individual pages (since we have none).

> How does it become
> legitimate to now start *two* competing, do the same differently, abstraction,
> in our kernel. We have two much diversity not to little.

We are already able to refcount either the whole object, or an
individual page.

One issue is that not every subsystem can do the whole object
refcounting, and that it would be nice to have the refcounting
done by one single interface.

If we want the code to be the same everywhere, we could achieve
that just as well with an abstraction as with a single data
structure.

Maybe even something as simplistic as these, with the internals
automatically taking and releasing a refcount on the proper object:

get_reference(file, memory_address)

put_reference(file, memory_address)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/7] evacuate struct page from the block layer
  2015-03-24 16:57             ` Rik van Riel
@ 2015-03-24 16:57               ` Rik van Riel
  0 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2015-03-24 16:57 UTC (permalink / raw)
  To: Boaz Harrosh, Matthew Wilcox, Andrew Morton
  Cc: Dan Williams, linux-kernel, linux-arch, axboe, linux-nvdimm,
	Dave Hansen, linux-raid, mgorman, hch, linux-fsdevel,
	Michael S. Tsirkin

On 03/24/2015 05:41 AM, Boaz Harrosh wrote:
> On 03/23/2015 05:19 PM, Rik van Riel wrote:

>> There are two things going on here:
>>
>> 1) You want to keep using struct page for now, while there are
>>    subsystems that require it. This is perfectly legitimate.
>>
>> 2) Matthew and Dan are changing over some subsystems to no longer
>>    require struct page. This is perfectly legitimate.
>>
> 
> How is this legitimate when you need to Interface the [1] subsystems
> under the [2] subsystem? A subsystem that expects pages is now not
> usable by [2].
> 
> Today *All* the Kernel subsystems are [1] Period.

That's not true. In the graphics subsystem it is very normal to
mmap graphics memory without ever using a struct page. There are
other callers of remap_pfn_range() too.

In these cases, refcounting is done by keeping a refcount on the
entire object, not on individual pages (since we have none).

> How does it become
> legitimate to now start *two* competing, do the same differently, abstraction,
> in our kernel. We have two much diversity not to little.

We are already able to refcount either the whole object, or an
individual page.

One issue is that not every subsystem can do the whole object
refcounting, and that it would be nice to have the refcounting
done by one single interface.

If we want the code to be the same everywhere, we could achieve
that just as well with an abstraction as with a single data
structure.

Maybe even something as simplistic as these, with the internals
automatically taking and releasing a refcount on the proper object:

get_reference(file, memory_address)

put_reference(file, memory_address)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2015-03-24 16:58 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
2015-03-16 20:25 ` Dan Williams
2015-03-16 20:25 ` [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn Dan Williams
2015-03-16 20:25   ` Dan Williams
2015-03-16 23:05   ` Al Viro
2015-03-16 23:05     ` Al Viro
2015-03-17 13:02     ` Matthew Wilcox
2015-03-17 13:02       ` Matthew Wilcox
2015-03-17 15:53       ` Dan Williams
2015-03-17 15:53         ` Dan Williams
2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh
2015-03-18 10:47   ` Boaz Harrosh
2015-03-18 13:06   ` Matthew Wilcox
2015-03-18 13:06     ` Matthew Wilcox
2015-03-18 14:38     ` [Linux-nvdimm] " Boaz Harrosh
2015-03-18 14:38       ` Boaz Harrosh
2015-03-20 15:56       ` Rik van Riel
2015-03-20 15:56         ` Rik van Riel
2015-03-22 11:53         ` Boaz Harrosh
2015-03-22 11:53           ` Boaz Harrosh
2015-03-18 15:35   ` Dan Williams
2015-03-18 20:26 ` Andrew Morton
2015-03-19 13:43   ` Matthew Wilcox
2015-03-19 13:43     ` Matthew Wilcox
2015-03-19 15:54     ` [Linux-nvdimm] " Boaz Harrosh
2015-03-19 15:54       ` Boaz Harrosh
2015-03-19 19:59       ` Andrew Morton
2015-03-19 20:59         ` Dan Williams
2015-03-22 17:22           ` Boaz Harrosh
2015-03-22 17:22             ` Boaz Harrosh
2015-03-20 17:32         ` Wols Lists
2015-03-20 17:32           ` Wols Lists
2015-03-22 10:30         ` Boaz Harrosh
2015-03-22 10:30           ` Boaz Harrosh
2015-03-19 18:17     ` Christoph Hellwig
2015-03-19 19:31       ` Matthew Wilcox
2015-03-22 16:46       ` Boaz Harrosh
2015-03-20 16:21     ` Rik van Riel
2015-03-20 20:31       ` Matthew Wilcox
2015-03-20 21:08         ` Rik van Riel
2015-03-20 21:08           ` Rik van Riel
2015-03-22 17:06           ` Boaz Harrosh
2015-03-22 17:22             ` Dan Williams
2015-03-22 17:22               ` Dan Williams
2015-03-22 17:39               ` Boaz Harrosh
2015-03-22 17:39                 ` Boaz Harrosh
2015-03-20 21:17         ` Wols Lists
2015-03-20 21:17           ` Wols Lists
2015-03-22 16:24         ` Boaz Harrosh
2015-03-22 16:24           ` Boaz Harrosh
2015-03-22 15:51       ` Boaz Harrosh
2015-03-23 15:19         ` Rik van Riel
2015-03-23 15:19           ` Rik van Riel
2015-03-23 19:30           ` Christoph Hellwig
2015-03-23 19:30             ` Christoph Hellwig
2015-03-24  9:41           ` Boaz Harrosh
2015-03-24  9:41             ` Boaz Harrosh
2015-03-24 16:57             ` Rik van Riel
2015-03-24 16:57               ` Rik van Riel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).