From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jooyoung Hwang <jooyoung.hwang@samsung.com>
Subject: RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
Date: Tue, 09 Oct 2012 14:53:26 -0500
Message-ID: <1349812406.18456.41.camel@adg-desktop-09.cs.wisc.edu>
References: <415E76CC-A53D-4643-88AB-3D7D7DC56F98@dubeyko.com>
	 <9DE65D03-D4EA-4B32-9C1D-1516EAE50E23@dubeyko.com>
	 <1349553966.12699.132.camel@kjgkr> <50712AAA.5030807@gmail.com>
	 <002201cda46e$88b84d30$9a28e790$%kim@samsung.com>
	 <F7EA5AF7-8A1C-4C98-90AF-9D47E39731AB@dubeyko.com>
	 <004101cda52e$72210e20$56632a60$%kim@samsung.com>
	 <55A93BD0-CBCB-4707-A970-EB823EC54B2D@dubeyko.com>
	 <006f01cda5ec$e63e9b60$b2bbd220$%kim@samsung.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: 'Marco Stornelli' <marco.stornelli@gmail.com>,
	'Jaegeuk Kim' <jaegeuk.kim@gmail.com>,
	'Al Viro' <viro@zeniv.linux.org.uk>, tytso@mit.edu,
	gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
	chur.lee@samsung.com, cm224.lee@samsung.com,
	linux-fsdevel@vger.kernel.org
To: 'Vyacheslav Dubeyko' <slava@dubeyko.com>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mail-ia0-f174.google.com ([209.85.210.174]:35993 "EHLO
	mail-ia0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751597Ab2JITxa (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Tue, 9 Oct 2012 15:53:30 -0400
In-Reply-To: <006f01cda5ec$e63e9b60$b2bbd220$%kim@samsung.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Tue, 2012-10-09 at 16:08 +0900, Jaegeuk Kim wrote:
> > -----Original Message-----
> > From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> > Sent: Tuesday, October 09, 2012 4:23 AM
> > To: Jaegeuk Kim
> > Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu; gre=
gkh@linuxfoundation.org; linux-
> > kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com=
; jooyoung.hwang@samsung.com;
> > linux-fsdevel@vger.kernel.org
> > Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file syst=
em
> >=20
> > Hi,
> >=20
> > On Oct 8, 2012, at 12:25 PM, Jaegeuk Kim wrote:
> >=20
> > >> -----Original Message-----
> > >> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> > >> Sent: Sunday, October 07, 2012 9:09 PM
> > >> To: Jaegeuk Kim
> > >> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu; =
gregkh@linuxfoundation.org; linux-
> > >> kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.=
com; jooyoung.hwang@samsung.com;
> > >> linux-fsdevel@vger.kernel.org
> > >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file s=
ystem
> > >>
> > >> Hi,
> > >>
> > >> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
> > >>
> > >>>> -----Original Message-----
> > >>>> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
> > >>>> Sent: Sunday, October 07, 2012 4:10 PM
> > >>>> To: Jaegeuk Kim
> > >>>> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro; tyts=
o@mit.edu;
> > gregkh@linuxfoundation.org;
> > >>>> linux-kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@=
samsung.com;
> > >> jooyoung.hwang@samsung.com;
> > >>>> linux-fsdevel@vger.kernel.org
> > >>>> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file=
 system
> > >>>>
> > >>>> On 06/10/2012 22:06, Jaegeuk Kim wrote:
> > >>>>> 2012-10-06 (Sat), 17:54 +0400, Vyacheslav Dubeyko:
> > >>>>>> Hi Jaegeuk,
> > >>>>>
> > >>>>> Hi.
> > >>>>> We know each other, right? :)
> > >>>>>
> > >>>>>>
> > >>>>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
> > >>>>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.e=
du>,
> > >>>> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org, chur=
=2Elee@samsung.com,
> > >> cm224.lee@samsung.com,
> > >>>> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
> > >>>>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly fil=
e system
> > >>>>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
> > >>>>>>>
> > >>>>>>> This is a new patch set for the f2fs file system.
> > >>>>>>>
> > >>>>>>> What is F2FS?
> > >>>>>>> =============
> > >>>>>>>
> > >>>>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
> > >>>>>>> been widely being used for ranging from mobile to server systems. Since they are
> > >>>>>>> known to have different characteristics from the conventional rotational disks,
> > >>>>>>> a file system, an upper layer to the storage device, should adapt to the changes
> > >>>>>>> from the sketch.
> > >>>>>>>
> > >>>>>>> F2FS is a new file system carefully designed for the NAND flash memory-based storage
> > >>>>>>> devices. We chose a log structure file system approach, but we tried to adapt it
> > >>>>>>> to the new form of storage. Also we remedy some known issues of the very old log
> > >>>>>>> structured file system, such as snowball effect of wandering tree and high cleaning
> > >>>>>>> overhead.
> > >>>>>>>
> > >>>>>>> Because a NAND-based storage device shows different characteristics according to
> > >>>>>>> its internal geometry or flash memory management scheme aka FTL, we add various
> > >>>>>>> parameters not only for configuring on-disk layout, but also for selecting allocation
> > >>>>>>> and cleaning algorithms.
> > >>>>>>>
> > >>>>>>
> > >>>>>> What about F2FS performance? Could you share benchmarking re=
sults of the new file system?
> > >>>>>>
> > >>>>>> It is very interesting the case of aged file system. How is =
GC's implementation efficient?
> > Could
> > >>>> you share benchmarking results for the very aged file system s=
tate?
> > >>>>>>
> > >>>>>
> > >>>>> Although I have benchmark results, currently I'd like to see =
the results
> > >>>>> measured by community as a black-box. As you know, the result=
s are very
> > >>>>> dependent on the workloads and parameters, so I think it woul=
d be better
> > >>>>> to see other results for a while.
> > >>>>> Thanks,
> > >>>>>
> > >>>>
> > >>>> 1) Actually it's a strange approach. If you have got any resul=
ts you
> > >>>> should share them with the community explaining how (the workl=
oad, hw
> > >>>> and so on) your benchmark works and the specific condition. I =
really
> > >>>> don't like the approach "I've got the results but I don't say =
anything,
> > >>>> if you want a number, do it yourself".
> > >>>
> > >>> It's definitely right, and I meant *for a while*.
> > >>> I just wanted to avoid arguing with how to age file system in t=
his time.
> > >>> Before then, I share the primitive results as follows.
> > >>>
> > >>> 1. iozone in Panda board
> > >>> - ARM A9
> > >>> - DRAM : 1GB
> > >>> - Kernel: Linux 3.3
> > >>> - Partition: 12GB (64GB Samsung eMMC)
> > >>> - Tested on 2GB file
> > >>>
> > >>>          seq. read, seq. write, rand. read, rand. write
> > >>> - ext4:    30.753         17.066       5.06         4.15
> > >>> - f2fs:    30.71          16.906       5.073       15.204
> > >>>
> > >>> 2. iozone in Galaxy Nexus
> > >>> - DRAM : 1GB
> > >>> - Android 4.0.4_r1.2
> > >>> - Kernel omap 3.0.8
> > >>> - Partition: /data, 12GB
> > >>> - Tested on 2GB file
> > >>>
> > >>>          seq. read, seq. write, rand. read,  rand. write
> > >>> - ext4:    29.88        12.83         11.43          0.56
> > >>> - f2fs:    29.70        13.34         10.79         12.82
> > >>>
> > >>
> > >>
> > >> This is results for non-aged filesystem state. Am I correct?
> > >>
> > >
> > > Yes, right.
> > >
> > >>
> > >>> Due to the company secret, I expect to show other results after=
 presenting f2fs at korea linux
> > forum.
> > >>>
> > >>>> 2) For a new filesystem you should send the patches to linux-f=
sdevel.
> > >>>
> > >>> Yes, that was totally my mistake.
> > >>>
> > >>>> 3) It's not clear the pros/cons of your filesystem, can you sh=
are with
> > >>>> us the main differences with the current fs already in mainlin=
e? Or is
> > >>>> it a company secret?
> > >>>
> > >>> After forum, I can share the slides, and I hope they will be us=
eful to you.
> > >>>
> > >>> Instead, let me summarize at a glance compared with other file =
systems.
> > >>> Here are several log-structured file systems.
> > >>> Note that, F2FS operates on top of block device with considerat=
ion on the FTL behavior.
> > >>> So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are d=
esigned for raw NAND flash.
> > >>> LogFS is initially designed for raw NAND flash, but expanded to=
 block device.
> > >>> But, I don't know whether it is stable or not.
> > >>> NILFS2 is one of major log-structured file systems, which suppo=
rts multiple snap-shots.
> > >>> IMO, that feature is quite promising and important to users, bu=
t it may degrade the performance.
> > >>> There is a trade-off between functionalities and performance.
> > >>> F2FS chose high performance without any further fancy functiona=
lities.
> > >>>
> > >>
> > >> Performance is a good goal. But fault-tolerance is also very imp=
ortant point. Filesystems are used
> > by
> > >> users, so, it is very important to guarantee reliability of data=
 keeping. Degradation of
> > performance
> > >> by means of snapshots is arguable point. Snapshots can solve the=
 problem not only some
> > unpredictable
> > >> environmental issues but also user's erroneous behavior.
> > >>
> > >
> > > Yes, I agree. I concerned the multiple snapshot feature.
> > > Of course, fault-tolerance is very important, and file system sho=
uld support it as you know as
> > power-off-recovery.
> > > f2fs supports the recovery mechanism by adopting checkpoint simil=
ar to snapshot.
> > > But, f2fs does not support multiple snapshots for user convenienc=
e.
> > > I just focused on the performance, and absolutely, the multiple s=
napshot feature is also a good
> > alternative approach.
> > > That may be a trade-off.
> >=20
> > So, maybe I misunderstand something, but I can't understand the dif=
ference. As I know, snapshot in
> > NILFS2 is a checkpoint converted by user in snapshot. So, NILFS2's =
checkpoint is a log that adds new
> > file system's state changing (user data + metadata). In other words=
, checkpoint is mechanism of
> > writing on volume. Moreover, NILFS2 gives flexible way of checkpoin=
t/snapshot management.
> >=20
> > As you are saying, f2fs supports checkpoints also. It means for me =
that checkpoints are the basic
> > mechanism of writing operations on f2fs. But, about what performanc=
e gain and difference do you talk?
>=20
> How about the following scenario?
> 1. data "a" is newly written.
> 2. checkpoint "A" is done.
> 3. data "a" is truncated.
> 4. checkpoint "B" is done.
>
> If fs supports multiple snapshots like "A" and "B" to users, it cannot reuse the space allocated by
> data "a" after checkpoint "B" even though data "a" is safely truncated by checkpoint "B".
> This is because fs should keep data "a" to prepare a roll-back to "A".
> So, even though user sees some free space, LFS may suffer from cleaning due to the exhausted free space.
> If users want to avoid this, they have to remove snapshots by themselves. Or, maybe automatically?
>=20
> >=20
> > Moreover, user can't manage by f2fs checkpoints completely, as I ca=
n understand. It is not so clear
> > what critical points can be a starting points of recovery actions. =
How is it possible to define how
> > many checkpoints f2fs volume will have?
>
> IMHO, user does not need to know how many snapshots there exist and track the fs utilization all the time.
> (off list: I don't know why cleaning process should be tuned by users.)
>
> f2fs writes two checkpoints alternatively. One is for the last stable checkpoint and another is for next checkpoint.
> So, during the recovery, f2fs starts to find one of the latest stable checkpoint.
> The stable checkpoint must have whole index structures and data consistently.
> As you knew, many things can be found in the following LFS paper.
> http://www.cs.berkeley.edu/~brewer/cs262/LFS.pdf
>=20
>=20
> >=20
> > How many user data (metadata) can be lost in the case of sudden pow=
er off? Is it possible to estimate
> > this?
> >=20
>=20
> If user calls sync, f2fs via vfs writes all the data, and it writes a checkpoint.
> In that case, all the data are safe.
> After sync, several fsync can be triggered, and it occurs sudden power off.
> In that case, f2fs first performs roll-back to the last stable checkpoint among two, and then roll-forward to recover fsync'ed data only.
> So, f2fs recovers data triggered by sync or fsync only.
>=20
> > >
> > >> As I understand, it is not possible to have a perfect performanc=
e in all possible workloads. Could
> > you
> > >> point out what workloads are the best way of F2FS using?
> > >
> > > Basically I think the following workloads will be good for F2FS.
> > > - Many random writes : it's LFS nature
> > > - Small writes with frequent fsync : f2fs is optimized to reduce =
the fsync overhead.
> > >
> >=20
> > Yes, it can be so for the case of non-aged f2fs volume. But I am af=
raid that for the case of aged f2fs
> > volume the situation can be opposite. I think that in the case of a=
ged state of f2fs volume the GC
> > will be under hard work in above-mentioned workloads.
>=20
> Yes, you're right.
> In the LFS paper above, there are two logging schemes: threaded logging and copy-and-compaction.
> In order to avoid high cleaning overhead, f2fs adopts a hybrid one which changes the allocation policy dynamically
> between two schemes.
> Threaded logging is similar to the traditional approach, resulting in random writes without cleaning operations.
> Copy-and-compaction is another name of cleaning, resulting in sequential writes with cleaning operations.
> So, f2fs adopts one of them in runtime according to the file system status.
> Through this, we could see the random write performance comparable to ext4 even in the worst case.
>=20
> >=20
> > But, as I can understand, smartphones and tablets are the most prom=
ising way of f2fs using. Because
> > f2fs designs for NAND flash memory based-storage devices. So, I thi=
nk that such workloads as "many
> > random writes" or "small writes with frequent fsync" are not so fre=
quent use-cases. Use-case of
> > creation and deletion many small files can be more frequent use-cas=
e under smartphones and tablets.
> > But, as I can understand, f2fs has slightly expensive metadata payl=
oad in the case of small files
> > creation. Moreover, frequent and random deletion of small files end=
s in the very sophisticated and
> > unpredictable GC behavior, as I can understand.
> >=20
>=20
> I'd like to share the following paper.
> http://research.cs.wisc.edu/adsl/Publications/ibench-tocs12.pdf
>=20
> In our experiments *also* on android phones, we've seen many random patterns with frequent fsync calls.
> We found that the main problem is database, and I think f2fs is beneficial to this.
> As you mentioned, I agree that it is important to handle many small files too.
> It is right that this may cause additional cleaning overhead, and f2fs has some metadata payload overhead.
> In order to reduce the cleaning overhead, f2fs adopts static and dynamic hot and cold data separation.
> The main goal is to split the data according to their type (e.g., dir inode, file inode, dentry data, etc) as much as possible.
> Please see the document in detail.
> I think this approach is quite effective to achieve the goal.
> BTW, the payload overhead can be resolved by adopting embedding data in the inode likewise ext4.
> I think it is also good idea, and I hope to adopt it in future.
>=20

I'd also like to point you to the following report on mobile workload
patterns:
http://www.cs.cmu.edu/~fuyaoz/courses/15712/report.pdf
It reports that Android issues fsync frequently, and that most of those
calls cover only a small amount of data.

To make fsync efficient, F2FS minimizes the amount of metadata written
to serve an fsync. An fsync in F2FS completes by writing the user data
blocks and the direct node blocks that point to them, rather than by
creating a new checkpoint, which would incur more I/O.
If a sudden power failure happens, the F2FS recovery routine rolls back
to the latest checkpoint and then recovers the file system state to
reflect all the completed fsync operations; we call this roll-forward
recovery.
You may want to look at the roll-forward code in recover_fsync_data().
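
As a rough illustration only, here is a toy model in Python of the
roll-back plus roll-forward idea. It is not the f2fs code and says
nothing about the real on-disk format; the class ToyLog and all of its
fields and methods are made up for this sketch. It assumes fsync'ed
records are durable and kept in write order, and that everything else
written since the last checkpoint is lost on power failure.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLog:
    """Hypothetical toy model of checkpoint + fsync recovery (not f2fs)."""
    # State captured by the last completed checkpoint (durable).
    checkpoint: dict = field(default_factory=dict)
    # Records made durable by fsync after that checkpoint, in order.
    fsync_log: list = field(default_factory=list)
    # Dirty data held only in memory; lost on power failure.
    pending: dict = field(default_factory=dict)

    def write(self, path, data):
        self.pending[path] = data

    def fsync(self, path):
        # Persist only this file's data; no new checkpoint is written.
        if path in self.pending:
            self.fsync_log.append((path, self.pending.pop(path)))

    def sync(self):
        # Full checkpoint: fold fsync'ed records, then newer dirty data.
        for path, data in self.fsync_log:
            self.checkpoint[path] = data
        self.checkpoint.update(self.pending)
        self.fsync_log.clear()
        self.pending.clear()

    def crash_and_recover(self):
        self.pending.clear()            # power failure drops dirty data
        state = dict(self.checkpoint)   # roll back to the last checkpoint
        for path, data in self.fsync_log:
            state[path] = data          # roll forward over fsync'ed data
        return state


fs = ToyLog()
fs.write("a", "v1")
fs.sync()                # checkpoint contains a=v1
fs.write("a", "v2")
fs.fsync("a")            # a=v2 is durable without a checkpoint
fs.write("b", "v1")      # never fsync'ed
recovered = fs.crash_and_recover()
print(recovered)         # "a" comes back as "v2"; "b" is lost
```

In this model the cost of fsync is one appended record, while sync pays
for a full checkpoint, which is the trade-off described above.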

> > >>
> > >>> Maybe or obviously it is possible to optimize ext4 or btrfs to =
flash storages.
> > >>> IMHO, however, they are originally designed for HDDs, so that i=
t may or may not suffer from
> > >> fundamental designs.
> > >>> I don't know, but why not designing a new file system for flash=
 storages as a counterpart?
> > >>>
> > >>
> > >> Yes, it is possible. But F2FS is not flash oriented filesystem a=
s JFFS2, YAFFS2, UBIFS but block-
> > >> oriented filesystem. So, F2FS design is restricted by block-laye=
r's opportunities in the using of
> > >> flash storages' peculiarities. Could you point out key points of=
 F2FS design that makes this design
> > >> fundamentally unique?
> > >
> > > As you can see the f2fs kernel document patch, I think one of the=
 most important features is to
> > align operating units between f2fs and ftl.
> > > Specifically, f2fs has section and zone, which are cleaning unit =
and basic allocation unit
> > respectively.
> > > Through these configurable units in f2fs, I think f2fs is able to=
 reduce the unnecessary operations
> > done by FTL.
> > > And, in order to avoid changing IO patterns by the block-layer, f=
2fs merges itself some bios
> > likewise ext4.
> > >
> >=20
> > As I can understand, it is not so easy to create partition with f2f=
s volume which is aligned on
> > operating units (especially in the case of eMMC or SSD).
>=20
> Could you explain why it is not so easy?
>=20
> > Performance of unaligned volume can degrade
> > significantly because of FTL activity. What mechanisms has f2fs for=
 excluding such situation and
> > achieving of the goal to reduce unnecessary FTL operations?
>=20
> Could you please explain your concern more exactly?
> In the kernel doc, the start address of f2fs data structure is aligne=
d to the segment size (i.e., 2MB).
> Do you mean that or another operating units (e.g., section and zone)?
>=20
> Thanks,
>=20
> >=20
> > With the best regards,
> > Vyacheslav Dubeyko.
> >=20
> > >>
> > >> With the best regards,
> > >> Vyacheslav Dubeyko.
> > >>
> > >>
> > >>>>
> > >>>> Marco
> > >>>
> > >>> ---
> > >>> Jaegeuk Kim
> > >>> Samsung
> > >>>
> > >>> --
> > >>> To unsubscribe from this list: send the line "unsubscribe linux=
-kernel" in
> > >>> the body of a message to majordomo@vger.kernel.org
> > >>> More majordomo info at  http://vger.kernel.org/majordomo-info.h=
tml
> > >>> Please read the FAQ at  http://www.tux.org/lkml/
> > >
> > >
> > > ---
> > > Jaegeuk Kim
> > > Samsung
> > >
>=20
>=20
> ---
> Jaegeuk Kim
> Samsung
>=20
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kerne=
l" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
Jooyoung Hwang
Samsung Electronics

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html