Fpart - README

What is fpart ?
***************

Fpart is a tool that helps you sort file trees and pack them into bags (called
"partitions"). It is developped in C and available under the BSD license.

It splits a list of directories and file trees into a certain number of
partitions, trying to produce partitions with the same size and number of files.
It can also produce partitions with a given number of files or a limited size.

Once generated, partitions are either printed as file lists to stdout (default)
or to files. Those lists can then be used by third party programs.

Fpart also includes a live mode, which allows it to crawl very large filesystems
and produce partitions in live. Hooks are available to act on those partitions
(e.g. immediatly start a transfer using rsync(1)) without having to wait for
the filesystem traversal job to be finished. Used this way, fpart can be seen
as a powerful data migration tool.

Compatibility :
***************

Fpart has been successfully tested on :

* FreeBSD (7x, 8x, 9x, amd64)
* GNU/Linux (x86_64, arm)
* Solaris 10 (i386)
* Solaris 9 (Sparc)
* NetBSD (amd64, alpha)

and will probably work on other operating systems too (*).

(*) fpart built as a static binary within a Debian (armel) chroot will give you
a powerful tool for backing-up your Android (arm) phone ;-)

Examples - common usage :
*************************

The following will produce 3 partitions, with (approximatively) the same size
and number of files. Three files : "var-parts.[0-2]", are generated as output :

$ fpart -n 3 -o var-parts /var

$ ls var-parts*
var-parts.0 var-parts.1 var-parts.2

$ head -n 2 var-parts.0
/var/some/file1
/var/some/file2

The following will produce partitions of 4.3 GB, containing music files ready
to be burnt to a DVD (for example). Files "music-parts.[0-n]", are generated
as output :

$ fpart -s 4617089843 -o music-parts /path/to/my/music

The following will produce partitions containing 10000 files each by examining
/usr first and then /home and display only partition 0 on stdout :

$ find /usr ! -type d | fpart -f 10000 -i - /home | grep '^0:'

The following will produce two partitions by re-using du(1) output. Fpart will
not examine the filesystem but instead re-use arbitrary values printed by du(1)
and sort them :

$ du * | fpart -n 2 -a

Examples - live mode :
**********************

By default, fpart will wait for FS crawling to terminate before generating and
displaying partitions. If you use the live mode (option -L), fpart will display
each partition as soon as it is complete. You can combine this option with
hooks; they will be triggered just before (pre-part hook, option -w) or after
(post-part hook, option -W) partitions' completion.

Hooks provide several environment variables (see fpart(1)); they are a
convenient way of getting information about fpart's and partition's current
states. For example, ${FPART_PARTFILENAME} will contain the name of the output
file of the partition that has just been generated; using that variable within a
post-part hook permits starting manipulating the files just after partition
generation.

See the following example :

$ mkdir foo && touch foo/{bar,baz}
$ fpart -L -f 1 -o /tmp/part.out -W \
    'echo == ${FPART_PARTFILENAME} == ; cat ${FPART_PARTFILENAME}' foo/
== /tmp/part.out.0 ==
foo/bar
== /tmp/part.out.1 ==
foo/baz
2 file(s) found.

This example crawls foo/ in live mode (option -L). For each file (option -f,
1 file per partition), it generates a partition into /tmp/part.out.<n>
(option -o; <n> is the partition index and will be automatically added by fpart)
and executes the following post-part hook (option -W) :

echo == ${FPART_PARTFILENAME} == ; cat ${FPART_PARTFILENAME}

This hook will display the name of current partition's output file name as well
as display its contents.

Examples - migrating data :
***************************

Here is a more complex example that will show you how to use fpart, GNU Parallel
and Rsync to split up a directory and immediately schedule data synchronization
of smaller lists of files, while FS crawling goes on. We will be synchronizing
data from /data/src to /data/dest.

First, go to the source directory (as rsync's --files-from option takes a file
list relative to its source directory) :

$ cd /data/src

Then, run fpart from here :

$ fpart -L -f 10000 -x '.snapshot' -x '.zfs' -Z -o /tmp/part.out -W \
  '/usr/local/bin/sem -j 3
    "/usr/local/bin/rsync -av --files-from=${FPART_PARTFILENAME}
      /data/src/ /data/dest/"' .

This command will start fpart in live mode (option -L), making it generate
partitions during FS crawling. Fpart will produce partitions containing at most
10000 files each (option -f), will skip files and folders named '.snapshot' or
'.zfs' (option -x) and will list empty and non-accessible directories (option
-Z; that option is necessary when working with rsync to make sure the whole file
tree will be re-created within the destination directory). Last but not least,
each partition will be written to /tmp/part.out.<n> (option -o) and used within
the post-part hook (option -W), run immediately by fpart once the partition is
complete :

/usr/local/bin/sem -j 3
    "/usr/local/bin/rsync -av --files-from=${FPART_PARTFILENAME} /data/src/ /data/dest/"

This hook is itself a nested command. It will run GNU Parallel's sem scheduler
(any other scheduler will do) to run at most 3 rsync in parallel.

The scheduler will finally trigger the following command :

/usr/local/bin/rsync -av --files-from=${FPART_PARTFILENAME} /data/src/ /data/dest/

where ${FPART_PARTFILENAME} will be part of rsync's environment when it runs
and contain the file name of the partition that has just been generated.

That's all, folks ! Pretty simple, isn't it ?

In this example, file crawling and data transfer are run from the same -local-
machine, but you can use it as the basis of a much sophisticated solution : at
work, by using a cluster of machines connected to our filers through NFS and
running Open Grid Scheduler, we successully migrated over 400 TB of data.

Limitations :
*************

Fpart will *NOT* modify data, it will *NOT* split your files !

As a consequence, if you have a directory containing several small files and a
huge one, it will be unable to produce partitions with the same size. Fpart does
magic, but not that much ;-)

Installing :
************

For FreeBSD users, fpart is already available in ports, see sysutils/fpart. You
can install fpart from there or from sources if you really need the latest
revision.

Installing from sources is simple. First, if there is no 'configure' script in
the main directory, run :

$ autoreconf -i

(autoreconf comes from the GNU autotools), then run :

$ ./configure
$ make

to configure and build fpart.

Finally, install fpart (as root) :

# make install

See also :
**********

See fpart(1) for more details.

The partition problem is detailed here :
http://en.wikipedia.org/wiki/Partition_problem

and here :
http://en.wikipedia.org/wiki/Bin_packing_problem

I am sure you will also be interested in :
https://github.com/jbd/packo

which was developped by Jean-Baptiste Denis as the original proof of concept.

Author / Licence :
******************

Fpart has been written by Ganal LAPLANCHE <ganael.laplanche@martymac.org>
and is available under the BSD license (see COPYING for details).

Thanks to Jean-Baptiste Denis for having given me the idea of this program !

Contributions :
***************

FTS code comes from FreeBSD :
    lib/libc/gen/fts.c -> fts.c
    include/fts.h -> fts.h
and is available under the BSD license.
