blog

View on GitHub

Evolution of a simple du -s clone

So, the other day I was writing a simple utility to automate some repetitive tasks in my research, involving processing raw data files and transferring them to a remote server. The transfer itself uses rsync, and rsync doesn’t have a progress bar. OK, it does have a way to output progress, but it isn’t a bar and the percentage tends to jump around randomly, even when you pass flags that supposedly make it not do that. Part of the issue is that measuring progress is hard: rsync might not even transfer all the files, since it diffs as it goes, and estimating progress when filesystems and networks are involved can be fraught. My idea was to show two progress bars simultaneously: one counting the files transferred as a percentage of the total number, and one watching the size of the destination directory as a percentage of the total data to be transferred. The actual progress are produced by indicatif, which I recommend – but that’s a subject for another post. For the latter calculation, I needed a function to measure the size of a directory on the local filesystem. Yes, I could just shell out to du and parse the output, but that’s no fun. And so, several yaks later, we have this post.

What I want to show is the process of iteratively making the code better and more expressive using several crates from crates.io. The code is available.

Boilerplate

First, some boilerplate for errors and a driving function. This is mostly shared between all versions of the code that I’ll show here.

We’ll use error-chain for dead-simple error propagation. Add a dependency…

[dependencies]
error-chain = "0.10"

…and make an Error enum with just one clause for chaining std::io::Errors…

error_chain! {
    errors {
        Io(op: &'static str, path: PathBuf) {
            description("I/O operation failed")
            display("Could not {} {}", op, path.display())
        }
    }
}

…and finally, here’s a simple main to drive the local_du function.

fn main() {
    let dir = PathBuf::from(env::args().skip(1).next().unwrap());
    match local_du(&dir) {
        Ok(bytes) => println!("{} {}", bytes, dir.display()),
        Err(error) => println!("ERROR: {:?}", error),
    }
}

v1: quick and dirty

Here is the first version of my local_du function. You’ll notice that I like chaining iterators. The implementation is a fairly straightforward depth-first search of the directory structure. We list the directory entries, recurse when encountering a directory (line 9), and use fold to add up the results.

#[macro_use] extern crate error_chain;
use std::env;
use std::path::{Path, PathBuf};
use self::ErrorKind::*;
fn local_du(path: &Path) -> Result<u64> {
    Ok(path.read_dir()
           .chain_err(|| Io("list", path.into()))?
           .map(|res| match res {
               Ok(entry) => {
                   match entry.metadata() {
                       Ok(meta) => {
                           if meta.is_dir() {
                               local_du(&entry.path()).unwrap_or_else(|e| {
                                   println!("warning: could not list directory {} (skipping): {}",
                                            entry.path().display(), e);
                                   0
                               })
                           } else {
                               meta.len()
                           }
                       }
                       Err(e) => {
                           println!("warning: could not stat {} (skipping): {}",
                                    entry.path().display(), e);
                           0
                       }
                   }
               }
               Err(e) => {
                   println!("warning: could not stat entry in {} (skipping): {}",
                            path.display(), e);
                   0
               }
           })
           .fold(path.metadata().chain_err(|| Io("stat", path.into()))?.len(),
                 |a, b| a + b))
}

Now, this code is bad. It’s ugly and highly indented – you had to scroll to the right to read it, if you even bothered. I had to clumsily add in the size of the original directory as the initial value for fold. It handles errors eagerly because that was the simplest thing that occurred to me at the time. The actual functionality that I advertised (depth-first search and summation) is obscured by irrelevant details and closing braces. However, it does work!

Morpheus meme

At this point… well, at this point I committed the working1 code and continued with my research. But when I came back to the issue, my first idea was to use the awesome walkdir crate to invert the function structure. walkdir provides an iterator that factors out the recursive depth-first search, letting you deal directly with a linear stream of files and directories.

v2: flip the script

We’ll need one more dependency.

[dependencies]
error-chain = "0.10"
walkdir = "1"

And import the iterator struct.

extern crate walkdir;
use walkdir::WalkDir;

It turns out we don’t need to chain io::Errors anymore, so that stanza can be deleted, leaving just one for walkdir::Error. I could just remove the whole chain and return Result<u64, walkdir::Error>, but in a real program, there would likely be more error types.

error_chain! {
    foreign_links {
        WalkDir(::walkdir::Error);
    }
}

As I mentioned, using the walkdir iterator inverts the control flow, so you won’t see any recursion in this iterator chain.

fn local_du<P: AsRef<Path>>(path: P) -> Result<u64> {
    WalkDir::new(path)
        .into_iter()
        .map(|entry| entry.and_then(|entry| entry.metadata())
                          .map(|meta| meta.len())
                          .map_err(Into::into))
        .fold(Ok(0),
              |a, b| match (a, b) {
                  (Ok(a), Ok(b)) => Ok(a + b),
                  (e @ Err(_), _) | (_, e @ Err(_)) => e,
              })
}

Hey, this code is looking a lot better! It’s almost 64% shorter, and I think it is way more expressive. You can read it from top to bottom:

This time, errors are handled lazily, waiting until the fold to do anything. The behavior is different from v1: in this case, any error will cause the whole thing to fail, while previously erroring files or directories would just be skipped. But for this application that doesn’t matter to me, and I haven’t run into any such errors while testing, so if one does happen, it probably indicates something alarming and maybe it’s better to fail loudly.

Anyway, a few aspects of this are still annoying to me. For one, there’s friction when using iterator adapters, because the elements being yielded are Results. So I have Result::and_then and such nested inside Iterator::map. We’re doing fallible operations inside the closures, but we can’t use ?. The friction gets worse during the fold operation, which has to do some gymnastics to propagate the first error until it can be question-marked afterwards. I dimly remembered a crate that could help with this. Enter fallible-iterator!

v3: final form

It’s one more dependency.

[dependencies]
error-chain = "0.10"
walkdir = "1"
ignore = "0.2"

And we import the trait plus a free function that converts any Iterator<Item=Result<T,E>> to a FallibleIterator<Item=T,Error=E>.

extern crate fallible_iterator;
use fallible_iterator::{FallibleIterator, convert};

When you have a FallibleIterator, the familiar iterator adapters, like map, filter, and fold only need to deal with Ok values, and errors are propagated automatically. FallibleIterator::and_then is like map, but much like Result::and_then it returns a new Result, so it can turn Ok values into Errs.

fn local_du<P: AsRef<Path>>(path: P) -> Result<u64> {
    Ok(convert(WalkDir::new(path).into_iter())
           .and_then(|entry| Ok(entry.metadata()?.len()))
           .fold(0, |a, b| a + b)?)
}

This is the final form of the code, and I find it beautiful. The intent reads directly from the code: iterate over directory entries, try to look up the metadata and pull out the size, then add up the results. Error handling behavior is identical to v2, but it is entirely abstracted away, by the built-in ? and by fallible-iterator.

Free lunch

Finally, a word about zero-cost abstractions. You might think I paid a price for simplifying the code and lazifying the error handling. But I didn’t! Compiling in debug mode, you would be right. But nobody compiles in debug mode, right? It’s just too slow for real work. In release mode, these programs have very nearly identical runtimes, validating Rust’s core promise of zero-cost abstractions. Here’s the data:

I ran each version of the code on the same directory. For each, I ran it 20 times to warm up whatever caches there might be, then 100 times to collect timing data. Incidentally, there is 29 GB of data in 421,557 files in the subject directory. I also compared against native du -s.2 Here are the results, in table form:

Version Mean Std
du 9.7803 s 0.07262
v1 10.2459 s 0.10183
v2 10.3733 s 0.10564
v3 10.3423 s 0.08861

…and in graph form:

Box plot of timing results


Unresolved questions


  1. It did not actually work 100% correctly, as I neglected to count directory sizes in the first version (not shown), but it was close. 

  2. Note that du counts in 512-byte blocks by default. On OSX (presumably BSD as well?), you can’t set the block size any lower than that. So implementations like mine will report lower numbers, and it’s not as simple as just dividing by 512 because du rounds up to the block size for every file. On Linux, du -sb and my programs agree. As it happens, the number I get is more useful for my purposes, since I’m estimating network transfer time, not provisioning hard drive sectors. 

Comments

Want to leave a comment? Visit this post's issue page on GitHub (you'll need a GitHub account).