Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why I think static tree representations tend to be less bug-prone than patches #72

Open
joliss opened this issue Feb 1, 2017 · 1 comment
Labels

Comments

@joliss
Copy link
Contributor

joliss commented Feb 1, 2017

If we use in-memory data structures rather than physical directories to speed up communication between plugins, we have two options as to what exactly we exchange: (1) A list of changes, or (2) a static view of the files like entries. The current usage of FSTree seems to be favoring option (1). I've alluded in the past that I'd prefer (2) for usability reasons, but I've never properly explained this. So let me do this now.

There's no action items to do in response to this GitHub issue, but if you're working on FSTree, and particularly on getting FSTree into the official node API, I'd simply like you to read this please!

Here's the specifics of each option:

  1. A patch, i.e. a list of changes, looking like [['change', 'foo/a.js'], ['create', 'bar/b.js'], ['unlink', 'bar/c.js']]

  2. A static representation of the directory tree, similar to the current entries structure, but Merkle-tree-like, e.g.

    // First argument (200) is maximum last-modified time of all
    // subdirectories and files recursively, taking the role of the hash in a
    // proper Merkle tree. Current time is 200.
    Directory(200, {
      'foo': Directory(200, {
        'a.js': File(200)
      }),
      'bar': Directory(200, {
        'b.js': File(200),
        /* no c.js here */
        'd.js': File(100)
      })
    })

Now clearly these options are technically isomorphic: We can reconstruct an in-memory Merkle tree of files by applying the patch to it on each rebuild, and we can conversely reconstruct the patch by diffing the current tree against the tree we got on the previous rebuild. Performance of this diffing operation is, by the way, the reason why I'm suggesting a Merkle-tree-like structure: With a flat entry list, diffing is O(totalFiles), whereas with a Merkle tree, it is O(changedFiles) (except for pathological cases), similar to how git diff commit1..commit2 is very fast even on large trees.

However, despite this isomorphism, what I want to argue here is that there is a major usability difference between the two, in that option (1) encourages a class of subtle caching bugs. To make plugin authors "fall into a pit of success" as Stef likes to say, I'd like to steer them towards working against the static representation by default, and only use diffing when it's really needed for performance.

Here's an example of the type of bug that can arise if you try to process a list of changes.

Let's say we're writing a plugin that transforms any file foo.js with an optional neighboring foo.js.map file into some output file(s).

If you were to write down an algorithm for this with caching, it would probably look like this (pseudo-code):

function build(changes) {
  changes.on(['create', 'change'], function(entry) {
    if (!/\.js$/.test(entry.relativePath)) continue;

    var outputFileName = ...;
    // compile will automatically pick up .map files
    writeFileSync(outputFileName, compile(entry.relativePath));
  }

  // Must not forget to handle deletion
  changes.on(['unlink'], function(entry) {
    if (!/\.js$/.test(entry.relativePath)) continue;

    var outputFileName = ...;
    unlinkSync(outputFileName):
  })
}

Can you spot the bug in the caching logic?

The bug is this: When a .js.map file is created, changed or unlinked, but the corresponding .js file is not changed, the .js file still needs to be recompiled. The code above doesn't do this. When we fix this bug, we also need to take care to deduplicate change events when both the .js and the .js.map file change.

So the "straightforward" code above contains a subtle bug, and the correct implementation is considerably more involved, even for a simple case like this where you have a single static dependency, rather than a set of dynamic dependencies due to import mechanisms. To make matters worse, the bug is rarely triggered in practice and so will go unnoticed. As a result, we end up with rare cases of broken caching, eroding trust in the rebuild system and causing people to restart Broccoli whenever they switch branches "just in case".

This class of bugs isn't hypothetical at all. I sat down for a chat with Igor Minar a while back to talk about in-memory representations, and he told me about a TypeScript plugin that picked up .ts files plus optional neighboring .js files. (The .js files allow for manually defining JavaScript bridges for TypeScript code when the automatic bridge isn't good enough.) He described the plugin to me as using patches, and I said, "can you pull up the code? I want to show to you a problem that happens with patches." He pulled up the code, and within a minute of looking at it, I spotted exactly this issue. The bug was there, and I had been able to predict that it was there without knowing the code. This was in an (otherwise) well-written and clean plugin implementation, written by an experienced JavaScript developer.

Let me contrast the previous, buggy implementation with another straightforward implementation, but this time we use a static representation of the filesystem rather than a patch:

function build(inputFileTree) {
  // oldCache/newCache is a cache-purging pattern: newCache will only contain
  // those cache entries that were needed for this rebuild
  var oldCache = this.cache || {};
  var newCache = {};

  var outputDirectoryTree = new DirectoryTree;
  for (var jsFile in inputDirectoryTree.walkSync('**/*.js')) {
    // hashFiles hashes [fileName, mtime] for files that exist, or [fileName,
    // null] for files that don't exist.
    var cacheKey = inputDirectoryTree.hashFiles([jsFile, jsFile + '.map']);

    // Get from cache, or compile if changed
    var outputFiles = oldCache.cacheKey ||
      compile(jsFile);
    outputFileTree.add(outputFiles);
    newCache.cacheKey = outputFiles;
  }

  this.cache = newCache;

  return outputDirectoryTree;
}

This implementation is quite straightforward, and it's correct! We built an ad-hoc cache in only a few lines of code on top of the file tree primitive.

Now I should mention that the disadvantage of the second implementation is that we're doing an O(n) in-memory iteration of all the input files, costing us a millisecond or so. There are several ways to get around this, and some of them involve using patches, but in those cases, we can move the (bug-prone) patch-processing code into a carefully-reviewed shared library. That library might call inputFileTree.diff(this.previousInputFileTree) and process the resulting patch. The main point I'm getting at here though is that in the static-representation case, a reasonably-fast naive implementation that doesn't use a supporting library will tend to come out correct.

Secondly, I should mention that in the static representation case, to get patches, like inputFileTree.diff(this.previousInputFileTree), we need to make sure that the producer doesn't mutate the previousInputFileTree when it rebuilds, and we need to make sure that we don't try to physically access any of the files the previousInputFileTree points to, because those files might have gone away in the meantime.

On balance, I still believe that steering users towards static representations, rather than patches, will result in more reliable and easier-to-maintain code.

Thanks for reading! :-)

@joliss joliss changed the title Why I like static tree representations better than patches Why I think static tree representations tend to be more usable than patches Feb 1, 2017
@joliss joliss changed the title Why I think static tree representations tend to be more usable than patches Why I think static tree representations tend to be less bug-prone than patches Feb 1, 2017
@krisselden
Copy link
Collaborator

I can confirm, I started trying to make a few things build on the diff. That was the direction I was headed with broccoli-typescript-compiler. I started trying to abstract building from changes in https://github.com/krisselden/broccoli-processor and started like everyone does naively modeling it as a basic dependency graph.

When I sketched this on paper, I came up with what I thought to be the primary problem. There are 3 different edges that often produce the same graph but when they don't, the bugs it leads to are very bad and require a clean build to fix.

  • build order: this thing is a prerequisite for me to be built
  • causality: this thing existing, causes me to come into existence and produce my result, and when all my cause edges go away my result should never be produced
  • consumed: this thing is something I consume to produce my result and thus invalidates my result when it changes

I had sketched out some basic examples where these edges did not produce the same graph. I need to find my notes and diagram them but I had some very basic examples. One basic thing is, if you are doing mkdir -p and not handling this a graph with these distinct ends where each dir creation is a node, and your system involves imports and search paths, and you are modeling causality and trying to cache just on a key from your causal relationship, you have bugs, you just don't know it yet.

I ended up tabling the idea for now to just get something done, I now just diff input to invalidate the version of the source files for the TS language service but it always follows the same steps, and emits the whole program and diffs it to patch the output (right now in memory, but I could emit it to the cache path and diff that against the output path). https://github.com/tildeio/broccoli-typescript-compiler/blob/master/src/compiler.ts#L46

I planned to come back and write the tests for the scenarios I described above and finish that other project but I'm doubting when I'll have time for that.

But this is for sure one of those things that seems to work %95 but the naive thing really doesn't, it just isn't obvious for a while.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants