|
| 1 | + |
| 2 | + GIT - the stupid content tracker |
| 3 | + |
| 4 | +"git" can mean anything, depending on your mood. |
| 5 | + |
| 6 | + - random three-letter combination that is pronounceable, and not |
| 7 | + actually used by any common UNIX command. The fact that it is a |
| 8 | + mispronunciation of "get" may or may not be relevant. |
| 9 | + - stupid. contemptible and despicable. simple. Take your pick from the |
| 10 | + dictionary of slang. |
| 11 | + - "global information tracker": you're in a good mood, and it actually |
| 12 | + works for you. Angels sing, and a light suddenly fills the room. |
| 13 | + - "goddamn idiotic truckload of sh*t": when it breaks |
| 14 | + |
| 15 | +This is a stupid (but extremely fast) directory content manager. It |
| 16 | +doesn't do a whole lot, but what it _does_ do is track directory |
| 17 | +contents efficiently. |
| 18 | + |
| 19 | +There are two object abstractions: the "object database", and the |
| 20 | +"current directory cache". |
| 21 | + |
| 22 | + The Object Database (SHA1_FILE_DIRECTORY) |
| 23 | + |
| 24 | +The object database is literally just a content-addressable collection |
| 25 | +of objects. All objects are named by their content, which is |
| 26 | +approximated by the SHA1 hash of the object itself. Objects may refer |
| 27 | +to other objects (by referencing their SHA1 hash), and so you can build |
| 28 | +up a hierarchy of objects. |
| 29 | + |
| 30 | +There are several kinds of objects in the content-addressable collection |
| 31 | +database. They are all deflated with zlib, and start off with a tag |
| 32 | +of their type, and size information about the data. The SHA1 hash is |
| 33 | +always the hash of the _compressed_ object, not the original one. |
| 34 | + |
| 35 | +In particular, the consistency of an object can always be tested |
| 36 | +independently of the contents or the type of the object: all objects can |
| 37 | +be validated by verifying that (a) their hashes match the content of the |
| 38 | +file and (b) the object successfully inflates to a stream of bytes that |
| 39 | +forms a sequence of <ascii tag without space> + <space> + <ascii decimal |
| 40 | +size> + <byte\0> + <binary object data>. |
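The object layout and the two validation checks described above can be sketched in Python. This is only an illustration, not the project's own C code: the function names `encode_object` and `validate_object` are hypothetical, and the check that the decimal size equals the length of the data is an assumption about what the size field records.

```python
import hashlib
import zlib


def encode_object(tag, data):
    """Return (name, stored_bytes) for an object with the given ASCII tag."""
    # <ascii tag without space> + <space> + <ascii decimal size> + <byte\0> + <data>
    raw = tag + b" " + str(len(data)).encode("ascii") + b"\0" + data
    stored = zlib.compress(raw)
    # Following the text, the name is the SHA1 of the *compressed* bytes.
    name = hashlib.sha1(stored).hexdigest()
    return name, stored


def validate_object(name, stored):
    """Check (a) the hash matches and (b) the inflated bytes are well formed."""
    if hashlib.sha1(stored).hexdigest() != name:
        return False
    try:
        raw = zlib.decompress(stored)
    except zlib.error:
        return False
    header, nul, data = raw.partition(b"\0")
    if not nul:
        return False
    tag, space, size = header.partition(b" ")
    if not tag or not space or not size.isdigit():
        return False
    # Assumption: the size field is the byte length of the object data.
    return int(size) == len(data)
```

Note that neither check needs to know what kind of object it is looking at, which is the point the paragraph above makes.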
| 41 | + |
| 42 | +BLOB: A "blob" object is nothing but a binary blob of data, and doesn't |
| 43 | +refer to anything else. There is no signature or any other verification |
| 44 | +of the data, so while the object is consistent (it _is_ indexed by its |
| 45 | +sha1 hash, so the data itself is certainly correct), it has absolutely |
| 46 | +no other attributes. No name associations, no permissions. It is |
| 47 | +purely a blob of data (ie normally "file contents"). |
| 48 | + |
| 49 | +TREE: The next hierarchical object type is the "tree" object. A tree |
| 50 | +object is a list of permission/name/blob data, sorted by name. In other |
| 51 | +words the tree object is uniquely determined by the set contents, and so |
| 52 | +two separate but identical trees will always share the exact same |
| 53 | +object. |
| 54 | + |
| 55 | +Again, a "tree" object is just a pure data abstraction: it has no |
| 56 | +history, no signatures, no verification of validity, except that the |
| 57 | +contents are again protected by the hash itself. So you can trust the |
| 58 | +contents of a tree, the same way you can trust the contents of a blob, |
| 59 | +but you don't know where those contents _came_ from. |
| 60 | + |
| 61 | +Side note on trees: since a "tree" object is a sorted list of |
| 62 | +"filename+content", you can create a diff between two trees without |
| 63 | +actually having to unpack two trees. Just ignore all common parts, and |
| 64 | +your diff will look right. In other words, you can effectively (and |
| 65 | +efficiently) tell the difference between any two random trees in O(n) |
| 66 | +where "n" is the size of the difference, rather than the size of the |
| 67 | +tree. |
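As a rough illustration of the side note above, two name-sorted entry lists can be compared with a single merge walk that never opens a blob. The `(name, mode, sha1)` tuple layout and the function name `diff_trees` are assumptions made for this sketch. The walk is linear in the number of entries, so it does not by itself demonstrate the stronger "size of the difference" bound claimed above.

```python
def diff_trees(old, new):
    """Yield (status, name) for two lists of (name, mode, sha1) sorted by name."""
    i = j = 0
    while i < len(old) and j < len(new):
        (oname, omode, osha), (nname, nmode, nsha) = old[i], new[j]
        if oname == nname:
            # Same name: only report it if mode or blob name differs.
            if (omode, osha) != (nmode, nsha):
                yield ("modified", oname)
            i += 1
            j += 1
        elif oname < nname:
            yield ("removed", oname)
            i += 1
        else:
            yield ("added", nname)
            j += 1
    # Whatever is left on either side exists in only one tree.
    for name, _, _ in old[i:]:
        yield ("removed", name)
    for name, _, _ in new[j:]:
        yield ("added", name)
```

Common entries are skipped by comparing names and hashes only, which is what "ignore all common parts" amounts to here.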
| 68 | + |
| 69 | +Side note 2 on trees: since the name of a "blob" depends entirely and |
| 70 | +exclusively on its contents (ie there are no names or permissions |
| 71 | +involved), you can see trivial renames or permission changes by noticing |
| 72 | +that the blob stayed the same. However, renames with data changes need |
| 73 | +a smarter "diff" implementation. |
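The trivial-rename case from the second side note can be sketched as follows: a path that disappears from one tree and a new path in the other are paired when they carry the same blob name. The `(name, mode, sha1)` tuples and the helper `find_renames` are hypothetical, and the sketch deliberately ignores renames combined with data changes, which the text says need a smarter diff.

```python
def find_renames(old, new):
    """Return (old_name, new_name) pairs whose blob name is unchanged."""
    old_names = {name for name, _, _ in old}
    new_names = {name for name, _, _ in new}
    # Blob name -> path, for paths that no longer exist in the new tree.
    removed = {sha: name for name, _, sha in old if name not in new_names}
    renames = []
    for name, _, sha in new:
        if name not in old_names and sha in removed:
            # Same content under a different name: treat it as a rename.
            renames.append((removed.pop(sha), name))
    return renames
```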
| 74 | + |
| 75 | +CHANGESET: The "changeset" object is an object that introduces the |
| 76 | +notion of history into the picture. In contrast to the other objects, |
| 77 | +it doesn't just describe the physical state of a tree, it describes how |
| 78 | +we got there, and why. |
| 79 | + |
| 80 | +A "changeset" is defined by the tree-object that it results in, the |
| 81 | +parent changesets (zero, one or more) that led up to that point, and a |
| 82 | +comment on what happened. Again, a changeset is not trusted per se: |
| 83 | +the contents are well-defined and "safe" due to the cryptographically |
| 84 | +strong signatures at all levels, but there is no reason to believe that |
| 85 | +the tree is "good" or that the merge information makes sense. The |
| 86 | +parents do not have to actually have any relationship with the result, |
| 87 | +for example. |
| 88 | + |
| 89 | +Note on changesets: unlike real SCM's, changesets do not contain rename |
| 90 | +information or file mode change information. All of that is implicit in |
| 91 | +the trees involved (the result tree, and the result trees of the |
| 92 | +parents), and describing that makes no sense in this idiotic file |
| 93 | +manager. |
| 94 | + |
| 95 | +TRUST: The notion of "trust" is really outside the scope of "git", but |
| 96 | +it's worth noting a few things. First off, since everything is hashed |
| 97 | +with SHA1, you _can_ trust that an object is intact and has not been |
| 98 | +messed with by external sources. So the name of an object uniquely |
| 99 | +identifies a known state - just not a state that you may want to trust. |
| 100 | + |
| 101 | +Furthermore, since the SHA1 signature of a changeset refers to the |
| 102 | +SHA1 signatures of the tree it is associated with and the signatures |
| 103 | +of the parent, a single named changeset specifies uniquely a whole |
| 104 | +set of history, with full contents. You can't later fake any step of |
| 105 | +the way once you have the name of a changeset. |
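A small sketch can make this chaining concrete. The text layout used for a changeset below (one "tree" line, one "parent" line per parent, a blank line, then the comment) is a hypothetical serialization chosen for illustration, as are the helper names `name_of` and `changeset` and the stand-in tree contents.

```python
import hashlib
import zlib


def name_of(tag, data):
    """Name an object by the SHA1 of its compressed, tagged form."""
    raw = tag + b" " + str(len(data)).encode("ascii") + b"\0" + data
    return hashlib.sha1(zlib.compress(raw)).hexdigest()


def changeset(tree, parents, comment):
    """Return the name of a changeset (hypothetical text layout)."""
    lines = ["tree " + tree] + ["parent " + p for p in parents] + ["", comment]
    return name_of(b"changeset", "\n".join(lines).encode("utf-8"))


# Stand-ins for real tree names.
tree_v1 = name_of(b"tree", b"contents v1")
tree_v2 = name_of(b"tree", b"contents v2")

root = changeset(tree_v1, [], "initial")
tip = changeset(tree_v2, [root], "second")

# Tampering with the first step changes its name, and therefore the
# name of every changeset that refers to it.
forged_root = changeset(name_of(b"tree", b"tampered"), [], "initial")
forged_tip = changeset(tree_v2, [forged_root], "second")
```

Because `tip` is computed from `root`, anyone holding the name `tip` can detect the forged history: `forged_tip` is a different name even though its own tree and comment are unchanged.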
| 106 | + |
| 107 | +So to introduce some real trust in the system, the only thing you need |
| 108 | +to do is to digitally sign just _one_ special note, which includes the |
| 109 | +name of a top-level changeset. Your digital signature shows others that |
| 110 | +you trust that changeset, and the immutability of the history of |
| 111 | +changesets tells others that they can trust the whole history. |
| 112 | + |
| 113 | +In other words, you can easily validate a whole archive by just sending |
| 114 | +out a single email that tells the people the name (SHA1 hash) of the top |
| 115 | +changeset, and digitally sign that email using something like GPG/PGP. |
| 116 | + |
| 117 | +In particular, you can also have a separate archive of "trust points" or |
| 118 | +tags, which document your (and other people's) trust. You may, of |
| 119 | +course, archive these "certificates of trust" using "git" itself, but |
| 120 | +it's not something "git" does for you. |
| 121 | + |
| 122 | +Another way of saying the same thing: "git" itself only handles content |
| 123 | +integrity, the trust has to come from outside. |
| 124 | + |
| 125 | + Current Directory Cache (".dircache/index") |
| 126 | + |
| 127 | +The "current directory cache" is a simple binary file, which contains an |
| 128 | +efficient representation of a virtual directory content at some random |
| 129 | +time. It does so by a simple array that associates a set of names, |
| 130 | +dates, permissions and content (aka "blob") objects together. The cache |
| 131 | +is always kept ordered by name, and names are unique at any point in |
| 132 | +time, but the cache has no long-term meaning, and can be partially |
| 133 | +updated at any time. |
| 134 | + |
| 135 | +In particular, the "current directory cache" certainly does not need to |
| 136 | +be consistent with the current directory contents, but it has two very |
| 137 | +important attributes: |
| 138 | + |
| 139 | + (a) it can re-generate the full state it caches (not just the directory |
| 140 | + structure: through the "blob" object it can regenerate the data too) |
| 141 | + |
| 142 | + As a special case, there is a clear and unambiguous one-way mapping |
| 143 | + from a current directory cache to a "tree object", which can be |
| 144 | + efficiently created from just the current directory cache without |
| 145 | + actually looking at any other data. So a directory cache at any |
| 146 | + one time uniquely specifies one and only one "tree" object (but |
| 147 | + has additional data to make it easy to match up that tree object |
| 148 | + with what has happened in the directory) |
| 149 | + |
| 150 | + |
| 151 | +and |
| 152 | + |
| 153 | + (b) it has efficient methods for finding inconsistencies between that |
| 154 | + cached state ("tree object waiting to be instantiated") and the |
| 155 | + current state. |
| 156 | + |
| 157 | +Those are the two ONLY things that the directory cache does. It's a |
| 158 | +cache, and the normal operation is to re-generate it completely from a |
| 159 | +known tree object, or update/compare it with a live tree that is being |
| 160 | +developed. If you blow the directory cache away entirely, you haven't |
| 161 | +lost any information as long as you have the name of the tree that it |
| 162 | +described. |
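Point (b) above, finding inconsistencies between the cached state and the current directory, might look roughly like this. The entry fields (`name`, `mtime_ns`, `mode`, `blob`) and both function names are assumptions for the sketch; the text only says the cache associates names, dates, permissions and blob objects, not how they are stored or compared.

```python
import os
from collections import namedtuple

# One cache entry: a name plus the date, permissions and blob it maps to.
CacheEntry = namedtuple("CacheEntry", "name mtime_ns mode blob")


def cache_entry(root, name, blob):
    """Record the current stat data for a file alongside its blob name."""
    st = os.stat(os.path.join(root, name))
    return CacheEntry(name, st.st_mtime_ns, st.st_mode, blob)


def find_inconsistencies(root, cache):
    """Compare cached stat data with the live directory, without reading files."""
    changed = []
    for entry in cache:
        try:
            st = os.stat(os.path.join(root, entry.name))
        except FileNotFoundError:
            changed.append((entry.name, "missing"))
            continue
        if (st.st_mtime_ns, st.st_mode) != (entry.mtime_ns, entry.mode):
            changed.append((entry.name, "changed"))
    return changed
```

The comparison uses only stat data, which is why it can be cheap: file contents are never read unless an entry is flagged.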
| 163 | + |
| 164 | +(But directory caches can also have real information in them: in |
| 165 | +particular, they can have the representation of an intermediate tree |
| 166 | +that has not yet been instantiated. So they do have meaning and usage |
| 167 | +outside of caching - in one sense you can think of the current directory |
| 168 | +cache as being the "work in progress" towards a tree commit). |