Commit e83c516

Initial revision of "git", the information manager from hell

Author: Linus Torvalds
0 parents, commit e83c516

11 files changed, 1244 additions and 0 deletions

Makefile (+40)
@@ -0,0 +1,40 @@
CFLAGS=-g
CC=gcc

PROG=update-cache show-diff init-db write-tree read-tree commit-tree cat-file

all: $(PROG)

install: $(PROG)
	install $(PROG) $(HOME)/bin/

LIBS= -lssl

init-db: init-db.o

update-cache: update-cache.o read-cache.o
	$(CC) $(CFLAGS) -o update-cache update-cache.o read-cache.o $(LIBS)

show-diff: show-diff.o read-cache.o
	$(CC) $(CFLAGS) -o show-diff show-diff.o read-cache.o $(LIBS)

write-tree: write-tree.o read-cache.o
	$(CC) $(CFLAGS) -o write-tree write-tree.o read-cache.o $(LIBS)

read-tree: read-tree.o read-cache.o
	$(CC) $(CFLAGS) -o read-tree read-tree.o read-cache.o $(LIBS)

commit-tree: commit-tree.o read-cache.o
	$(CC) $(CFLAGS) -o commit-tree commit-tree.o read-cache.o $(LIBS)

cat-file: cat-file.o read-cache.o
	$(CC) $(CFLAGS) -o cat-file cat-file.o read-cache.o $(LIBS)

read-cache.o: cache.h
show-diff.o: cache.h

clean:
	rm -f *.o $(PROG) temp_git_file_*

backup: clean
	cd .. ; tar czvf dircache.tar.gz dir-cache

README (+168)
@@ -0,0 +1,168 @@
GIT - the stupid content tracker

"git" can mean anything, depending on your mood.

 - random three-letter combination that is pronounceable, and not
   actually used by any common UNIX command. The fact that it is a
   mispronunciation of "get" may or may not be relevant.
 - stupid. contemptible and despicable. simple. Take your pick from the
   dictionary of slang.
 - "global information tracker": you're in a good mood, and it actually
   works for you. Angels sing, and a light suddenly fills the room.
 - "goddamn idiotic truckload of sh*t": when it breaks

This is a stupid (but extremely fast) directory content manager. It
doesn't do a whole lot, but what it _does_ do is track directory
contents efficiently.

There are two object abstractions: the "object database", and the
"current directory cache".

The Object Database (SHA1_FILE_DIRECTORY)

The object database is literally just a content-addressable collection
of objects. All objects are named by their content, which is
approximated by the SHA1 hash of the object itself. Objects may refer
to other objects (by referencing their SHA1 hash), and so you can build
up a hierarchy of objects.

There are several kinds of objects in the content-addressable collection
database. They are all deflated with zlib, and start off with a tag
of their type, and size information about the data. The SHA1 hash is
always the hash of the _compressed_ object, not the original one.

In particular, the consistency of an object can always be tested
independently of the contents or the type of the object: all objects can
be validated by verifying that (a) their hashes match the content of the
file and (b) the object successfully inflates to a stream of bytes that
forms a sequence of <ascii tag without space> + <space> + <ascii decimal
size> + <byte\0> + <binary object data>.
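The layout described above (tag, space, decimal size, NUL byte, data) can be sketched in C as follows. This is a minimal illustration of the inflated byte stream only: zlib compression and SHA1 hashing are left out, and `format_object` and `parse_object` are hypothetical helpers, not functions from this commit.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: write "<tag> <size>\0<data>" into out.
 * Returns the total number of bytes written, or 0 if out is too small.
 */
size_t format_object(unsigned char *out, size_t outlen,
		     const char *tag, const void *data, size_t size)
{
	int hdr;

	assert(out != NULL && tag != NULL);
	hdr = snprintf((char *)out, outlen, "%s %lu", tag, (unsigned long)size);
	if (hdr < 0 || (size_t)hdr + 1 + size > outlen)
		return 0;
	/* snprintf already stored the terminating NUL at out[hdr] */
	memcpy(out + hdr + 1, data, size);
	return (size_t)hdr + 1 + size;
}

/*
 * Hypothetical helper: check that buf has the documented layout.
 * On success, copies the tag, stores the size and returns the offset
 * of the object data. Returns 0 if the stream is malformed.
 */
size_t parse_object(const unsigned char *buf, size_t len,
		    char *tag, size_t taglen, unsigned long *size)
{
	const unsigned char *nul = memchr(buf, '\0', len);
	const unsigned char *sp, *p;
	size_t tlen, hdrlen;

	if (!nul)
		return 0;
	sp = memchr(buf, ' ', (size_t)(nul - buf));
	if (!sp || sp == buf || sp + 1 == nul)
		return 0;		/* missing tag or missing size */
	tlen = (size_t)(sp - buf);
	if (tlen + 1 > taglen)
		return 0;
	for (p = sp + 1; p < nul; p++)
		if (*p < '0' || *p > '9')
			return 0;	/* size must be ASCII decimal */
	*size = strtoul((const char *)sp + 1, NULL, 10);
	hdrlen = (size_t)(nul - buf) + 1;
	if (len - hdrlen != *size)
		return 0;		/* declared size must match the data */
	memcpy(tag, buf, tlen);
	tag[tlen] = '\0';
	return hdrlen;
}
```

For example, formatting the five bytes "hello" with the tag "blob" yields the 12-byte stream `blob 5\0hello`, and parsing that stream reports the data starting at offset 7.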

BLOB: A "blob" object is nothing but a binary blob of data, and doesn't
refer to anything else. There is no signature or any other verification
of the data, so while the object is consistent (it _is_ indexed by its
sha1 hash, so the data itself is certainly correct), it has absolutely
no other attributes. No name associations, no permissions. It is
purely a blob of data (ie normally "file contents").

TREE: The next hierarchical object type is the "tree" object. A tree
object is a list of permission/name/blob data, sorted by name. In other
words the tree object is uniquely determined by the set contents, and so
two separate but identical trees will always share the exact same
object.

Again, a "tree" object is just a pure data abstraction: it has no
history, no signatures, no verification of validity, except that the
contents are again protected by the hash itself. So you can trust the
contents of a tree, the same way you can trust the contents of a blob,
but you don't know where those contents _came_ from.

Side note on trees: since a "tree" object is a sorted list of
"filename+content", you can create a diff between two trees without
actually having to unpack two trees. Just ignore all common parts, and
your diff will look right. In other words, you can effectively (and
efficiently) tell the difference between any two random trees by O(n)
where "n" is the size of the difference, rather than the size of the
tree.
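The "ignore all common parts" idea from the side note can be sketched as a single merge-style walk over two name-sorted entry lists. This is only an illustration under assumptions: `struct tree_entry`, `diff_trees` and `collect_status` are hypothetical names that do not appear in this commit, and entries are compared by name and SHA1 only. This flat walk still visits every entry once, so by itself it does not reach the cost proportional to the size of the difference that the README describes.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-memory form of one tree entry (permissions omitted). */
struct tree_entry {
	const char *name;
	unsigned char sha1[20];
};

typedef void (*diff_fn)(char status, const char *name, void *ctx);

/*
 * Walk two lists that are both sorted by name and report only the
 * entries that differ: '-' removed, '+' added, 'M' content changed.
 * Returns the number of differences reported.
 */
int diff_trees(const struct tree_entry *a, size_t na,
	       const struct tree_entry *b, size_t nb,
	       diff_fn report, void *ctx)
{
	size_t i = 0, j = 0;
	int changes = 0;

	assert(report != NULL);
	while (i < na || j < nb) {
		int cmp;

		if (i == na)
			cmp = 1;	/* only entries of b are left */
		else if (j == nb)
			cmp = -1;	/* only entries of a are left */
		else
			cmp = strcmp(a[i].name, b[j].name);

		if (cmp < 0) {
			report('-', a[i++].name, ctx);
			changes++;
		} else if (cmp > 0) {
			report('+', b[j++].name, ctx);
			changes++;
		} else {
			/* Same name: equal hashes mean equal content, skip it. */
			if (memcmp(a[i].sha1, b[j].sha1, 20) != 0) {
				report('M', a[i].name, ctx);
				changes++;
			}
			i++;
			j++;
		}
	}
	return changes;
}

/*
 * Example callback: append the status letter to a string buffer.
 * The caller must provide a NUL-terminated buffer with enough room.
 */
void collect_status(char status, const char *name, void *ctx)
{
	char *out = ctx;
	size_t n = strlen(out);

	(void)name;
	out[n] = status;
	out[n + 1] = '\0';
}
```

Because equal names with equal hashes are skipped without looking at any blob data, the comparison never has to unpack file contents.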

Side note 2 on trees: since the name of a "blob" depends entirely and
exclusively on its contents (ie there are no names or permissions
involved), you can see trivial renames or permission changes by noticing
that the blob stayed the same. However, renames with data changes need
a smarter "diff" implementation.

CHANGESET: The "changeset" object is an object that introduces the
notion of history into the picture. In contrast to the other objects,
it doesn't just describe the physical state of a tree, it describes how
we got there, and why.

A "changeset" is defined by the tree-object that it results in, the
parent changesets (zero, one or more) that led up to that point, and a
comment on what happened. Again, a changeset is not trusted per se:
the contents are well-defined and "safe" due to the cryptographically
strong signatures at all levels, but there is no reason to believe that
the tree is "good" or that the merge information makes sense. The
parents do not have to actually have any relationship with the result,
for example.

Note on changesets: unlike real SCM's, changesets do not contain rename
information or file mode change information. All of that is implicit in
the trees involved (the result tree, and the result trees of the
parents), and describing that makes no sense in this idiotic file
manager.

TRUST: The notion of "trust" is really outside the scope of "git", but
it's worth noting a few things. First off, since everything is hashed
with SHA1, you _can_ trust that an object is intact and has not been
messed with by external sources. So the name of an object uniquely
identifies a known state - just not a state that you may want to trust.

Furthermore, since the SHA1 signature of a changeset refers to the
SHA1 signatures of the tree it is associated with and the signatures
of the parent, a single named changeset specifies uniquely a whole
set of history, with full contents. You can't later fake any step of
the way once you have the name of a changeset.

So to introduce some real trust in the system, the only thing you need
to do is to digitally sign just _one_ special note, which includes the
name of a top-level changeset. Your digital signature shows others that
you trust that changeset, and the immutability of the history of
changesets tells others that they can trust the whole history.

In other words, you can easily validate a whole archive by just sending
out a single email that tells the people the name (SHA1 hash) of the top
changeset, and digitally sign that email using something like GPG/PGP.

In particular, you can also have a separate archive of "trust points" or
tags, which document your (and other people's) trust. You may, of
course, archive these "certificates of trust" using "git" itself, but
it's not something "git" does for you.

Another way of saying the same thing: "git" itself only handles content
integrity, the trust has to come from outside.

Current Directory Cache (".dircache/index")

The "current directory cache" is a simple binary file, which contains an
efficient representation of a virtual directory content at some random
time. It does so by a simple array that associates a set of names,
dates, permissions and content (aka "blob") objects together. The cache
is always kept ordered by name, and names are unique at any point in
time, but the cache has no long-term meaning, and can be partially
updated at any time.

In particular, the "current directory cache" certainly does not need to
be consistent with the current directory contents, but it has two very
important attributes:

 (a) it can re-generate the full state it caches (not just the directory
     structure: through the "blob" object it can regenerate the data too)

     As a special case, there is a clear and unambiguous one-way mapping
     from a current directory cache to a "tree object", which can be
     efficiently created from just the current directory cache without
     actually looking at any other data. So a directory cache at any
     one time uniquely specifies one and only one "tree" object (but
     has additional data to make it easy to match up that tree object
     with what has happened in the directory)

and

 (b) it has efficient methods for finding inconsistencies between that
     cached state ("tree object waiting to be instantiated") and the
     current state.

Those are the two ONLY things that the directory cache does. It's a
cache, and the normal operation is to re-generate it completely from a
known tree object, or update/compare it with a live tree that is being
developed. If you blow the directory cache away entirely, you haven't
lost any information as long as you have the name of the tree that it
described.

(But directory caches can also have real information in them: in
particular, they can have the representation of an intermediate tree
that has not yet been instantiated. So they do have meaning and usage
outside of caching - in one sense you can think of the current directory
cache as being the "work in progress" towards a tree commit).

cache.h (+93)
@@ -0,0 +1,93 @@
#ifndef CACHE_H
#define CACHE_H

#include <stdio.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <stdarg.h>
#include <errno.h>
#include <sys/mman.h>

#include <openssl/sha.h>
#include <zlib.h>

/*
 * Basic data structures for the directory cache
 *
 * NOTE NOTE NOTE! This is all in the native CPU byte format. It's
 * not even trying to be portable. It's trying to be efficient. It's
 * just a cache, after all.
 */

#define CACHE_SIGNATURE 0x44495243 /* "DIRC" */
struct cache_header {
	unsigned int signature;
	unsigned int version;
	unsigned int entries;
	unsigned char sha1[20];
};

/*
 * The "cache_time" is just the low 32 bits of the
 * time. It doesn't matter if it overflows - we only
 * check it for equality in the 32 bits we save.
 */
struct cache_time {
	unsigned int sec;
	unsigned int nsec;
};

/*
 * dev/ino/uid/gid/size are also just tracked to the low 32 bits
 * Again - this is just a (very strong in practice) heuristic that
 * the inode hasn't changed.
 */
struct cache_entry {
	struct cache_time ctime;
	struct cache_time mtime;
	unsigned int st_dev;
	unsigned int st_ino;
	unsigned int st_mode;
	unsigned int st_uid;
	unsigned int st_gid;
	unsigned int st_size;
	unsigned char sha1[20];
	unsigned short namelen;
	unsigned char name[0];
};

const char *sha1_file_directory;
struct cache_entry **active_cache;
unsigned int active_nr, active_alloc;

#define DB_ENVIRONMENT "SHA1_FILE_DIRECTORY"
#define DEFAULT_DB_ENVIRONMENT ".dircache/objects"

#define cache_entry_size(len) ((offsetof(struct cache_entry,name) + (len) + 8) & ~7)
#define ce_size(ce) cache_entry_size((ce)->namelen)

#define alloc_nr(x) (((x)+16)*3/2)

/* Initialize the cache information */
extern int read_cache(void);

/* Return a statically allocated filename matching the sha1 signature */
extern char *sha1_file_name(unsigned char *sha1);

/* Write a memory buffer out to the sha file */
extern int write_sha1_buffer(unsigned char *sha1, void *buf, unsigned int size);

/* Read and unpack a sha1 file into memory, write memory to a sha1 file */
extern void * read_sha1_file(unsigned char *sha1, char *type, unsigned long *size);
extern int write_sha1_file(char *buf, unsigned len);

/* Convert to/from hex/sha1 representation */
extern int get_sha1_hex(char *hex, unsigned char *sha1);
extern char *sha1_to_hex(unsigned char *sha1); /* static buffer! */

/* General helper functions */
extern void usage(const char *err);

#endif /* CACHE_H */
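The two sizing macros in cache.h are terse, so here is a small sketch of what they compute. The struct and macros are copied from the header above, except that the sketch uses a C99 flexible array member (`name[]`) in place of the GNU zero-length array; `entry_padding` is a hypothetical helper added only for illustration. `cache_entry_size` rounds each entry up to a multiple of 8 bytes while always leaving at least one spare byte after the name, and `alloc_nr` grows an allocation count by roughly one and a half times.

```c
#include <assert.h>
#include <stddef.h>

struct cache_time {
	unsigned int sec;
	unsigned int nsec;
};

struct cache_entry {
	struct cache_time ctime;
	struct cache_time mtime;
	unsigned int st_dev;
	unsigned int st_ino;
	unsigned int st_mode;
	unsigned int st_uid;
	unsigned int st_gid;
	unsigned int st_size;
	unsigned char sha1[20];
	unsigned short namelen;
	unsigned char name[];	/* flexible array member in this sketch */
};

#define cache_entry_size(len) ((offsetof(struct cache_entry,name) + (len) + 8) & ~7)
#define alloc_nr(x) (((x)+16)*3/2)

/*
 * Hypothetical helper: number of unused bytes that follow a name of
 * the given length inside its padded cache entry.
 */
size_t entry_padding(size_t namelen)
{
	size_t used = offsetof(struct cache_entry, name) + namelen;
	size_t total = cache_entry_size(namelen);

	/* Adding 8 before masking guarantees 1 to 8 bytes of padding. */
	assert(total % 8 == 0 && total > used);
	return total - used;
}
```

Starting from an empty array, repeated calls to `alloc_nr` give capacities of 24, 60 and 114 entries.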

cat-file.c (+23)
@@ -0,0 +1,23 @@
#include "cache.h"

int main(int argc, char **argv)
{
	unsigned char sha1[20];
	char type[20];
	void *buf;
	unsigned long size;
	char template[] = "temp_git_file_XXXXXX";
	int fd;

	if (argc != 2 || get_sha1_hex(argv[1], sha1))
		usage("cat-file: cat-file <sha1>");
	buf = read_sha1_file(sha1, type, &size);
	if (!buf)
		exit(1);
	fd = mkstemp(template);
	if (fd < 0)
		usage("unable to create tempfile");
	if (write(fd, buf, size) != size)
		strcpy(type, "bad");
	printf("%s: %s\n", template, type);
}
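cat-file takes its object name as hex text and relies on `get_sha1_hex`, which cache.h declares but whose body is not part of this excerpt. A minimal sketch of such a conversion might look like the following. The names `hex_to_sha1` and `sha1_to_hex_buf` are hypothetical, and the strict "exactly 40 hex digits" check is an assumption of this sketch rather than documented behavior of the real function. Like the call in cat-file.c expects, a non-zero return means failure.

```c
#include <assert.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit (including NUL). */
static int hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Convert exactly 40 hex digits into 20 raw bytes. Returns 0 on success. */
int hex_to_sha1(const char *hex, unsigned char *sha1)
{
	int i;

	assert(hex != NULL && sha1 != NULL);
	for (i = 0; i < 20; i++) {
		int hi = hexval(hex[2 * i]);
		int lo;

		if (hi < 0)
			return -1;	/* also stops at an early NUL */
		lo = hexval(hex[2 * i + 1]);
		if (lo < 0)
			return -1;
		sha1[i] = (unsigned char)((hi << 4) | lo);
	}
	return hex[40] == '\0' ? 0 : -1;
}

/* Convert 20 raw bytes into 40 lowercase hex digits plus a NUL. */
void sha1_to_hex_buf(const unsigned char *sha1, char out[41])
{
	static const char digits[] = "0123456789abcdef";
	int i;

	for (i = 0; i < 20; i++) {
		out[2 * i] = digits[sha1[i] >> 4];
		out[2 * i + 1] = digits[sha1[i] & 0xf];
	}
	out[40] = '\0';
}
```

Unlike the `sha1_to_hex` declared in cache.h, which is noted as returning a static buffer, this sketch writes into a caller-supplied buffer so that two results can be held at once.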
