summaryrefslogtreecommitdiffstatshomepage
path: root/doc/design/data-model.md
diff options
context:
space:
mode:
authorsudoforge <no-reply@sudoforge.com>2025-04-08 03:49:06 -0700
committersudoforge <no-reply@sudoforge.com>2025-04-19 02:11:08 -0700
commit4b9583c5379121a65bfb3b0866b7bfd593d8f9f1 (patch)
tree4a095676de2c4934f9eda6d18919255489a2217f /doc/design/data-model.md
parent52f42daf8bf317a065277da50d916c90e26fe4dd (diff)
downloadgit-bug-4b9583c5379121a65bfb3b0866b7bfd593d8f9f1.tar.gz
git-bug-4b9583c5379121a65bfb3b0866b7bfd593d8f9f1.zip
docs: update install, contrib, and usage documentationI6f4b5f5ca73780cecf66a553cce80aa3f75df2ce
This change refactors documentation, especially //:README.md, //:CONTRIBUTING.md, and centralized most of the rest of the contributing and usage documentation into `//doc`. Documentation changes focus on cleaning up erroneous language, reformatting, and restructuring docs in order to reduce visual noise - the goals are to enable users and contributors to find what they're looking for much more quickly, reduce (sometimes erroneous) duplication, and present a structured approach for where to put docs. Additionally: - Added CI pipeline badge for the `trunk` workflow in `//:README.md` - Converted embedded emoji characters to GitHub's emoji syntax - Moved most sections with image and installation instructions into `<details>` elements - Replaced references to gitter with matrix - Introduced `treefmt-nix` to `devShells.default` in the flake, and all non-generated documentation was formatted with `nix fmt` (running `mdformat` under the hood) - Regenerated //doc/md and //doc/man and refactored inline docs from various pages that were iterating out command line details so that they simply point to //doc/md instead - Rewrote contributing docs to focus on using `nix`, to ensure that the development shell is used. This enforces consistency across different platforms with the revision and configuration of each tool we use, and simplifies the onboarding story for contributors -- they only need to install and configure nix (and optionally direnv). A dev container will be provided in the future [0] as an alternative for users on Windows (who can use WSL) or want to avoid installing `nix`. [0]: https://github.com/git-bug/git-bug/issues/1364 Closes: #1212 #1276 #1330 #1357 Change-Id: I6f4b5f5ca73780cecf66a553cce80aa3f75df2ce
Diffstat (limited to 'doc/design/data-model.md')
-rw-r--r--doc/design/data-model.md215
1 files changed, 215 insertions, 0 deletions
diff --git a/doc/design/data-model.md b/doc/design/data-model.md
new file mode 100644
index 00000000..0553ed25
--- /dev/null
+++ b/doc/design/data-model.md
@@ -0,0 +1,215 @@
+# git-bug's reusable entity data model
+
+This document explains how git-bug's reusable distributed data structure in git
+is working. This data structure is capable of:
+
+- storing an entity (bug, pull-request, config...) and its complete history in
+ git
+- carry signed authorship of editions
+- use git remotes as a medium for synchronisation and collaboration
+- merge conflicts
+- respect the rules you define as to what edition are possible
+- carry attached media
+
+If you are looking for a different writing format or to see how you can easily
+make your own, checkout [the example code](../../entity/dag/example_test.go).
+
+If you are not familiar with
+[git internals](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects), you
+might first want to read about them, as the `git-bug` data model is built on top
+of them.
+
+## Entities (bug, author, ...) are a series of edit operations
+
+As entities are stored and edited in multiple processes at the same time, it's
+not possible to store the current state like it would be done in a normal
+application. If two processes change the same entity and later try to merge the
+states, we wouldn't know which change takes precedence or how to merge those
+states.
+
+To deal with this problem, you need a way to merge these changes in a meaningful
+way. Instead of storing the final bug data directly, we store a series of edit
+`Operation`s. This is a common idea, notably with
+[Operation-based CRDTs](https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#Operation-based_CRDTs).
+
+![ordered operations](../assets/operations.png)
+
+To get the final state of an entity, we apply these `Operation`s in the correct
+order on an empty state, to compute (aka "compile") our view.
+
+## Entities are stored in git objects
+
+An `Operation` is a piece of data, including:
+
+- a type identifier
+- an author (a reference to another entity)
+- a timestamp (there is also one or two [Lamport time](#time-is-unreliable))
+- all the data required by that operation type (a message, a status ...)
+- a random [nonce](https://en.wikipedia.org/wiki/Cryptographic_nonce) to ensure
+ we have enough entropy, as the operation identifier is a hash of that data
+ (more on that later)
+
+These `Operation`s are aggregated in an `OperationPack`, a simple array. An
+`OperationPack` represents an edit session of the entity. As the operation's
+author is the same for all the `OperationPack` we only store it once.
+
+We store this pack in git as a git `Blob`; that consists of a string containing
+a JSON array of operations. One such pack -- here with two operations -- might
+look like this:
+
+```json
+{
+ "author": {
+ "id": "04bf6c1a69bb8e9679644874c85f82e337b40d92df9d8d4176f1c5e5c6627058"
+ },
+ "ops": [
+ {
+ "type": 3,
+ "timestamp": 1647377254,
+ "nonce": "SRQwUWTJCXAmQBIS+1ctKgOcbF0=",
+ "message": "Adding a comment",
+ "files": null
+ },
+ {
+ "type": 4,
+ "timestamp": 1647377257,
+ "nonce": "la/HaRPMvD77/cJSJOUzKWuJdY8=",
+ "status": 1
+ }
+ ]
+}
+```
+
+To reference our `OperationPack`, we create a git `Tree`; it references our
+`OperationPack` `Blob` under `"/ops"`. If any edit operation includes a media
+(for instance in a text message), we can store that media as a `Blob` and
+reference it here under `"/media"`.
+
+To complete the picture, we create a git `Commit` that references our `Tree`.
+Each time we add more `Operation`s to our bug, we add a new `Commit` with the
+same data-structure to form a chain of `Commit`s.
+
+This chain of `Commit`s is made available as a git `Reference` under
+`refs/<namespace>/<id>`. We can later use this reference to push our data to a
+git remote. As git will push any data needed as well, everything will be pushed
+to the remote, including the media.
+
+Here is the complete picture:
+
+![git graph of a simple bug](../assets/bug-graph.png)
+
+## Time is unreliable
+
+Before being able to merge conflicts, let's start with some building blocks.
+
+It would be very tempting to use the `Operation`'s timestamp to give us the
+order to compile the final state. However, you can't rely on the time provided
+by other people (their clock might be off) for anything other than just display.
+This is a fundamental limitation of distributed system, and even more so when
+actors might want to game the system.
+
+Instead, we are going to use
+[Lamport logical clock](https://en.wikipedia.org/wiki/Lamport_timestamps). A
+Lamport clock is a simple counter of events. This logical clock gives us a
+partial ordering:
+
+- if L1 \< L2, L1 happened before L2
+- if L1 > L2, L1 happened after L2
+- if L1 == L2, we can't tell which happened first: it's a concurrent edition
+
+Each time we are appending something to the data (create an `Entity`, add an
+`Operation`) a logical time will be attached, with the highest time value we are
+aware of, plus one. This declares a causality in the events and allows ordering
+entities and operations.
+
+The first commit of an `Entity` will have both a creation time and edit time
+clock, while a later commit will only have an edit time clock. These clocks
+value are serialized directly in the `Tree` entry name (for example:
+`"create-clock-4"`). As a `Tree` entry needs to reference something, we
+reference the git `Blob` with an empty content. As all of these entries will
+reference the same `Blob`, no network transfer is needed as long as you already
+have any entity in your repository.
+
+Example of a `Tree` of the first commit of an entity:
+
+```
+100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 create-clock-14
+100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 edit-clock-137
+100644 blob a020a85baa788e12699a4d83dd735578f0d78c75 ops
+```
+
+Example of a `Tree` of a later commit of an entity:
+
+```
+100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 edit-clock-154
+100644 blob 68383346c1a9503f28eec888efd300e9fc179ca0 ops
+```
+
+## Entities and Operation's ID
+
+`Operation`s can be referenced - in the data model or by users - with an
+identifier. This identifier is computed from the `Operation`'s data itself, with
+a hash of that data: `id = hash(json(op))`
+
+For entities, `git-bug` uses as identifier the hash of the first `Operation` of
+the entity, as serialized on disk.
+
+The same way as git does, this hash is displayed truncated to a 7 characters
+string to a human user. Note that when specifying a bug id in a command, you can
+enter as few characters as you want, as long as there is no ambiguity. If
+multiple entities match your prefix, `git-bug` will complain and display the
+potential matches.
+
+## Entities support conflict resolution
+
+Now that we have all that, we can finally merge our entities without conflict,
+and collaborate with other users. Let's start by getting rid of two simple
+scenarios:
+
+- if we simply pull updates, we move forward our local reference. We get an
+ update of our graph that we read as usual.
+- if we push fast-forward updates, we move forward the remote reference and
+ other users can update their reference as well.
+
+The tricky part happens when we have concurrent editions. If we pull updates
+while we have local changes (non-straightforward in git term), `git-bug` creates
+the equivalent of a merge commit to merge both branches into a DAG. This DAG has
+a single root containing the first operation, but can have branches that get
+merged back into a single head pointed by the reference.
+
+As we don't have a purely linear series of commits/`Operations`s, we need a
+deterministic ordering to always apply operations in the same order.
+
+`git-bug` applies the following algorithm:
+
+1. load and read all the commits and the associated `OperationPack`s
+2. make sure that the Lamport clocks respect the DAG structure: a parent
+ commit/`OperationPack` (that is, towards the head) cannot have a clock that
+ is higher or equal than its direct child. If such a problem happens, the
+ commit is refused/discarded.
+3. individual `Operation`s are assembled together and ordered given the
+ following priorities:
+ 1. the edition's lamport clock if not concurrent
+ 2. the lexicographic order of the `OperationPack`'s identifier
+
+Step 2 is providing and enforcing a constraint over the `Operation`'s logical
+clocks. What that means, is that **we inherit the implicit ordering given by the
+DAG**. Later, logical clocks refine that ordering. This - coupled with signed
+commits - has the nice property of limiting how this data model can be abused.
+
+Here is an example of such an ordering:
+
+![merge scenario 1](../assets/merge-1.png)
+
+We can see that:
+
+- Lamport clocks respect the DAG structure
+- the final `Operation` order is \[A,B,C,D,E,F\], according to those clocks
+
+When we have concurrent editions, we apply a secondary ordering, based on the
+`OperationPack`'s identifier:
+
+![merge scenario 2](../assets/merge-2.png)
+
+This secondary ordering doesn't carry much meaning, but it's unbiased and hard
+to abuse.