Integrate git-blame-tree(1) into Gitaly
When GitLab shows the tree of files you get something like this:
[Screenshots: the file tree "Before" and "After"]
Current state
At this point GitLab knows the tree of files, but not yet the commit that last touched each file. To load this information, it calls the ListLastCommitsForTree RPC.
Internally, this RPC handler runs git-ls-tree(1) to get the list of files in the tree. It then calls log.LastCommitForPath() once for each file. Each call returns a *catfile.Commit for that path, which is filled into the ListLastCommitsForTreeResponse.
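The flow described above can be sketched roughly as follows. This is a simplified illustration, not Gitaly's actual code: the type lastCommitFunc is a hypothetical stand-in for log.LastCommitForPath(), and a commit is represented only by its ID to keep the sketch self-contained. The point it shows is that the handler performs one lookup per tree entry.

```go
package main

// lastCommitFunc is a hypothetical stand-in for log.LastCommitForPath():
// it resolves the ID of the commit that last touched a single path.
type lastCommitFunc func(revision, path string) (string, error)

// lastCommitsPerPath mirrors the current approach: it performs one lookup
// per path and collects the results keyed by path.
func lastCommitsPerPath(revision string, paths []string, lookup lastCommitFunc) (map[string]string, error) {
	commits := make(map[string]string, len(paths))
	for _, path := range paths {
		id, err := lookup(revision, path)
		if err != nil {
			return nil, err
		}
		commits[path] = id
	}
	return commits, nil
}
```

For a tree page with N entries this means N separate history walks, which is the cost a single git-blame-tree(1) call is meant to avoid.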
With git-blame-tree
Once git-blame-tree(1) is available in Git, we could avoid calling log.LastCommitForPath() for each path separately and instead get the information for all files in the tree at once. To be able to use git-blame-tree(1), we'd need it to return a full GitCommit:
message GitCommit {
// id ...
string id = 1;
// subject ...
bytes subject = 2;
// body ...
bytes body = 3;
// author ...
CommitAuthor author = 4;
// committer ...
CommitAuthor committer = 5;
// parent_ids ...
repeated string parent_ids = 6;
// body_size is the size of the commit body. If body exceeds a certain threshold,
// it will be nullified, but its size will be set in body_size so we can know if
// a commit had a body in the first place.
int64 body_size = 7;
// signature_type ...
SignatureType signature_type = 8;
// tree_id is the object ID of the tree. The tree ID will always be filled, even
// if the tree is empty. In that case the value will be `4b825dc642cb6eb9a060e54bf8d69288fbee4904`.
// That value is equivalent to `git hash-object -t tree /dev/null`.
string tree_id = 9;
// trailers is the list of Git trailers (https://git-scm.com/docs/git-interpret-trailers)
// found in this commit's message. The number of trailers and their key/value
// sizes are limited. If a trailer exceeds these size limits, it and any
// trailers that follow it are not included.
repeated CommitTrailer trailers = 10;
// short_stats are the git stats including additions, deletions and changed_files,
// they are only set when `include_shortstat == true`.
CommitStatInfo short_stats = 11;
// referenced_by contains fully-qualified reference names (e.g refs/heads/main)
// that point to the commit.
repeated bytes referenced_by = 12; // protolint:disable:this REPEATED_FIELD_NAMES_PLURALIZED
// encoding is the encoding of the commit message. This field will only be present if
// `i18n.commitEncoding` was set to a value other than "UTF-8" at the time
// this commit was made.
// See: https://git-scm.com/docs/git-commit#_discussion
string encoding = 13;
}
The proposed git-blame-tree implementation would only return the commit SHA, so extending it to return full commit details would be beneficial: it would avoid another Git call (git-show) for each commit.
Pagination
At the moment the ListLastCommitsForTreeRequest has an offset and a limit field. Based on these, the handler takes the output of git-ls-tree(1) and keeps only the subset of paths within that range. It's inefficient to first get the full list of entries in a tree only to use a small subset of it.
To use git-blame-tree we'll probably need to do the same. However, we need to avoid having git-blame-tree blame every file in the tree when we only use a subset of them. Therefore we'd still need to call git-ls-tree first, get the subset of paths we want the commit information for, and do something like: git blame-tree <rev> -- <path-0> <path-1> <path-2> ...
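A minimal sketch of that approach, assuming the paths from git-ls-tree are already available as a slice. The helper names pagePaths and blameTreeArgs are hypothetical, and treating a non-positive limit as an empty page is an assumption of this sketch, not necessarily what the current handler does. The subcommand and argument layout follow the invocation proposed in this issue, so the final name and options may differ.

```go
package main

// pagePaths returns the window of paths selected by offset and limit,
// clamping both ends to the bounds of the slice.
// Assumption: a non-positive limit yields an empty page.
func pagePaths(paths []string, offset, limit int) []string {
	if offset < 0 {
		offset = 0
	}
	if offset >= len(paths) || limit <= 0 {
		return nil
	}
	end := len(paths)
	if limit < end-offset {
		end = offset + limit
	}
	return paths[offset:end]
}

// blameTreeArgs builds the argument list for the proposed invocation
// `git blame-tree <rev> -- <path>...`, restricted to the given paths.
func blameTreeArgs(revision string, paths []string) []string {
	args := []string{"blame-tree", revision, "--"}
	return append(args, paths...)
}
```

With this split, only the paths on the requested page are passed to git-blame-tree, at the cost of still listing the whole tree first.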
Future optimization
Because the UI already knows which files it needs the last commit for, I wonder if we can change the interface of ListLastCommitsForTree to have the caller pass the list of paths it needs the last commit for. (The name ListLastCommitsForTree might no longer fit in that case, so it might be better to have a separate RPC named ListLastCommitsForPaths.)