Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PATH WALK IV: Add 'git survey' command #1821

Open
wants to merge 1,560 commits into
base: api-upstream
Choose a base branch
from

Conversation

derrickstolee
Copy link

WIP

Thanks, -Stolee

gitster and others added 14 commits February 10, 2025 10:18
"git init" to reinitialize a repository that already exists cannot
change the hash function and ref backends; such a request is
silently ignored now.

* ps/setup-reinit-fixes:
  setup: fix reinit of repos with incompatible GIT_DEFAULT_HASH
  setup: fix reinit of repos with incompatible GIT_DEFAULT_REF_FORMAT
  t0001: remove duplicate test
"git apply" internally uses unsigned long for line numbers and uses
strtoul() to parse numbers on the hunk headers.  It however forgot
to check parse errors.

* pw/apply-ulong-overflow-check:
  apply: detect overflow when parsing hunk header
Two CI tasks, whitespace check and style check, work on the
difference from the base version and the version being checked, but
the base was computed incorrectly in GitLab CI in some cases, which
has been corrected.

* jt/gitlab-ci-base-fix:
  ci: fix base commit fallback for check-whitespace and check-style
Further code clean-up on the use of hash functions.  Now the
context object knows what hash function it is working with.

* ps/hash-cleanup:
  global: adapt callers to use generic hash context helpers
  hash: provide generic wrappers to update hash contexts
  hash: stop typedeffing the hash context
  hash: convert hashing context to a structure
Convert a handful of unit tests to work with the clar framework.

* sk/unit-tests-0130:
  t/unit-tests: convert strcmp-offset test to use clar test framework
  t/unit-tests: convert strbuf test to use clar test framework
  t/unit-tests: adapt example decorate test to use clar test framework
  t/unit-tests: convert hashmap test to use clar test framework
CI update to make Coverity job work again.

* jk/ci-coverity-update:
  ci: set CI_JOB_IMAGE for coverity job
Signed-off-by: Junio C Hamano <[email protected]>
The use of "echo -e" is not portable and not specified by POSIX.  dash
does not support any options except "-n", and so this script will not
work on operating systems which use that as /bin/sh.

Fortunately, the solution is easy: switch to printf(1), which is
specified by POSIX and allows the escape sequences we want to use.  This
will allow the script to work with any POSIX shell.

Signed-off-by: brian m. carlson <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
Remove the_repository global variable in favor of the repository
argument that gets passed in "builtin/update-server-info.c".

When `-h` is passed to the command outside a Git repository, the
`run_builtin()` will call the `cmd_update_server_info()` function
with `repo` set to NULL and then early in the function, "parse_options()"
call will give the options help and exit, without having to consult much
of the configuration file. So it is safe to omit reading the config when
`repo` argument the caller gave us is NULL.

Mentored-by: Christian Couder <[email protected]>
Signed-off-by: Usman Akinyemi <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
When rebase rewords a commit it picks the commit and then runs "git
commit --amend" to reword it. When the commit is picked the sequencer
tries to reuse existing commits by fast-forwarding if the parents are
unchanged. Rewording an empty commit that has been fast-forwarded fails
because "git commit --amend" is called without "--allow-empty". This
happens because when a commit is fast-forwarded the logic that checks
whether we should pass "--allow-empty" is skipped. Fix this by always
passing "--allow-empty" when rewording a commit. This is safe because we
are amending a commit that has already been picked so if it had become
empty when it was picked we'd have already returned an error.

As "git commit" will happily create empty merge commits without
"--allow-empty" we do not need to pass that flag when rewording merge
commits.

Signed-off-by: Phillip Wood <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
We do not seem to centrally document exhaustively ways to spell
Boolean values.

The description in the Environment Variables of git(1) section
assumes that the reader is already familiar with how "Boolean valued
configuration variables" are specified, without referring to
anything, so there is no way for the readers to find out more.

The description of `bool` in the section on "--type
<type>" in "git config --help" might be the place to do so, but it
is not telling us all that much.

The description of Boolean valued placeholders in the pretty formats
section of "git log --help" enumerates the possible values with "etc."
implying there may be other synonyms; shrink the list of samples and
instead refer to the canonical and authoritative source of truth, which
now is git-config(1).

Signed-off-by: Junio C Hamano <[email protected]>
The -X renormalize (or merge.renormalize config) option is intended to
reduce conflicts due to normalization of newer versions of history.  It
does so by renormalizing files that it is about to do a three-way
content merge on.  Some folks thought it would renormalize all files
throughout the tree, and the previous wording wasn't clear enough to
dispell that misconception.  Update the docs to make it clear that the
merge machinery will only apply renormalization to files which need a
three-way content merge.

(Technically, the merge machinery also does renormalization on
modify/delete conflicts, in order to see if the modification was merely
a normalization; if so, it can accept the delete and not report a
conflict.  But it's not clear that this piece needs to be explained to
users, and trying to distinguish it might feel like splitting hairs and
overcomplicating the explanation, so we leave it out.)

Signed-off-by: Elijah Newren <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
Allow each file to fix the warnings guarded by the macro separately by
moving the definition from the shared xinclude.h into each file that
needs it.

xmerge.c and xprepare.c do not contain any signed vs. unsigned
comparisons so the definition was not included in these files.

Signed-off-by: David Aguilar <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
The loop iteration variable is non-negative and only used in comparisons
against other size_t values.

Signed-off-by: David Aguilar <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
jiangxin and others added 17 commits March 11, 2025 07:35
* 'vi-2.49' of github.com:Nekosha/git-po:
  l10n: Updated translation for vi-2.49
Co-authored-by: Kate Golovanova <[email protected]>
Co-authored-by: Mikhail T. <[email protected]>
Co-authored-by: Tamara Lazerka <[email protected]>
Signed-off-by: Arkadii Yakovets <[email protected]>
Signed-off-by: Kate Golovanova <[email protected]>
Signed-off-by: Mikhail T. <[email protected]>
Signed-off-by: Tamara Lazerka <[email protected]>
* '2.49-uk-update' of github.com:arkid15r/git-ukrainian-l10n:
  l10n: uk: add 2.49 translation
Helped-by: 依云 <[email protected]>
Helped-by: Jiang Xin <[email protected]>
Signed-off-by: Teng Long <[email protected]>
* 'tl/zh_CN_2.49.0_rnd' of github.com:dyrone/git:
  l10n: zh_CN: updated translation for 2.49
Doc markup fix.

* ma/clone-doc-markup-fix:
  git-clone doc: fix indentation
"git version --build-options" stopped showing zlib version by
mistake due to recent refactoring, which has been corrected.

* tc/zlib-ng-fix:
  help: print zlib-ng version number
  help: include git-zlib.h to print zlib version
Doc updates.

* pb/doc-follow-remote-head:
  config/remote.txt: improve wording for 'remote.<name>.followRemoteHEAD'
  config/remote.txt: reunite 'severOption' description paragraphs
Signed-off-by: Junio C Hamano <[email protected]>
Update following components:

  * builtin/clone.c
  * builtin/commit.c
  * builtin/fetch.c
  * builtin/index-pack.c
  * builtin/pack-objects.c
  * builtin/refs.c
  * builtin/repack.c
  * builtin/unpack-objects.c
  * command-list.h
  * diff.c
  * object-file.c
  * parse-options.c
  * promisor-remote.c
  * refspec.c
  * remote.c

Translate following new components:

  * path-walk.c
  * builtin/backfill.c
  * t/helper/test-path-walk.c

Signed-off-by: Bagas Sanjaya <[email protected]>
* 'l10n-de-2.49' of github.com:ralfth/git:
  l10n: update German translation
Co-authored-by: Lumynous <[email protected]>
Signed-off-by: Yi-Jyun Pan <[email protected]>
* 'l10n/zh-TW/2025-03-09' of github.com:l10n-tw/git-po:
  l10n: zh_TW: Git 2.49.0 round 1
l10n-2.49.0-rnd1

* tag 'l10n-2.49.0-rnd1' of https://github.com/git-l10n/git-po:
  l10n: zh_TW: Git 2.49.0 round 1
  l10n: update German translation
  l10n: po-id for 2.49
  l10n: zh_CN: updated translation for 2.49
  l10n: uk: add 2.49 translation
  l10n: tr: Update Turkish translations for 2.49.0
  l10n: ko: fix minor typo in Korean translation
  l10n: it: fix spelling of "sorgente" (Italian for "source")
  l10n: sv.po: Fix Swedish typos
  l10n: sv.po: Update Swedish translation
  l10n: fr: 2.49 round 2
  l10n: bg.po: Updated Bulgarian translation (5836t)
  l10n: Updated translation for vi-2.49
Signed-off-by: Junio C Hamano <[email protected]>
@derrickstolee derrickstolee force-pushed the survey-upstream branch 2 times, most recently from 46f4ec4 to 299f3c3 Compare March 26, 2025 18:29
jeffhostetler and others added 10 commits March 26, 2025 14:30
Start work on a new 'git survey' command to scan the repository
for monorepo performance and scaling problems.  The goal is to
measure the various known "dimensions of scale" and serve as a
foundation for adding additional measurements as we learn more
about Git monorepo scaling problems.

The initial goal is to complement the scanning and analysis performed
by the GO-based 'git-sizer' (https://github.com/github/git-sizer) tool.
It is hoped that by creating a builtin command, we may be able to take
advantage of internal Git data structures and code that is not
accessible from GO to gain further insight into potential scaling
problems.

Co-authored-by: Derrick Stolee <[email protected]>
Signed-off-by: Jeff Hostetler <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
By default we will scan all references in "refs/heads/", "refs/tags/"
and "refs/remotes/".

Add command line opts let the use ask for all refs or a subset of them
and to include a detached HEAD.

Signed-off-by: Jeff Hostetler <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
When 'git survey' provides information to the user, this will be presented
in one of two formats: plaintext and JSON. The JSON implementation will be
delayed until the functionality is complete for the plaintext format.

The most important parts of the plaintext format are headers specifying the
different sections of the report and tables providing concreted data.

Create a custom table data structure that allows specifying a list of
strings for the row values. When printing the table, check each column for
the maximum width so we can create a table of the correct size from the
start.

The table structure is designed to be flexible to the different kinds of
output that will be implemented in future changes.

Signed-off-by: Derrick Stolee <[email protected]>
At the moment, nothing is obvious about the reason for the use of the
path-walk API, but this will become more prevelant in future iterations. For
now, use the path-walk API to sum up the counts of each kind of object.

For example, this is the reachable object summary output for my local repo:

REACHABLE OBJECT SUMMARY
========================
Object Type |  Count
------------+-------
       Tags |   1343
    Commits | 179344
      Trees | 314350
      Blobs | 184030

Signed-off-by: Derrick Stolee <[email protected]>
Now that we have explored objects by count, we can expand that a bit more to
summarize the data for the on-disk and inflated size of those objects. This
information is helpful for diagnosing both why disk space (and perhaps
clone or fetch times) is growing but also why certain operations are slow
because the inflated size of the abstract objects that must be processed is
so large.

Signed-off-by: Derrick Stolee <[email protected]>
In future changes, we will make use of these methods. The intention is to
keep track of the top contributors according to some metric. We don't want
to store all of the entries and do a sort at the end, so track a
constant-size table and remove rows that get pushed out depending on the
chosen sorting algorithm.

Co-authored-by: Jeff Hostetler <[email protected]>
Signed-off-by; Jeff Hostetler <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
Since we are already walking our reachable objects using the path-walk API,
let's now collect lists of the paths that contribute most to different
metrics. Specifically, we care about

 * Number of versions.
 * Total size on disk.
 * Total inflated size (no delta or zlib compression).

This information can be critical to discovering which parts of the
repository are causing the most growth, especially on-disk size. Different
packing strategies might help compress data more efficiently, but the toal
inflated size is a representation of the raw size of all snapshots of those
paths. Even when stored efficiently on disk, that size represents how much
information must be processed to complete a command such as 'git blame'.

Since the on-disk size is likely to be fragile, stop testing the exact
output of 'git survey' and check that the correct set of headers is
output.

Signed-off-by: Derrick Stolee <[email protected]>
The 'git survey' builtin provides several detail tables, such as "top
files by on-disk size". The size of these tables defaults to 100,
currently.

Allow the user to specify this number via a new --top=<N> option or the
new survey.top config key.

Signed-off-by: Derrick Stolee <[email protected]>
While this command is definitely something we _want_, chances are that
upstreaming this will require substantial changes.

We still want to be able to experiment with this before that, to focus
on what we need out of this command: To assist with diagnosing issues
with large repositories, as well as to help monitoring the growth and
the associated painpoints of such repositories.

To that end, we are about to integrate this command into
`microsoft/git`, to get the tool into the hands of users who need it
most, with the idea to iterate in close collaboration between these
users and the developers familar with Git's internals.

However, we will definitely want to avoid letting anybody have the
impression that this command, its exact inner workings, as well as its
output format, are anywhere close to stable. To make that fact utterly
clear (and thereby protect the freedom to iterate and innovate freely
before upstreaming the command), let's mark its output as experimental
in all-caps, as the first thing we do.

Signed-off-by: Johannes Schindelin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet