User:Robbat2:ChangeLog-Generation

From Gentoo Wiki
Jump to:navigation Jump to:search

ChangeLog Generation

Project to improve ChangeLog generation.

Existing implementation

what

egencache

what's wrong

very slow

what it does

  1. For each package, in portdir:
    1. Fork AsyncFunction
    2. chdir($CP_DIR)
    3. Get timestamp of top commit, run: git log --format=%ct -1 .
    4. If older than ChangeLog, break
    5. Get the list of commits from start of history: git rev-list HEAD -- .
    6. For each commit:
      1. Get changed-files, run: git diff-tree --name-status --no-renames --format=... --root --relative=$CP_DIR -r $COMMIT -- .
      2. Parse git output
      3. Build ChangeLog entry
    7. Combine ChangeLog entries in expected direction (forward or reverse)

Every fork out to git means reading all of the git data structures on disk, and while they will be cached, it's still a lot of work.

analysis of what it does

At the time of writing this, the tree statistics:

  • 19225 packages.
  • 40039 ebuilds.
  • 74625 unique commits.

Running rev-list on each package directory results in

  • total of 184773 commits being listed (calling such output pkg-commit for clarity for the rest of this section).
  • The 14 commits that touch the most packages account for approximately 50% of all pkg-commits.
  • The next most-common 86 commits (such that you get 100 commits in total) only account for a further 8.3% of all pkg-commits.
  • There are packages with 300+ commits in their history.

best case

This will happen if there have been no changes between two runs. It still evaluates every single package in PortDB.

  1. It still evaluates every single package in PortDB.
  2. Fork for each package.
  3. Call a single git-log for EVERY single package (takes 200+ ms)

With 19000+ packages in the tree, this takes north of 3800 seconds.

worst case

This will happen only if there is a global change.

In addition to the best case above, it ALSO does:

  1. git-rev-list (2-4 seconds per package).
  2. git-diff-tree (50ms per commit)
  3. Commits touching multiple package paths are evaluated MORE than once.
    1. This makes tree-wide and category-wide commits VERY expensive for ChangeLogs.

New Proposal

Design

Data structures

  • For each commit:
    • metadata, message, files-changed
  • For each package:
    • List of commits

Design problems

Every commit has 3 timestamps:

  • Author time.
  • Commit time (implicit by gpg signing time).
  • Time first-seen on main tree.

They have two implications:

  • What date should be shown on a changelog entry.
  • What order should the changelog entries appear in.

In the case of merge commits, author & commit time could be very old, and we would still want the new commit to show up at the top of history if it was added to the main tree after another commit. This either means being aware multiple active paths through git history, or finding another way to imply the ordering of commits.

Additionally, even with regular commits, short of receive-time verification, there's NOTHING enforcing developer clock correctness.

Maybe the BEST solution is for server-side introduction of commit notes, with metadata like when the commit was seen.

Process

  1. Data-collect phase:
  2. Run a SINGLE git command
    1. Should be able to take last-seen commit and only output NEW history
    2. Output should make it easy to split commits.
    3. In the worst case, this will output the ENTIRE history; could trivially be restricted (eg by date).
  3. Parsing phase:
  4. Split the output of the git command
    • Responsibilities:
      1. populate the commits database
      2. return a list of changed packages
    • Two possible designs
    1. Pure single-threaded
      • For each commit, examine all of the data and split out to data structure
    2. Hybrid
      • First part is single-threaded, and just scans for the commit separator.
      • Pipe the text of each commit into a separate process that parses & returns data.
      • Pre-format the commit messages into the data structure (they can be re-formatted latter, but it will save a lot of future compute)
  5. Output phase:
  6. Trivially multi-threaded from this point, needs only reads from the commits data structure.
  7. For each changed package:
    1. If most recent commit was a package deletion, delete the ChangeLog file on disk (or stash elsewhere).
    2. Grab (commit-hash, changetime)
    3. Sort by changetime
    4. Write changelog by fetching commits in order.
    5. Optional:
      • write latest hash & changetime to changelog, for ease of validation/processing
      • use it to append or prepend new entries.
      • Do not trust mtime on changelog file

Process breakdown

Git output command
  • Format string: --format='COMMIT%m%n%H %ct %cN <%cE>%n%BFILES%m'
    • Needs to be easy to split: between commits, and separate metadata vs list of files & file-actions
  • File output: --name-status
  • No renames: --no-renames
  • Show root as big diff: --root
  • Use topographic order: --topo-order

git log $FORMAT --name-status --no-renames --root --topo-order

This generates good output, for complete history takes ~10 seconds to generate a 30MiB output.

Topographic order seems mostly correct, but I'm not certain if it remains static as more commits are added.