Project:Infrastructure/Git migration

Status
Final hosting is ready. Launch is planned for the weekend of August 8/9.

Blockers

 * Infra Manpower
 * Ensure availability of final history conversion host
   * Needs lots of RAM, parallel CPU and some SSD backing
   * Consider a 1-month Hetzner server bidding option
   * Consider a RackSpace OnMetal I/O node (if available by the hour/day)
   * Consider a large AWS instance by hour
     * r3.2xlarge, m4.4xlarge, c4.8xlarge; maybe even larger?

Steps
Top-level items in bold are considered critical path to service migration.


 1) Freeze
    * No more CVS commits to  ever again
    * CVS->rsync conversion frozen
 2) Take backups
    * Final tree snapshot
    * Final CVS history backup
    * Publish both
 3) Perform cleanups on final snapshot
    * Remove ChangeLog files
    * Convert to thin manifests
 4) Publish cleaned snapshot as reference
 5) Commit fixed snapshot as initial signed commit on new history
 6) Allow developers to clone new repo and commit to it
 7) Turn on git->rsync
    * Manifests: Converts thin->thick
    * Changelogs: (temporary) we explicitly copy the changelog-as-is from the final
 8) Review/fix all scripts for further breakages
 9) Perform history conversion
 10) Re-introduce cleanups in history
    * The state of (history conversion + cleanups) MUST match the state of (initial commit) at this point
 11) Make converted history available as graft point
 12) Adjust git->sync
    * Re-enable true ChangeLog generation
    * (maybe) Implement ChangeLog expiry mechanisms
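
The requirement that the converted history plus cleanups must match the initial commit can be checked mechanically. The following is a minimal sketch, not the project's actual tooling: it assumes both states have been checked out into two plain directories and compares them file by file. The function names tree_digest and tree_differences are hypothetical. With two git commits at hand, comparing their tree object IDs would be an alternative.

```python
import hashlib
import os


def tree_digest(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip VCS metadata directories so only real content is compared.
        dirnames[:] = [d for d in dirnames if d not in (".git", "CVS")]
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def tree_differences(left_root, right_root):
    """Return sorted paths that are missing on one side or differ in content."""
    left = tree_digest(left_root)
    right = tree_digest(right_root)
    return sorted(
        path
        for path in left.keys() | right.keys()
        if left.get(path) != right.get(path)
    )
```

An empty result from tree_differences means the two checkouts are identical, which is the condition the step above demands before the converted history is offered as a graft point.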

Resources

 * rich0's validation code: https://github.com/rich0/gitvalidate
 * ferringb's generation code: git://pkgcore.org/git-conversion-tools

People
This is in a roughly chronological order, and apologies to anybody that was left out.


 * Alec Warner (antarus) - did the GSoC 2006 migration tests
 * Robin H. Johnson (robbat2) - infra guy, herding this project
 * Nguyen Thai Ngoc Duy (pclouds) - Former Gentoo developer, wrote Git features for the migration
 * Michael Haggerty - upstream cvs2svn author
 * Brian Harring (ferringb) - wrote much python to improve cvs2svn
 * Michael G. Schwern - Perl hacker, fixed git-svn for SVN 1.7 support
 * Rich Freeman (rich0) - validation scripts
 * Patrick Lauer (patrick) - Gentoo dev, running the new 2014 migration work

Contact
For Git migration discussions, subscribe to the gentoo-scm mailing list.

Goals

 * Each Git commit should be mapped to one or more CVS commits
 * Portage two-phase commits (commit 1: ebuilds/files/Manifest, commit 2: Manifest regenerated from $Header$ changes, optionally GPG-signed) should be mapped to a single commit
 * Portage trailer data in CVS commit log should be converted to newline format Git logs
 * As the validation settles, it should become possible to have CVS commits generate known Git commit IDs
 * Start list of validated commit IDs
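
The two-phase commit goal can be illustrated with a small sketch. This is not the conversion tooling itself. It assumes each CVS commit is available as a record with author, timestamp, message and file list, and it treats a commit as the second phase when it comes from the same author shortly after the previous one and touches only Manifest files in directories the previous commit already touched. The Commit type, the function names and the 600-second window are all hypothetical choices made for illustration.

```python
import posixpath
from dataclasses import dataclass


@dataclass
class Commit:
    author: str
    timestamp: int
    message: str
    files: list
    cvs_commits: int = 1  # how many CVS commits this record represents


def is_manifest_followup(first, second, window=600):
    """True if `second` looks like the Manifest-only phase of `first`."""
    if not second.files or second.author != first.author:
        return False
    if not 0 <= second.timestamp - first.timestamp <= window:
        return False
    if not all(posixpath.basename(p) == "Manifest" for p in second.files):
        return False
    first_dirs = {posixpath.dirname(p) for p in first.files}
    return all(posixpath.dirname(p) in first_dirs for p in second.files)


def merge_two_phase(commits, window=600):
    """Fold Manifest-only follow-up commits into the commit before them."""
    merged = []
    for commit in sorted(commits, key=lambda c: c.timestamp):
        if merged and is_manifest_followup(merged[-1], commit, window):
            previous = merged[-1]
            previous.files.extend(p for p in commit.files if p not in previous.files)
            previous.cvs_commits += commit.cvs_commits
        else:
            # Copy the record so the caller's input is left untouched.
            merged.append(
                Commit(commit.author, commit.timestamp, commit.message,
                       list(commit.files), commit.cvs_commits)
            )
    return merged
```

Each merged record keeps a count of the CVS commits it absorbed, which matches the first goal above: every Git commit maps to one or more CVS commits.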

Pseudocode
 do {
   do {
     adjust conversion scripts
     do test conversion
     validate all newly converted commits
   } while (not validation passed on all commits)
   switch CVS to read only
   do final conversion
   final validation
   if (final validation passed) {
     activate Git repo for public commits
     lock CVS permanently
   } else {
     unlock CVS
   }
 } while (still using CVS)

Historical migration
Here is how to generate the historical migration in git:
 * Patch cvs2svn to use "/" as the separator in the date format in keywords. http://dev.gentoo.org/~rich0/gitmig/cvs2svn.patch
 * Use the migration scripts at: https://github.com/gentoo/git-migration-scripts-rich0
 * (provide list of dependencies for scripts)
 * Obtain tarball of cvsroot (or squashfs - preferable for cache use)
 * Place/mount cvs in cvs-repo
 * Run script.sh --fast
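 * From the git directory, run git bundle create <file> master (git bundle create needs an output file name before the ref; choose any name for <file>)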

Validation
Quick notes on how to test:
 * Source for the validation scripts at: https://github.com/rich0/gitvalidate.git
 * Clone the git bundle into a directory
 * Extract the cvs root into a directory
 * (uncertain - may need to set up local bind mounts or symlinks to match the path in the cvs keywords)
 * Checkout the cvs gentoo-x86 module into another directory
 * (uncertain - may need to edit config files to ensure that cvs checkouts hit the local root and don't hit Gentoo infra. Test before running the script, or watch the script: if it isn't using near 100% CPU, it is probably hammering the server, so stop it!)
 * Use git log to obtain the hash of the last git commit
 * Point TMPDIR at a location with ~10GB of space (/tmp on tmpfs may not cut it and sort will fail).
 * Run gitdump/gitprocesstree.sh > g
 * Run cvsdump/cvsprocesstree.sh  . > c
 * Create a table in mysql to hold the cvs output:

    CREATE TABLE `cvs` (
      `key` int(11) NOT NULL AUTO_INCREMENT,
      `filename` varchar(500) COLLATE utf8_bin NOT NULL,
      `type` varchar(5) COLLATE utf8_bin NOT NULL,
      `hash` varchar(50) COLLATE utf8_bin NOT NULL,
      `timestamp` int(11) NOT NULL,
      `author` varchar(200) COLLATE utf8_bin NOT NULL,
      `message` text COLLATE utf8_bin NOT NULL,
      `revision` varchar(10) COLLATE utf8_bin NOT NULL,
      PRIMARY KEY (`key`),
      KEY `filename` (`filename`(255),`hash`),
      KEY `hash` (`hash`)
    ) ENGINE=MyISAM AUTO_INCREMENT=3132434 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

 * Create a table in mysql to hold the git output:

    CREATE TABLE `git` (
      `key` int(11) NOT NULL AUTO_INCREMENT,
      `filename` varchar(500) COLLATE utf8_bin NOT NULL,
      `type` varchar(5) COLLATE utf8_bin NOT NULL,
      `hash` varchar(50) COLLATE utf8_bin NOT NULL,
      `timestamp` int(11) NOT NULL,
      `author` varchar(200) COLLATE utf8_bin NOT NULL,
      `message` text COLLATE utf8_bin NOT NULL,
      `commit` varchar(50) COLLATE utf8_bin NOT NULL,
      PRIMARY KEY (`key`),
      KEY `filename` (`filename`(255),`hash`),
      KEY `hash` (`hash`)
    ) ENGINE=MyISAM AUTO_INCREMENT=3030211 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

 * Define the base64 handling procedures found at http://stackoverflow.com/questions/358500/base64-encode-in-mysql
 * Load the data into the tables:

    load data local infile 'c' into table cvs
      fields terminated by ',' lines terminated by '\n'
      (filename,type,hash,timestamp,author,message,revision);
    load data local infile 'g' into table git
      fields terminated by ',' lines terminated by '\n'
      (filename,type,hash,timestamp,author,message,commit);

 * Process the data into several tables:

    -- hashes present on only one side
    create table onlycvs ENGINE = MYISAM
      select cvs.* from `cvs` left join `git` as g on cvs.hash=g.hash
      where g.hash is null;
    create table onlygit ENGINE = MYISAM
      select g.* from `git` as g left join `cvs` on cvs.hash=g.hash
      where cvs.hash is null;
    delete from onlycvs where revision="1.1.1.1";
    delete from onlycvs where filename like "%Manifest%";
    delete from onlygit where filename like "%Manifest%";

    -- matching content whose timestamps differ by more than an hour
    create table baddate ENGINE = MYISAM
      select c.*, g.commit
      from `cvs` as c join `git` as g on (g.hash=c.hash and g.filename=c.filename)
      where abs(c.timestamp - g.timestamp) > 60*60;

    -- matching content whose commit messages differ
    create table badmessage ENGINE = MYISAM
      select c.*, g.author as gauthor, g.commit, g.message as gmessage
      from `cvs` as c join `git` as g on (g.hash=c.hash and g.filename=c.filename)
      where c.message <> g.message
        and g.filename not like "%Manifest%"
        and abs(c.timestamp - g.timestamp) < 60*60;
    UPDATE `badmessage`
      SET `author`=BASE64_DECODE(`author`), `gauthor`=BASE64_DECODE(`gauthor`),
          `message`=BASE64_DECODE(`message`), `gmessage`=BASE64_DECODE(`gmessage`);
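
For small test dumps, the same comparison can be approximated in memory without MySQL. The sketch below is an assumption-laden illustration rather than part of the gitvalidate scripts: it presumes the c and g dumps have already been parsed into rows with the column order used by the tables above, and the names CvsRow, GitRow and compare_dumps are hypothetical. It does not parse the dump files or base64-decode any fields.

```python
from collections import namedtuple

CvsRow = namedtuple("CvsRow", "filename type hash timestamp author message revision")
GitRow = namedtuple("GitRow", "filename type hash timestamp author message commit")

ONE_HOUR = 60 * 60


def compare_dumps(cvs_rows, git_rows):
    """Classify rows along the lines of the onlycvs/onlygit/baddate/badmessage tables."""
    cvs_hashes = {row.hash for row in cvs_rows}
    git_hashes = {row.hash for row in git_rows}

    # Index git rows by (hash, filename), the join key used for the date and message checks.
    git_by_key = {}
    for row in git_rows:
        git_by_key.setdefault((row.hash, row.filename), []).append(row)

    only_cvs = [
        r for r in cvs_rows
        if r.hash not in git_hashes
        and r.revision != "1.1.1.1"
        and "Manifest" not in r.filename
    ]
    only_git = [
        r for r in git_rows
        if r.hash not in cvs_hashes and "Manifest" not in r.filename
    ]

    bad_date, bad_message = [], []
    for c in cvs_rows:
        for g in git_by_key.get((c.hash, c.filename), []):
            drift = abs(c.timestamp - g.timestamp)
            if drift > ONE_HOUR:
                bad_date.append((c, g))
            elif drift < ONE_HOUR and c.message != g.message and "Manifest" not in g.filename:
                bad_message.append((c, g))

    return {
        "only_cvs": only_cvs,
        "only_git": only_git,
        "bad_date": bad_date,
        "bad_message": bad_message,
    }
```

A clean conversion would leave all four lists empty. The full tree holds millions of rows, so the database approach remains the practical one there.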

2006

 * The first major work in VCS Migration was done as a GSoC 2006 project by Alec Warner (antarus).
 * Git was mostly too resource intensive at this point for serious consideration, and was slower than CVS.
 * Conversion takes more than 7 days.
 * Decision to stay on CVS

2009

 * April:
 * Converting a recent CVS copy - Item 1: mailmap fun
 * Converting a recent CVS copy - Item 2: statistics
 * Conversion time: 18.5 hours
 * June:
 * Progress summary, 2009/06/01
 * Conversion time: 9 hours
 * Bug in cvs2svn/cvs2git causes lines of files to be lost
 * ExternalBlobGenerator module created by upstream author, originally closed source, and non-public: improves pass1 from 36204 seconds to 1598 seconds


 * October: Gentoo meeting at the GSoC Mentor Summit
 * All Gentoo developers present held a meeting, one of the major topics was blockers and plans for the Git migration.
 * Shawn Pearce, one of the major Git developers, and author of the Repo tool.
 * Decision between a monolithic repo, per-category repos, or per-package repos: the monolithic repo wins.

2010

 * Brian Harring (ferringb) takes on Python improvements with snakeoil and Unladen Swallow
 * Gentoo SCM conversion status report, 2010/01/27
 * Conversion time: 110 minutes
 * Commit Signing & Sparse Trees identified as requirements

2011

 * August:
 * Re: gentoo-dev Progress on cvs->git migration (status report)
 * Unresolved items: commit signing, thin Manifests, merge policies
 * September:
 * Portage gets thin Manifest support
 * October:
 * commit: teach --gpg-sign option

2012

 * May-July:
 * Bug #418431: (git-svn is broken with SVN 1.7 and can corrupt data) causes a hassle for Git work (part of the migration process at this time relies heavily on the cvs2svn codebase)
 * October:
 * Email [gentoo-scm] Fwd: [gentoo-dev] CIA replacement on 2012/10/01 by rich0.
 * Bug #333531: portage migration to git (tracker bug)
 * Outstanding items: pre-upload hook, git2rsync scripts, validation, documentation
 * Email [gentoo-scm] CVS -> git, list of where non-infra folk can contribute on 2012/10/01 by ferringb
 * Lays out the many tasks well
 * http://git.stuge.se/?p=portage.git;a=commitdiff;h=thickandthin mentioned for merging, still not done?

2014

 * February: Progress made on some blockers (i.e. they were found to be obsolete)
 * Bug #333531: portage migration to git (tracker bug)
 * Major outstanding items:
 * Wait for jk/pack-bitmap to land in a git release (pack-bitmap landed in git 2 release)
 * Enforce GPG commit signing
 * Get gitolite to log to syslog
 * March: GLEP 63 - Minimum requirement and a recommended set of GPG key management policies for the Gentoo Linux distribution.
 * May: Gentoo Keys: Tool that manages GPG key validation/updates and performs multiple "health" checks on GPG keys
 * October: Regular test migrations happening, based on 2014/09/15 snapshot:
 * [converted repo]
 * [Migration tools]
 * [History validation tools]