Task #10435 (closed)
Generate Spec History Extract
Reported by: | ajpatterson | Owned by: | ajpatterson |
---|---|---|---|
Priority: | major | Milestone: | 5.0.0-beta1 |
Component: | Specification | Version: | n.a. |
Keywords: | n.a. | Cc: | jburel, jmoore, crueden-x, mlinkert, rleigh |
Resources: | n.a. | Referenced By: | n.a. |
References: | n.a. | Remaining Time: | n.a. |
Sprint: | n.a. |
Description (last modified by ajpatterson)
Generate a git history extract covering only the needed part of the spec folder in ome.git, for import into bioformats.git.
Attachments (1)
Change History (18)
comment:1 Changed 11 years ago by ajpatterson
comment:2 Changed 11 years ago by ajpatterson
- Description modified (diff)
comment:3 Changed 11 years ago by ajpatterson
- Description modified (diff)
comment:4 Changed 11 years ago by ajpatterson
- Summary changed from Generate History Extract to Generate Spec History Extract
comment:5 Changed 11 years ago by ajpatterson
The above git subtree approach loses the old svn history.
comment:6 Changed 11 years ago by ajpatterson
- Cc jburel jmoore crueden-x added
- Description modified (diff)
I also tried using filter-branch.
[andrew@voile ~/Work]$ git clone ome split-sample-spec
[andrew@voile ~/Work]$ cd split-sample-spec/
[andrew@voile ~/Work/split-sample-spec]$ du -sh .
1.5G .
[andrew@voile ~/Work/split-sample-spec]$ git filter-branch --subdirectory-filter components/specification HEAD
[andrew@voile ~/Work/split-sample-spec]$ du -sh .
1.3G .
[andrew@voile ~/Work/split-sample-spec]$ cd ../
[andrew@voile ~/Work]$ mkdir split-sample-spec2/
[andrew@voile ~/Work]$ cd split-sample-spec2/
[andrew@voile ~/Work/split-sample-spec2]$ git init
[andrew@voile ~/Work/split-sample-spec2]$ git fetch ../split-sample-spec develop
[andrew@voile ~/Work/split-sample-spec2]$ git checkout -b master FETCH_HEAD
[andrew@voile ~/Work/split-sample-spec2]$ du -sh .
103M .
This also loses the old svn history.
comment:7 Changed 11 years ago by ajpatterson
I tried using this approach (filter-branch with a custom script):
This took over 24 hours to run and produced the same result. The git history is OK but the previously imported svn history is missing.
Note: Attaching script I used.
comment:8 Changed 11 years ago by crueden-x
Andrew and I went over this ticket again in person.
First, we identified the "junction points" where things got renamed. In the case of components/specification, this was:
46b98f2df51ca0d3f458fa7001159b5ac82c270d
Merge: 9755be6 a0d6f86
Date: Thu Jan 20 09:02:44 2011 +0000
    Merge specification/master project as components/specification
The above is where the SVN history from the specification repository got merged into openmicroscopy.git.
We created a new branch spec1 pointing at a0d6f86 (the last commit of the specification SVN repository).
We then rewrote spec1 into the components/specification folder, for consistency with the later history, using the command:
git filter-branch -f --prune-empty --tree-filter 'mkdir .tmp && mv * .tmp && mkdir components && mv .tmp components/specification'
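For anyone repeating this, the effect of that tree filter on a single commit can be tried on a scratch directory first. The sketch below is only an illustration: the file names (ome.xsd, Samples/sample.ome) are invented, and it runs the same mkdir/mv steps outside of git.

```shell
#!/bin/sh
# Hypothetical scratch directory standing in for one checked-out commit.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir Samples
touch ome.xsd Samples/sample.ome

# Same steps as the tree filter: park everything in .tmp, then move
# .tmp into place as components/specification.
mkdir .tmp && mv * .tmp && mkdir components && mv .tmp components/specification

# Show where the files ended up.
find . -type f
```

Note that `mv *` does not match dot files, so any hidden files at the top level of a commit stay where they are.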
Then I added a graft (by creating the file .git/info/grafts) as follows:
053f0d7c21ac8e67e6f829d91ef3270e98fcb1ba ef05bcb5448f38e2e809ce77e8049ecefe252b1c
Where ef05bcb was the rewritten final commit of spec1 (i.e., the last commit of the SVN history), and 053f0d7 is the first real Git commit to the new components/specification folder (i.e., after the 46b98f2 merge above). This graft overrode 053f0d7's parent to be ef05bcb, resulting in a seamless history from Git to SVN with no apparent mass rename.
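The same kind of join can be tried on a throwaway repository. The sketch below is an illustration under stated assumptions, not the procedure used on this ticket: it uses `git replace --graft`, which newer Git versions provide as a replacement for the .git/info/grafts file, and the two one-commit histories, the branch name `new`, and the user details are invented stand-ins for the SVN branch and the later Git commits.

```shell
#!/bin/sh
set -e
# Throwaway repository; nothing here touches a real checkout.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"

# "Old" history (stands in for the rewritten SVN branch).
echo one > file && git add file && git commit -q -m "old 1"
old_tip=$(git rev-parse HEAD)

# Unrelated "new" history (stands in for the later Git commits).
git checkout -q --orphan new
echo two > file && git add file && git commit -q -m "new 1"
new_root=$(git rev-parse HEAD)

# Graft: make old_tip the parent of new_root.
git replace --graft "$new_root" "$old_tip"

# The branch now shows both commits as one continuous history.
count=$(git rev-list --count HEAD)
echo "$count"
```

As with the grafts file, the join only becomes permanent once the history is rewritten (for example by a later filter-branch run).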
There was then a complete history of components/specification and components/xsd-fu, but also many other things irrelevant to the migration. (Side note there: we decided not to migrate the validator code in components/validator and components/specification/Xml/Validator.) So it was time for some pruning.
Unfortunately, git's tree filter is very slow, and the grafted repository stood at 8949 commits. So I looked for a faster solution. Though inelegant, I settled on:
git filter-branch -f --prune-empty --index-filter '
  git rm -r --cached --ignore-unmatch \
    components/antlib \
    components/bioformats \
    components/blitz \
    components/client \
    components/common \
    components/dsl \
    components/insight \
    components/model \
    components/rendering \
    components/romio \
    components/server \
    components/tools \
    components/validator \
    docs \
    etc \
    examples \
    lib \
    sql \
'
This is the same technique (--index-filter with an rm command) used by the git-delete-history.sh script mentioned above. It pruned out many irrelevant subtrees, reducing the repository from 8949 to 1617 commits (verified with git rev-list HEAD --count) at a rate of ~2-8 commits per second.
It also had the side effect of baking in the graft permanently, so I cleaned up:
rm .git/info/grafts
Of course, the above only covered the directories in existence at the tip of the branch. We also wanted to prune out the old validator directory:
git filter-branch -f --prune-empty --index-filter 'git rm -r --cached --ignore-unmatch components/specification/Xml/Validator'
This reduced the branch from 1617 to 1456 commits.
There were now few enough commits that I finished it off with a tree filter:
git filter-branch -f --prune-empty --tree-filter '
  rm -f \
    components/specification/Samples/2011-06/LAMBDA-modulo-sample.ome.tiff \
    components/specification/Samples/2011-06/SPIM-modulo-sample.ome.tiff \
    components/specification/Samples/2012-06/LAMBDA-modulo-sample.ome.tiff \
    components/specification/Samples/2012-06/SPIM-modulo-sample.ome.tiff \
    components/specification/Samples/OmeFiles/2011-06/LAMBDA-modulo-sample.ome.tiff \
    components/specification/Samples/OmeFiles/2011-06/SPIM-modulo-sample.ome.tiff \
    components/specification/Samples/OmeFiles/2012-06/LAMBDA-modulo-sample.ome.tiff \
    components/specification/Samples/OmeFiles/2012-06/SPIM-modulo-sample.ome.tiff
  if [ -d components/specification ]
  then
    mv components/specification .spec
  fi
  if [ -d components/xsd-fu ]
  then
    mv components/xsd-fu .xsd-fu
  fi
  rm -rf *
  mkdir components
  if [ -d .spec ]
  then
    mv .spec components/specification
  fi
  if [ -d .xsd-fu ]
  then
    mv .xsd-fu components/xsd-fu
  fi
'
This script first purges the four large sample files (identified above), at each of the two paths where they appear, if present. (I tried doing this as part of the index filters above, but kept receiving errors. So as a workaround, I just tacked the logic onto the tree filter here.) The script then squirrels away both the specification and xsd-fu components if present. It then deletes everything else, and finally restores specification and xsd-fu. After this operation, the repository was down to 1331 commits.
However, looking at directory structure, there were still several hidden files:
$ ls -a
./ ../ .classpath-template .git/ .gitignore .gitmodules .project components/
To prune those safely, I resorted to:
git filter-branch -f --prune-empty --tree-filter '
  find . -maxdepth 1 -name '\''.*'\'' | \
    grep -v '\''^\.$'\'' | \
    grep -v '\''^\.\/\.git$'\'' | \
    xargs rm -rf
'
This finds all dot files & folders in the root directory, filters out '.' and '.git', and deletes the rest. Remaining commits: 1292.
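The pipeline is easier to read without the filter-branch quoting, and it can be checked on a scratch directory before being run over the whole history. A minimal sketch, assuming an invented directory layout (the `.git` here is just an empty directory, not a real repository):

```shell
#!/bin/sh
# Hypothetical scratch directory mimicking the root of one commit.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir .git components
touch .gitignore .gitmodules .project components/keep.txt

# Same pipeline as the tree filter: list top-level dot entries,
# drop "." and "./.git" from the list, delete the rest.
find . -maxdepth 1 -name '.*' |
  grep -v '^\.$' |
  grep -v '^\./\.git$' |
  xargs rm -rf

# Only .git and components should be left.
ls -A
```

The first grep matters because `-name '.*'` can also match `.` itself (it does with GNU find, at least), and passing that to `rm -rf` is best avoided.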
At this point, things were looking good in the working copy, but there was still a problem with the history: it was filled with empty merge commits (sadly, "--prune-empty" is not smart enough to purge those).
To prune the empty merge commits, I used the strategy described in the git mailing list thread titled Removing useless merge commit with "filter-branch":
git filter-branch -f --prune-empty --parent-filter ~/rewrite_parent.rb
Where the contents of rewrite_parent.rb are:
#!/usr/bin/ruby
old_parents = gets.chomp.gsub('-p ', ' ')

if old_parents.empty? then
  new_parents = []
else
  new_parents = `git show-branch --independent #{old_parents}`.split
end

puts new_parents.map{|p| '-p ' + p}.join(' ')
Not the simplest possible solution, but it did the job. Commit count reduced from 1292 to 807.
The commit history was looking really good now. But the authors were still very inconsistent, which made git shortlog -nse very hard to read. So I cleaned them up:
git filter-branch -f --commit-filter '
  if [ "$GIT_AUTHOR_NAME" = "andrew" -o "$GIT_AUTHOR_NAME" = "ajpatterson" -o "$GIT_AUTHOR_NAME" = "Andrew J Patterson" ]; then
    GIT_COMMITTER_NAME="Andrew J Patterson";
    GIT_COMMITTER_EMAIL="ajpatterson@lifesci.dundee.ac.uk";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  elif [ "$GIT_AUTHOR_NAME" = "cxallan" -o "$GIT_AUTHOR_NAME" = "callan" -o "$GIT_AUTHOR_NAME" = "Chris Allan" ]; then
    GIT_COMMITTER_NAME="Chris Allan";
    GIT_COMMITTER_EMAIL="callan@glencoesoftware.com";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  elif [ "$GIT_AUTHOR_NAME" = "Roger Leigh" ]; then
    GIT_COMMITTER_NAME="Roger Leigh";
    GIT_COMMITTER_EMAIL="r.leigh@dundee.ac.uk";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  elif [ "$GIT_AUTHOR_NAME" = "jburel" -o "$GIT_AUTHOR_NAME" = "jean-marie burel" ]; then
    GIT_COMMITTER_NAME="Jean-Marie Burel";
    GIT_COMMITTER_EMAIL="j.burel@dundee.ac.uk";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  elif [ "$GIT_AUTHOR_NAME" = "jmoore" -o "$GIT_AUTHOR_NAME" = "Josh Moore" ]; then
    GIT_COMMITTER_NAME="Josh Moore";
    GIT_COMMITTER_EMAIL="josh@glencoesoftware.com";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  elif [ "$GIT_AUTHOR_NAME" = "ctrueden" -o "$GIT_AUTHOR_NAME" = "Curtis Rueden" ]; then
    GIT_COMMITTER_NAME="Curtis Rueden";
    GIT_COMMITTER_EMAIL="ctrueden@wisc.edu";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  elif [ "$GIT_AUTHOR_NAME" = "melissa" -o "$GIT_AUTHOR_NAME" = "mlinkert-x" -o "$GIT_AUTHOR_NAME" = "Melissa Linkert" ]; then
    GIT_COMMITTER_NAME="Melissa Linkert";
    GIT_COMMITTER_EMAIL="melissa@glencoesoftware.com";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  elif [ "$GIT_AUTHOR_NAME" = "donald" ]; then
    GIT_COMMITTER_NAME="Donald MacDonald";
    GIT_COMMITTER_EMAIL="donald@lifesci.dundee.ac.uk";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  else
    git commit-tree "$@";
  fi
' HEAD
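The alias-to-identity mapping in that filter can also be written as a lookup, which is easier to extend and to test on its own. A sketch only: `canonical_author` is a hypothetical helper, not part of the filter that was actually run, and just three of the mappings are shown.

```shell
#!/bin/sh
# Map a raw author name to "Canonical Name <email>".
# Unknown names print nothing, so the caller can leave them unchanged.
canonical_author() {
  case "$1" in
    andrew|ajpatterson|"Andrew J Patterson")
      echo "Andrew J Patterson <ajpatterson@lifesci.dundee.ac.uk>" ;;
    cxallan|callan|"Chris Allan")
      echo "Chris Allan <callan@glencoesoftware.com>" ;;
    jmoore|"Josh Moore")
      echo "Josh Moore <josh@glencoesoftware.com>" ;;
    *) ;;
  esac
}

# Example lookups.
canonical_author callan
canonical_author jmoore
```

Inside a commit filter the result would still need to be split back into the GIT_AUTHOR_* and GIT_COMMITTER_* variables before calling git commit-tree; that part is left out here.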
By this point, the history differed from the original SVN repository so much that leaving the git-svn-id metadata in place was rather misleading. So I removed it:
git filter-branch -f --msg-filter ' sed -e "/^git-svn-id:/d" '
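The sed expression can be checked against a sample message before rewriting anything. A minimal sketch; the commit message, URL, revision and UUID below are invented for illustration.

```shell
#!/bin/sh
# Sample commit message with a made-up git-svn-id trailer.
msg='Fix schema typo

git-svn-id: svn://example.invalid/specification/trunk@1234 00000000-0000-0000-0000-000000000000'

# Same expression as the msg-filter: delete every line starting with "git-svn-id:".
cleaned=$(printf '%s\n' "$msg" | sed -e "/^git-svn-id:/d")
printf '%s\n' "$cleaned"
```

Only lines that start with git-svn-id: are dropped; the rest of the message passes through unchanged.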
Finally, we have a very clean history containing *only* the two desired subtrees, ready to be merged into bioformats.git. This branch has been pushed to ctrueden/ome-xml-history of openmicroscopy.git.
Remaining steps for specification migration:
- Verify the final size of the repository, pruning further if needed
- git merge it in to bioformats.git
- Make the build system work as-is
- Add more automation to the build system (i.e., code generation as part of the build)
- Update Jenkins accordingly
All: please let me know if you have any questions.
Andrew: since this is your ticket, I leave the rest to you.
comment:9 Changed 11 years ago by ajpatterson
Preparing the branch (in a clean repo). Used:
git clone bioformats bioplus
cd bioplus/
git fetch ../ome ome-xml-history
git graph
git branch
git checkout -b ome-xml-history FETCH_HEAD
git checkout -b add-spec-and-xsdfu origin/develop
git merge ome-xml-history
git graph
git blame components/specification/InProgress/ome.xsd
git push gh add-spec-and-xsdfu
Push is in progress but github is on a go slow... will add link when push complete.
comment:10 Changed 11 years ago by ajpatterson
comment:11 Changed 11 years ago by jmoore
As Curtis mentioned, about 70MB increase:
Josh-Moores-MacBook-Pro:tmp moore$ git clone -b add-spec-and-xsdfu --single-branch https://github.com/qidane/bioformats
Cloning into 'bioformats'...
remote: Counting objects: 138323, done.
remote: Compressing objects: 100% (25617/25617), done.
remote: Total 138323 (delta 100717), reused 136220 (delta 99663)
Receiving objects: 100% (138323/138323), 223.85 MiB | 264 KiB/s, done.
Resolving deltas: 100% (100717/100717), done.
Checking out files: 100% (2996/2996), done.
Josh-Moores-MacBook-Pro:tmp moore$ du bioformats/
228M bioformats//.git
76K bioformats//ant
44M bioformats//components
36K bioformats//config
4.7M bioformats//docs
21M bioformats//jar
44K bioformats//lib
56K bioformats//license
20K bioformats//pom
100K bioformats//tools
298M bioformats/
Josh-Moores-MacBook-Pro:tmp moore$ cd bioformats/
Josh-Moores-MacBook-Pro:bioformats moore$ git branch -a
* add-spec-and-xsdfu
  remotes/origin/add-spec-and-xsdfu
Josh-Moores-MacBook-Pro:bioformats moore$ cd ..
Josh-Moores-MacBook-Pro:tmp moore$ git clone -b develop --single-branch https://github.com/openmicroscopy/bioformats bioformats_origin
Cloning into 'bioformats_origin'...
remote: Counting objects: 119712, done.
remote: Compressing objects: 100% (19576/19576), done.
remote: Total 119712 (delta 89999), reused 117600 (delta 88937)
Receiving objects: 100% (119712/119712), 168.89 MiB | 387 KiB/s, done.
Resolving deltas: 100% (89999/89999), done.
Checking out files: 100% (2540/2540), done.
Josh-Moores-MacBook-Pro:tmp moore$ du bioformats_origin/
172M bioformats_origin//.git
76K bioformats_origin//ant
30M bioformats_origin//components
36K bioformats_origin//config
4.7M bioformats_origin//docs
21M bioformats_origin//jar
44K bioformats_origin//lib
56K bioformats_origin//license
20K bioformats_origin//pom
100K bioformats_origin//tools
228M bioformats_origin/
Josh-Moores-MacBook-Pro:tmp moore$
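The two du listings above can be subtracted directly to see where the roughly 70MB comes from. A small sketch using only the figures quoted there (sizes in MB as reported by du; the variable names are just for illustration):

```shell
#!/bin/sh
# Sizes in MB, copied from the du output above.
spec_total=298; base_total=228
spec_git=228;   base_git=172
spec_comp=44;   base_comp=30

total_delta=$((spec_total - base_total))
git_delta=$((spec_git - base_git))
comp_delta=$((spec_comp - base_comp))

echo "total: +${total_delta}M (.git: +${git_delta}M, components: +${comp_delta}M)"
```

So most of the growth sits in .git (the imported history), with the remainder in the checked-out components directory.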
comment:12 Changed 11 years ago by jmoore
- Cc mlinkert rleigh added
comment:13 Changed 11 years ago by jmoore
Using a slightly modified version of the script here http://stackoverflow.com/questions/298314/find-files-in-git-repo-over-x-megabytes-that-dont-exist-in-head/7945209#7945209, I found these big files in the history:
Josh-Moores-MacBook-Pro:bioformats moore$ ./big.rb HEAD^2 components/specification 5.0
5.5M components/specification/Documentation/Artwork/Logos/OMERO/vector/omero-logo.eps (ca75a3a: 7 months ago)
5.5M components/specification/Documentation/Artwork/Logos/OMERO/vector/omero-logo-bw.eps (ca75a3a: 7 months ago)
5.8M components/specification/Documentation/Diagrams/Enterprise/omerodbdiagrams2010-09.eap (98eae52: 3 years ago)
10.5M components/specification/Documentation/Diagrams/Enterprise/Omero4-1DbDiagram-March2010.pdf (85614bd: 3 years ago)
5.5M components/specification/Documentation/Artwork/Logos/OME/vector/ome-logo.eps (ca75a3a: 7 months ago)
5.5M components/specification/Documentation/Artwork/Logos/Scifio/vector/scifio-logo-bw.eps (ca75a3a: 7 months ago)
7.6M components/specification/OME/src/DataModel/EA Diagrams/OmeroDbDiagrams.eap (1b2ea4b: 6 years ago)
7.6M components/specification/Documentation/Diagrams/Enterprise/OmeroDbDiagrams.eap (7149def: 4 years, 10 months ago)
5.5M components/specification/Documentation/Artwork/Logos/Bio-Formats/vector/bio-formats-logo.eps (ca75a3a: 7 months ago)
5.5M components/specification/Documentation/Artwork/Logos/Scifio/vector/scifio-logo.eps (ca75a3a: 7 months ago)
5.5M components/specification/Documentation/Artwork/Logos/OME/vector/ome-logo-bw.eps (ca75a3a: 7 months ago)
7.6M components/specification/Documentation/Diagrams/Enterprise/OmeroDbDiagrams.eap (53b6291: 3 years ago)
18.1M components/specification/Samples/OmeFiles/2011-06/SPIM-modulo-sample.ome.tiff (dbb91b6: 1 year, 6 months ago)
7.6M components/specification/OME/src/DataModel/EA Diagrams/OmeroDbDiagrams.eap (526ecf7: 6 years ago)
17.5M components/specification/Samples/OmeFiles/2011-06/SPIM-modulo-sample.ome.tiff (296fbc8: 1 year, 6 months ago)
5.5M components/specification/Documentation/Artwork/Logos/Bio-Formats/vector/bio-formats-logo-bw.eps (ca75a3a: 7 months ago)
18.2M components/specification/Samples/OmeFiles/2012-06/SPIM-modulo-sample.ome.tiff (948dcf0: 6 weeks ago)
Script is here:
#!/usr/bin/env ruby -w
head, subpath, treshold = ARGV
head ||= 'HEAD'
Megabyte = 1000 ** 2
treshold = (treshold || 0.1).to_f * Megabyte
print treshold

if subpath
  pattern = "git rev-list #{head} -- #{subpath}"
  puts pattern
else
  pattern = "git rev-list #{head}"
end

big_files = {}

IO.popen(pattern, 'r') do |rev_list|
  rev_list.each_line do |commit|
    commit.chomp!
    for object in `git ls-tree -zrl #{commit}`.split("\0")
      bits, type, sha, size, path = object.split(/\s+/, 5)
      size = size.to_i
      big_files[sha] = [path, size, commit] if size >= treshold
    end
  end
end

big_files.each do |sha, (path, size, commit)|
  where = `git show -s #{commit} --format='%h: %cr'`.chomp
  puts "%4.1fM\t%s\t(%s)" % [size.to_f / Megabyte, path, where]
end
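The size filter at the heart of that script can also be sketched in shell with awk, working on lines in the format printed by `git ls-tree -r -l` (mode, type, sha and size, then a tab and the path). This is an illustration rather than a replacement for the script: `big_blobs` is a hypothetical helper, and the sample listing (shas, sizes, paths) is invented so the sketch runs without a repository.

```shell
#!/bin/sh
# Print "<size>M<TAB><path>" for entries at or above a threshold given in MB.
# Input: lines shaped like "<mode> <type> <sha> <size><TAB><path>".
big_blobs() {
  awk -v mb="$1" -F '\t' '
    {
      split($1, meta, " ")   # meta[4] is the size in bytes
      if (meta[4] + 0 >= mb * 1000000)
        printf "%.1fM\t%s\n", meta[4] / 1000000, $2
    }'
}

# Invented sample listing; the first path contains a space on purpose.
sample=$(printf '%s\t%s\n' \
  '100644 blob aaaaaaa 18100000' 'Samples/big sample.ome.tiff' \
  '100644 blob bbbbbbb    2048' 'ome.xsd')

result=$(printf '%s\n' "$sample" | big_blobs 5.0)
printf '%s\n' "$result"
```

Splitting on the tab first keeps paths with spaces (such as the "EA Diagrams" folder in the listing above) in one piece.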
comment:14 Changed 11 years ago by ajpatterson
New version pushed to https://github.com/qidane/bioformats/tree/add-spec-and-xsdfu
All files larger than 1.2 M have been removed from this history except for the following that would not delete:
3.3M components/specification/tags/Schema-2010-04-RC1/Working/Work-PostEvolutionWithXsd.EAP (7f76254: 2 years, 11 months ago)
2.3M components/specification/Xml/Working/Work-DataModel.EAP (1543f96: 5 years ago)
1.5M components/specification/Xml/Working/Work-ScreenWell.EAP (1543f96: 5 years ago)
1.7M components/specification/Xml/Working/PostEvolution.EAP (eba5e80: 6 years ago)
comment:15 Changed 11 years ago by ajpatterson
Problem sorted. It was a sequence of very similar file renames.
New version pushed to https://github.com/qidane/bioformats/tree/add-spec-and-xsdfu
comment:16 Changed 11 years ago by ajpatterson
- Resolution set to fixed
- Status changed from new to closed
comment:17 Changed 11 years ago by ajpatterson
Another approach found by Josh (in case we need to do this sort of thing again):
http://www.donarmstrong.com/posts/migrating_from_svn_to_git_and_git_annex/
After a bit of playing had some success with: