
Task #10435 (closed)

Opened 11 years ago

Closed 11 years ago

Last modified 11 years ago

Generate Spec History Extract

Reported by: ajpatterson Owned by: ajpatterson
Priority: major Milestone: 5.0.0-beta1
Component: Specification Version: n.a.
Keywords: n.a. Cc: jburel, jmoore, crueden-x, mlinkert, rleigh
Resources: n.a. Referenced By: n.a.
References: n.a. Remaining Time: n.a.
Sprint: n.a.

Description (last modified by ajpatterson)

Generate a Git history extract covering just the needed part of the spec folder in ome.git, for import into bioformats.git.

Attachments (1)

git-script.sh (616 bytes) - added by ajpatterson 11 years ago.
Script used to try and keep svn history


Change History (18)

comment:1 Changed 11 years ago by ajpatterson

After a bit of experimenting, I had some success with:

[andrew@voile ~/Work/ome]$ git subtree split -P components/specification -b split-spec

[andrew@voile ~/Work/ome]$ cd ..
[andrew@voile ~/Work]$ mkdir part-spec
[andrew@voile ~/Work]$ cd part-spec/
[andrew@voile ~/Work/part-spec]$ git init
[andrew@voile ~/Work/part-spec]$ git fetch ../ome split-spec
[andrew@voile ~/Work/part-spec]$ git checkout -b master FETCH_HEAD

It was still over 100M in size due to four files (SPIM and LAMBDA) in the /Samples/ folder.

I used:
git-delete-history.sh Samples/2011-06/LAMBDA-modulo-sample.ome.tiff
git-delete-history.sh Samples/2011-06/SPIM-modulo-sample.ome.tiff
git-delete-history.sh Samples/2012-06/LAMBDA-modulo-sample.ome.tiff
git-delete-history.sh Samples/2012-06/SPIM-modulo-sample.ome.tiff

This removed the four files, but the repository was still over 90M because their history had not been fully removed, so:

[andrew@voile ~/Work/min-sample-spec]$ du -sh .
 94M	.
[andrew@voile ~/Work/min-sample-spec]$ git reflog expire --all
[andrew@voile ~/Work/min-sample-spec]$ git repack -a -d
Counting objects: 3707, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (862/862), done.
Writing objects: 100% (3707/3707), done.
Total 3707 (delta 2705), reused 3707 (delta 2705)
[andrew@voile ~/Work/min-sample-spec]$ git prune-packed
[andrew@voile ~/Work/min-sample-spec]$ git gc --aggressive --prune=now
Counting objects: 3707, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (3567/3567), done.
Writing objects: 100% (3707/3707), done.
Total 3707 (delta 2705), reused 1002 (delta 0)
[andrew@voile ~/Work/min-sample-spec]$ du -sh .
 19M	.
[andrew@voile ~/Work/min-sample-spec]$ git graph
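
For reference, the clean-up sequence above can be wrapped into one helper. This is only a sketch: the function name shrink_repo is made up for illustration, and the --expire=now option is an addition that was not used in the session above (without it, git reflog expire keeps entries newer than the configured expiry time).

```shell
# Hypothetical helper (not from the session above): drop unreachable
# objects from the repository in the current directory.
shrink_repo() {
  # Expire every reflog entry immediately so old commits become unreachable.
  git reflog expire --expire=now --all &&
  # Repack and delete unreachable objects right away.
  git gc --aggressive --prune=now
}
```

Running du -sh . before and after shrink_repo shows whether anything was actually reclaimed.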

comment:2 Changed 11 years ago by ajpatterson

  • Description modified (diff)

comment:3 Changed 11 years ago by ajpatterson

  • Description modified (diff)

comment:4 Changed 11 years ago by ajpatterson

  • Summary changed from Generate History Extract to Generate Spec History Extract

comment:5 Changed 11 years ago by ajpatterson

The above git subtree approach loses the old svn history.

comment:6 Changed 11 years ago by ajpatterson

  • Cc jburel jmoore crueden-x added
  • Description modified (diff)

I also tried using filter-branch.

[andrew@voile ~/Work]$ git clone ome split-sample-spec
[andrew@voile ~/Work]$ cd split-sample-spec/
[andrew@voile ~/Work/split-sample-spec]$ du -sh .
1.5G	.
[andrew@voile ~/Work/split-sample-spec]$ git filter-branch --subdirectory-filter components/specification HEAD
[andrew@voile ~/Work/split-sample-spec]$ du -sh .
1.3G	.
[andrew@voile ~/Work/split-sample-spec]$ cd ../
[andrew@voile ~/Work]$ mkdir split-sample-spec2/
[andrew@voile ~/Work]$ cd split-sample-spec2/
[andrew@voile ~/Work/split-sample-spec2]$ git init
[andrew@voile ~/Work/split-sample-spec2]$ git fetch ../split-sample-spec develop
[andrew@voile ~/Work/split-sample-spec2]$ git checkout -b master FETCH_HEAD
[andrew@voile ~/Work/split-sample-spec2]$ du -sh .
103M	.

This also loses the old svn history.

comment:7 Changed 11 years ago by ajpatterson

I tried this approach (filter-branch with a custom script):

http://stackoverflow.com/questions/14759345/how-to-split-a-git-repository-and-follow-directory-renames

This took over 24 hours to run and produced the same result. The git history is OK but the previously imported svn history is missing.

Note: Attaching script I used.

Changed 11 years ago by ajpatterson

Script used to try and keep svn history

comment:8 Changed 11 years ago by crueden-x

Andrew and I went over this ticket again in person.

First, we identified the "junction points" where things got renamed. In the case of components/specification, this was:

46b98f2df51ca0d3f458fa7001159b5ac82c270d
Merge: 9755be6 a0d6f86
Date:   Thu Jan 20 09:02:44 2011 +0000
    Merge specification/master project as components/specification

The above is where the SVN history from the specification repository got merged into openmicroscopy.git.

We created a new branch spec1 pointing at a0d6f86 (the last commit of the specification SVN repository).

We then rewrote spec1 into the components/specification folder, for consistency with the later history, using the command:

git filter-branch -f --prune-empty --tree-filter 'mkdir .tmp && mv * .tmp && mkdir components && mv .tmp components/specification'
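
What the tree filter does to each checked-out commit can be tried on a scratch directory first. A minimal sketch, wrapping the same commands in a function whose name (nest_under_spec) is invented here; it uses ./* instead of * only so that file names starting with a dash are not mistaken for options.

```shell
# Hypothetical wrapper around the tree-filter body above. Run it inside
# a directory to move all non-hidden entries to components/specification.
nest_under_spec() {
  mkdir .tmp &&
  mv ./* .tmp &&        # the glob skips dot entries, including .tmp itself
  mkdir components &&
  mv .tmp components/specification
}
```

Because the glob ignores dot files, anything hidden at the top level stays where it was.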

Then I added a graft (by creating the file .git/info/grafts) as follows:

053f0d7c21ac8e67e6f829d91ef3270e98fcb1ba ef05bcb5448f38e2e809ce77e8049ecefe252b1c

Where ef05bcb was the rewritten final commit of spec1 (i.e., the last commit of the SVN history), and 053f0d7 is the first real Git commit to the new components/specification folder (i.e., after the 46b98f2 merge above). This graft overrode 053f0d7's parent to be ef05bcb, resulting in a seamless history from Git to SVN with no apparent mass rename.
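
The graft file itself is plain text: each line holds a commit ID followed by the IDs that Git should treat as its parents. Below is a small sketch of adding such a line. The helper name add_graft is hypothetical, and the optional third argument (the Git directory) exists only to make the function easy to try out. Newer Git releases deprecate the grafts file in favour of git replace --graft, which was not used in this ticket.

```shell
# Hypothetical helper: record that commit $1 should have parent $2.
# $3 is the Git directory (defaults to .git).
add_graft() {
  child=$1
  parent=$2
  git_dir=${3:-.git}
  mkdir -p "$git_dir/info" &&
  printf '%s %s\n' "$child" "$parent" >> "$git_dir/info/grafts"
}
```

With the IDs from this ticket, the call would be add_graft 053f0d7c21ac8e67e6f829d91ef3270e98fcb1ba ef05bcb5448f38e2e809ce77e8049ecefe252b1c.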

There was then a complete history of components/specification and components/xsd-fu, but also many other things irrelevant to the migration. (Side note there: we decided not to migrate the validator code in components/validator and components/specification/Xml/Validator.) So it was time for some pruning.

Unfortunately, git's tree filter is very slow, and the grafted repository stood at 8949 commits. So I looked for a faster solution. Though inelegant, I settled on:

git filter-branch -f --prune-empty --index-filter '
git rm -r --cached --ignore-unmatch \
components/antlib \
components/bioformats \
components/blitz \
components/client \
components/common \
components/dsl \
components/insight \
components/model \
components/rendering \
components/romio \
components/server \
components/tools \
components/validator \
docs \
etc \
examples \
lib \
sql \
'

This is the same technique (--index-filter with an rm command) used by the git-delete-history.sh script mentioned above. It pruned out many irrelevant subtrees, reducing the repository from 8949 to 1617 commits (verified with git rev-list HEAD --count) at a rate of ~2-8 commits per second.

It also had the side effect of baking in the graft permanently, so I cleaned up:

rm .git/info/grafts

Of course, the above only covered the directories in existence at the tip of the branch. We also wanted to prune out the old validator directory:

git filter-branch -f --prune-empty --index-filter 'git rm -r --cached --ignore-unmatch components/specification/Xml/Validator'

This reduced the branch from 1617 to 1456 commits.

There were now few enough commits that I finished it off with a tree filter:

git filter-branch -f --prune-empty --tree-filter '
rm -f \
  components/specification/Samples/2011-06/LAMBDA-modulo-sample.ome.tiff \
  components/specification/Samples/2011-06/SPIM-modulo-sample.ome.tiff \
  components/specification/Samples/2012-06/LAMBDA-modulo-sample.ome.tiff \
  components/specification/Samples/2012-06/SPIM-modulo-sample.ome.tiff \
  components/specification/Samples/OmeFiles/2011-06/LAMBDA-modulo-sample.ome.tiff \
  components/specification/Samples/OmeFiles/2011-06/SPIM-modulo-sample.ome.tiff \
  components/specification/Samples/OmeFiles/2012-06/LAMBDA-modulo-sample.ome.tiff \
  components/specification/Samples/OmeFiles/2012-06/SPIM-modulo-sample.ome.tiff
if [ -d components/specification ]
then
  mv components/specification .spec
fi
if [ -d components/xsd-fu ]
then
  mv components/xsd-fu .xsd-fu
fi
rm -rf *
mkdir components
if [ -d .spec ]
then
  mv .spec components/specification
fi
if [ -d .xsd-fu ]
then
  mv .xsd-fu components/xsd-fu
fi
'

This script first purges the four large sample files (identified above) if present. (I tried doing this as part of the index filters above, but kept receiving errors. So as a workaround, I just tacked the logic onto the tree filter here.) The script then squirrels away both the specification and xsd-fu components if present. It then deletes everything else, and finally restores specification and xsd-fu. After this operation, the repository was down to 1331 commits.
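
The keep-two-subtrees logic can be rehearsed outside of filter-branch on a throwaway directory. The sketch below is a simplified variant, not the script used above: the function name keep_spec_and_xsdfu is invented, and it omits the removal of the large sample files.

```shell
# Hypothetical rehearsal of the tree filter: keep only
# components/specification and components/xsd-fu in the current directory.
keep_spec_and_xsdfu() {
  # Stash the two wanted subtrees under hidden names, if present.
  [ -d components/specification ] && mv components/specification .spec
  [ -d components/xsd-fu ] && mv components/xsd-fu .xsd-fu
  # Delete every non-hidden entry; the stashed copies survive because
  # the * glob does not match names starting with a dot.
  rm -rf ./*
  mkdir components
  # Restore whatever was stashed.
  [ -d .spec ] && mv .spec components/specification
  [ -d .xsd-fu ] && mv .xsd-fu components/xsd-fu
  return 0
}
```

Hidden entries at the top level are untouched by this step, which is why a separate pass for dot files is still needed afterwards.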

However, looking at the directory structure, there were still several hidden files:

$ ls -a
./                   ../                  .classpath-template  .git/
.gitignore           .gitmodules          .project             components/

To prune those safely, I resorted to:

git filter-branch -f --prune-empty --tree-filter '
find . -maxdepth 1 -name '\''.*'\'' | \
  grep -v '\''^\.$'\'' | \
  grep -v '\''^\.\/\.git$'\'' | \
  xargs rm -rf
'

This finds all dot files & folders in the root directory, filters out '.' and '.git', and deletes the rest. Remaining commits: 1292.
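
The same selection can likely be expressed with find alone, without the grep stages. This is a sketch rather than the command that was run, assuming a find that supports -mindepth and -maxdepth (GNU and BSD find do); the function name is made up.

```shell
# Hypothetical alternative: delete top-level dot files and dot folders
# in the current directory, except .git.
prune_root_dotfiles() {
  # -mindepth 1 excludes "." itself; ! -name .git protects the repository.
  find . -mindepth 1 -maxdepth 1 -name '.*' ! -name '.git' -exec rm -rf {} +
}
```

Trying it on a copy of the working tree first is a cheap way to confirm that .git is left alone.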

At this point, things were looking good in the working copy, but there was still a problem with the history: it was filled with empty merge commits (sadly, "--prune-empty" is not smart enough to purge those).

To prune the empty merge commits, I used the strategy described on the Removing useless merge commit with "filter-branch" thread of the git mailing list:

git filter-branch -f --prune-empty --parent-filter ~/rewrite_parent.rb

Where the contents of rewrite_parent.rb are:

#!/usr/bin/ruby
old_parents = gets.chomp.gsub('-p ', ' ')

if old_parents.empty? then
  new_parents = []
else
  new_parents = `git show-branch --independent #{old_parents}`.split
end

puts new_parents.map{|p| '-p ' + p}.join(' ')

Not the simplest possible solution, but it did the job. Commit count reduced from 1292 to 807.

The commit history was looking really good now. But the authors were still very inconsistent, which made git shortlog -nse very hard to read. So I cleaned them up:

git filter-branch -f --commit-filter '
  if [ "$GIT_AUTHOR_NAME" = "andrew" -o "$GIT_AUTHOR_NAME" = "ajpatterson" -o "$GIT_AUTHOR_NAME" = "Andrew J Patterson" ];
  then
    GIT_COMMITTER_NAME="Andrew J Patterson";
    GIT_COMMITTER_EMAIL="ajpatterson@lifesci.dundee.ac.uk";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  elif [ "$GIT_AUTHOR_NAME" = "cxallan" -o "$GIT_AUTHOR_NAME" = "callan" -o "$GIT_AUTHOR_NAME" = "Chris Allan" ];
  then
    GIT_COMMITTER_NAME="Chris Allan";
    GIT_COMMITTER_EMAIL="callan@glencoesoftware.com";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  elif [ "$GIT_AUTHOR_NAME" = "Roger Leigh" ];
  then
    GIT_COMMITTER_NAME="Roger Leigh";
    GIT_COMMITTER_EMAIL="r.leigh@dundee.ac.uk";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  elif [ "$GIT_AUTHOR_NAME" = "jburel" -o "$GIT_AUTHOR_NAME" = "jean-marie burel" ];
  then
    GIT_COMMITTER_NAME="Jean-Marie Burel";
    GIT_COMMITTER_EMAIL="j.burel@dundee.ac.uk";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  elif [ "$GIT_AUTHOR_NAME" = "jmoore" -o "$GIT_AUTHOR_NAME" = "Josh Moore" ];
  then
    GIT_COMMITTER_NAME="Josh Moore";
    GIT_COMMITTER_EMAIL="josh@glencoesoftware.com";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  elif [ "$GIT_AUTHOR_NAME" = "ctrueden" -o "$GIT_AUTHOR_NAME" = "Curtis Rueden" ];
  then
    GIT_COMMITTER_NAME="Curtis Rueden";
    GIT_COMMITTER_EMAIL="ctrueden@wisc.edu";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  elif [ "$GIT_AUTHOR_NAME" = "melissa" -o "$GIT_AUTHOR_NAME" = "mlinkert-x" -o "$GIT_AUTHOR_NAME" = "Melissa Linkert" ];
  then
    GIT_COMMITTER_NAME="Melissa Linkert";
    GIT_COMMITTER_EMAIL="melissa@glencoesoftware.com";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  elif [ "$GIT_AUTHOR_NAME" = "donald" ];
  then
    GIT_COMMITTER_NAME="Donald MacDonald";
    GIT_COMMITTER_EMAIL="donald@lifesci.dundee.ac.uk";
    GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME";
    GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL";
    git commit-tree "$@";
  else
    git commit-tree "$@";
  fi
' HEAD
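
The long if/elif chain maps each historical author name to one canonical identity. The mapping itself can be factored into a small lookup, sketched below with only three of the identities from the filter above. The function name canonical_author is hypothetical; inside a real --commit-filter the result would still have to be assigned to the GIT_AUTHOR_* and GIT_COMMITTER_* variables as shown above.

```shell
# Hypothetical lookup: print "Name <email>" for a known author name,
# or return non-zero for names that should be left untouched.
canonical_author() {
  case "$1" in
    andrew|ajpatterson|"Andrew J Patterson")
      echo "Andrew J Patterson <ajpatterson@lifesci.dundee.ac.uk>" ;;
    jmoore|"Josh Moore")
      echo "Josh Moore <josh@glencoesoftware.com>" ;;
    ctrueden|"Curtis Rueden")
      echo "Curtis Rueden <ctrueden@wisc.edu>" ;;
    *)
      return 1 ;;
  esac
}
```

A .mailmap file is another way to unify names in git shortlog output without rewriting history; it was not used in this ticket.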

By this point, the history differed from the original SVN repository so much that leaving the git-svn-id metadata in place was rather misleading. So I removed it:

git filter-branch -f --msg-filter '
sed -e "/^git-svn-id:/d"
'
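
The effect of the message filter can be checked on a sample commit message before rewriting anything. In this sketch the function name, the message text and the SVN URL are all invented.

```shell
# Same sed expression as the --msg-filter above: drop git-svn-id lines.
strip_svn_id() {
  sed -e '/^git-svn-id:/d'
}

# Feed a made-up commit message through the filter.
printf 'Update schema docs\n\ngit-svn-id: https://svn.example.invalid/spec/trunk@1234 00000000-0000-0000-0000-000000000000\n' | strip_svn_id
```

The subject line and the blank line that preceded the trailer are kept; only the git-svn-id line is removed.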

Finally, we have a very clean history containing *only* the two desired subtrees, ready to be merged into bioformats.git. This branch has been pushed to ctrueden/ome-xml-history of openmicroscopy.git.

Remaining steps for specification migration:

  • Verify the final size of the repository, pruning further if needed
  • git merge it into bioformats.git
  • Make the build system work as-is
  • Add more automation to the build system (i.e., code generation as part of the build)
  • Update Jenkins accordingly

All: please let me know if you have any questions.

Andrew: since this is your ticket, I leave the rest to you.

Last edited 11 years ago by crueden-x (previous) (diff)

comment:9 Changed 11 years ago by ajpatterson

To prepare the branch (in a clean repo), I used:

git clone bioformats bioplus
cd bioplus/
git fetch ../ome ome-xml-history
git graph
git branch
git checkout -b ome-xml-history FETCH_HEAD
git checkout -b add-spec-and-xsdfu origin/develop
git merge ome-xml-history
git graph
git blame components/specification/InProgress/ome.xsd 
git push gh add-spec-and-xsdfu

The push is in progress, but GitHub is slow at the moment. I will add a link when the push completes.

comment:11 Changed 11 years ago by jmoore

As Curtis mentioned, about 70MB increase:

Josh-Moores-MacBook-Pro:tmp moore$ git clone -b add-spec-and-xsdfu --single-branch https://github.com/qidane/bioformats
Cloning into 'bioformats'...
remote: Counting objects: 138323, done.
remote: Compressing objects: 100% (25617/25617), done.
remote: Total 138323 (delta 100717), reused 136220 (delta 99663)
Receiving objects: 100% (138323/138323), 223.85 MiB | 264 KiB/s, done.
Resolving deltas: 100% (100717/100717), done.
Checking out files: 100% (2996/2996), done.
Josh-Moores-MacBook-Pro:tmp moore$ du bioformats/
228M	bioformats//.git
 76K	bioformats//ant
 44M	bioformats//components
 36K	bioformats//config
4.7M	bioformats//docs
 21M	bioformats//jar
 44K	bioformats//lib
 56K	bioformats//license
 20K	bioformats//pom
100K	bioformats//tools
298M	bioformats/
Josh-Moores-MacBook-Pro:tmp moore$ cd bioformats/
Josh-Moores-MacBook-Pro:bioformats moore$ git branch -a
* add-spec-and-xsdfu
  remotes/origin/add-spec-and-xsdfu

Josh-Moores-MacBook-Pro:bioformats moore$ cd ..
Josh-Moores-MacBook-Pro:tmp moore$ git clone -b develop --single-branch https://github.com/openmicroscopy/bioformats bioformats_origin
Cloning into 'bioformats_origin'...
remote: Counting objects: 119712, done.
remote: Compressing objects: 100% (19576/19576), done.
remote: Total 119712 (delta 89999), reused 117600 (delta 88937)
Receiving objects: 100% (119712/119712), 168.89 MiB | 387 KiB/s, done.
Resolving deltas: 100% (89999/89999), done.
Checking out files: 100% (2540/2540), done.
Josh-Moores-MacBook-Pro:tmp moore$ du bioformats_origin/
172M	bioformats_origin//.git
 76K	bioformats_origin//ant
 30M	bioformats_origin//components
 36K	bioformats_origin//config
4.7M	bioformats_origin//docs
 21M	bioformats_origin//jar
 44K	bioformats_origin//lib
 56K	bioformats_origin//license
 20K	bioformats_origin//pom
100K	bioformats_origin//tools
228M	bioformats_origin/
Josh-Moores-MacBook-Pro:tmp moore$ 

comment:12 Changed 11 years ago by jmoore

  • Cc mlinkert rleigh added

comment:13 Changed 11 years ago by jmoore

Using a slightly modified version of the script here http://stackoverflow.com/questions/298314/find-files-in-git-repo-over-x-megabytes-that-dont-exist-in-head/7945209#7945209, I found these big files in the history:

Josh-Moores-MacBook-Pro:bioformats moore$ ./big.rb HEAD^2 components/specification 5.0
 5.5M	components/specification/Documentation/Artwork/Logos/OMERO/vector/omero-logo.eps	(ca75a3a: 7 months ago)
 5.5M	components/specification/Documentation/Artwork/Logos/OMERO/vector/omero-logo-bw.eps	(ca75a3a: 7 months ago)
 5.8M	components/specification/Documentation/Diagrams/Enterprise/omerodbdiagrams2010-09.eap	(98eae52: 3 years ago)
10.5M	components/specification/Documentation/Diagrams/Enterprise/Omero4-1DbDiagram-March2010.pdf	(85614bd: 3 years ago)
 5.5M	components/specification/Documentation/Artwork/Logos/OME/vector/ome-logo.eps	(ca75a3a: 7 months ago)
 5.5M	components/specification/Documentation/Artwork/Logos/Scifio/vector/scifio-logo-bw.eps	(ca75a3a: 7 months ago)
 7.6M	components/specification/OME/src/DataModel/EA Diagrams/OmeroDbDiagrams.eap	(1b2ea4b: 6 years ago)
 7.6M	components/specification/Documentation/Diagrams/Enterprise/OmeroDbDiagrams.eap	(7149def: 4 years, 10 months ago)
 5.5M	components/specification/Documentation/Artwork/Logos/Bio-Formats/vector/bio-formats-logo.eps	(ca75a3a: 7 months ago)
 5.5M	components/specification/Documentation/Artwork/Logos/Scifio/vector/scifio-logo.eps	(ca75a3a: 7 months ago)
 5.5M	components/specification/Documentation/Artwork/Logos/OME/vector/ome-logo-bw.eps	(ca75a3a: 7 months ago)
 7.6M	components/specification/Documentation/Diagrams/Enterprise/OmeroDbDiagrams.eap	(53b6291: 3 years ago)
18.1M	components/specification/Samples/OmeFiles/2011-06/SPIM-modulo-sample.ome.tiff	(dbb91b6: 1 year, 6 months ago)
 7.6M	components/specification/OME/src/DataModel/EA Diagrams/OmeroDbDiagrams.eap	(526ecf7: 6 years ago)
17.5M	components/specification/Samples/OmeFiles/2011-06/SPIM-modulo-sample.ome.tiff	(296fbc8: 1 year, 6 months ago)
 5.5M	components/specification/Documentation/Artwork/Logos/Bio-Formats/vector/bio-formats-logo-bw.eps	(ca75a3a: 7 months ago)
18.2M	components/specification/Samples/OmeFiles/2012-06/SPIM-modulo-sample.ome.tiff	(948dcf0: 6 weeks ago)

Script is here:

#!/usr/bin/env ruby -w
# Usage: big.rb [head] [subpath] [threshold in megabytes]

head, subpath, threshold = ARGV

head ||= 'HEAD'
Megabyte = 1000 ** 2
threshold = (threshold || 0.1).to_f * Megabyte
print threshold

# Restrict the commit walk to a subpath when one is given.
if subpath
    pattern = "git rev-list #{head} -- #{subpath}"
    puts pattern
else
    pattern = "git rev-list #{head}"
end

big_files = {}

# Record every blob at or above the threshold, keyed by its SHA.
IO.popen(pattern, 'r') do |rev_list|
  rev_list.each_line do |commit|
    commit.chomp!
    for object in `git ls-tree -zrl #{commit}`.split("\0")
      bits, type, sha, size, path = object.split(/\s+/, 5)
      size = size.to_i
      big_files[sha] = [path, size, commit] if size >= threshold
    end
  end
end

# Report size, path, and the commit in which each blob was seen.
big_files.each do |sha, (path, size, commit)|
  where = `git show -s #{commit} --format='%h: %cr'`.chomp
  puts "%4.1fM\t%s\t(%s)" % [size.to_f / Megabyte, path, where]
end

comment:14 Changed 11 years ago by ajpatterson

New version pushed to https://github.com/qidane/bioformats/tree/add-spec-and-xsdfu

All files larger than 1.2M have been removed from this history, except for the following, which would not delete:

 3.3M	components/specification/tags/Schema-2010-04-RC1/Working/Work-PostEvolutionWithXsd.EAP	(7f76254: 2 years, 11 months ago)
 2.3M	components/specification/Xml/Working/Work-DataModel.EAP	(1543f96: 5 years ago)
 1.5M	components/specification/Xml/Working/Work-ScreenWell.EAP	(1543f96: 5 years ago)
 1.7M	components/specification/Xml/Working/PostEvolution.EAP	(eba5e80: 6 years ago)

comment:15 Changed 11 years ago by ajpatterson

Problem sorted. It was a sequence of very similar file renames.

New version pushed to https://github.com/qidane/bioformats/tree/add-spec-and-xsdfu

comment:16 Changed 11 years ago by ajpatterson

  • Resolution set to fixed
  • Status changed from new to closed

Generated along with #10436

See: https://github.com/qidane/bioformats/tree/add-spec-and-xsdfu

A PR will be opened after the directory renames (#10439, #10438) and the scripts/build updates (#10441, #10440).

comment:17 Changed 11 years ago by ajpatterson

Another approach found by Josh (in case we need to do this sort of thing again):

http://www.donarmstrong.com/posts/migrating_from_svn_to_git_and_git_annex/
