Investigation of Alternative Storage Mechanisms
We have looked at different mechanisms for storing and querying ROI. These can be grouped into several categories:
- Document Based Databases
- MongoDB
- CouchDB http://couchdb.apache.org/
- Graph Databases
- Neo4J http://neo4j.org/
- InfoGrid http://infogrid.org/
- BigTable implementations
- Cassandra http://incubator.apache.org/cassandra/
- Hypertable http://www.hypertable.org/
- HBase http://hadoop.apache.org/hbase/
- Key value DB
- Berkeley Database http://www.oracle.com/technology/products/berkeley-db/index.html
- Tokyo Cabinet http://1978th.net/tokyocabinet/
- Project Voldemort http://project-voldemort.com/
- Other
Other groups are also evaluating the same mechanisms http://blog.boxedice.com/2009/07/25/choosing-a-non-relational-database-why-we-migrated-from-mysql-to-mongodb/
The current investigation is still in progress. It has mainly focused on MongoDB, Cassandra and, more recently, Neo4J.
Document Databases
A document database is part of the NoSQL movement. Data is stored as a document in the database; there is no concept of a schema, and any type of information may be stored. The database contains a series of collections into which any document may be inserted. In contrast to RDBs, there are no tables, no joins and no transactions. Each operation on a document is atomic, which fits the Map-Reduce http://en.wikipedia.org/wiki/MapReduce paradigm used by many of the distributed database systems.
Each document can contain sub-components which can be accessed individually; one could think of a single document as a row in a non-normalised RDB.
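To make the idea of individually accessible sub-components concrete, here is a minimal sketch in plain Python. It assumes an ROI held as one nested document, with field names borrowed from the MongoDB example on this page; the get_path helper is hypothetical and only illustrates dotted-path access, it is not part of any database API.

```python
# A hypothetical ROI held as one nested document (illustrative values only).
roi = {
    "id": 565,
    "type": "Roi",
    "shapes": [
        {"id": 3, "type": "Ellipse", "cx": 3, "cy": 75, "rx": 17, "ry": 17},
        {"id": 5, "type": "Ellipse", "cx": 45, "cy": 82, "rx": 10, "ry": 16},
    ],
}


def get_path(document, path):
    """Follow a dotted path such as 'shapes.1.rx' through dicts and lists."""
    current = document
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # numeric parts index into lists
        else:
            current = current[part]       # other parts are dictionary keys
    return current


print(get_path(roi, "type"))         # top-level attribute of the ROI
print(get_path(roi, "shapes.1.rx"))  # attribute of the second shape
```

The whole ROI, including its shapes, travels as one unit, which is what makes the comparison with a row in a non-normalised RDB reasonable.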
MongoDB
PROS
- It has bindings to numerous languages (C++, C#, Java, Python, ...).
- Allows storage, indexing, linking of any user data.
- Annotations are now very easy and efficient.
- Has mechanisms for schema upgrade.
- Dynamic Queries.
- Replication.
- Sharding.
- Map-Reduce framework.
- Fast.
- GridFS is a distributed file storage mechanism within Mongo.
CONS
- Schemaless, data integrity will need to be worked on.
- Graph structures not inherently supported.
DEPLOYMENTS
- SourceForge http://sourceforge.net/
- BusinessInsider http://www.businessinsider.com/
- New York Times http://www.nytimes.com/
- Disqus http://www.disqus.com/
SourceForge has released a Python wrapper for pyMongo called Ming; it supports data integrity, data structure upgrade paths and simpler querying.
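Because the database itself is schemaless, integrity checks of this kind have to live in the application. The following is a rough plain-Python sketch of what such a check could look like for ROI documents. It is not Ming's API; the schema dictionaries, field choices and function names are all assumptions made for illustration.

```python
# Hypothetical schemas: required field name -> expected Python type.
# This is NOT Ming's API, only a plain-Python illustration of the idea.
SHAPE_SCHEMA = {"id": int, "type": str, "z": int, "t": int}
ROI_SCHEMA = {"id": int, "type": str, "shapes": list}


def validation_errors(document, schema):
    """Return a list of problems; an empty list means the document passed."""
    errors = []
    for field, expected in schema.items():
        if field not in document:
            errors.append("missing field: " + field)
        elif not isinstance(document[field], expected):
            errors.append("wrong type for field: " + field)
    return errors


def validate_roi(roi):
    """Check the ROI itself, then each shape it contains."""
    errors = validation_errors(roi, ROI_SCHEMA)
    shapes = roi.get("shapes")
    if isinstance(shapes, list):
        for index, shape in enumerate(shapes):
            for error in validation_errors(shape, SHAPE_SCHEMA):
                errors.append("shapes[%d]: %s" % (index, error))
    return errors
```

A document would be checked with validate_roi before being handed to the insert call, and rejected if the returned list is not empty.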
MongoDB uses JSON to insert documents into its database. It allows all elements of the data structure to be queried.
Using pyMongo to insert an ROI with two Ellipse shapes
from pymongo import Connection  # pymongo 1.x/2.x entry point; later releases use MongoClient

connection = Connection()
db = connection['databaseName']
collection = db['collectionName']
# Python uses None where JSON uses null.
collection.insert({"tags" : [ ], "label" : None, "shapes" : [{
        "tags" : [{"tag" : "foo", "namespace" : "bob"}],
        "rx" : 17,
        "ry" : 17,
        "label" : None,
        "cy" : 75,
        "cx" : 3,
        "t" : 0,
        "z" : 0,
        "type" : "Ellipse",
        "id" : 3
    },
    {
        "tags" : [{"tag" : "foo", "namespace" : "bob"}],
        "rx" : 10,
        "ry" : 16,
        "label" : None,
        "cy" : 82,
        "cx" : 45,
        "t" : 0,
        "z" : 0,
        "type" : "Ellipse",
        "id" : 5
    }], "type" : "Roi", "id" : 565 })
MongoDB allows you to query on any part of the data structure and to retrieve only those portions of the data that are of interest. In the above example, shapes may be queried while only attributes of the ROI are requested.
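Querying on a shape while returning only ROI-level attributes could be sketched as follows. The function name and the choice of returned fields are assumptions for illustration; the only pyMongo behaviour relied on is that the second argument to find() selects which fields come back.

```python
def find_roi_attributes(collection, shape_id):
    """Query on a shape id but return only top-level ROI attributes.

    `collection` is assumed to be a pyMongo collection. The second argument
    to find() selects the returned fields (1 = include, 0 = exclude).
    """
    query = {"shapes.id": shape_id}
    fields = {"id": 1, "type": 1, "label": 1, "_id": 0}  # "shapes" is left out
    return collection.find(query, fields)
```

Calling find_roi_attributes(collection, 3) on the collection used in the insert example would match the ROI through one of its shapes while leaving the shapes array out of the returned documents.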
Example Query - Find ROI with id 2 and shapes with id 3
from pymongo import Connection  # pymongo 1.x/2.x entry point; later releases use MongoClient

connection = Connection()
db = connection['databaseName']
collection = db['collectionName']
collection.find({"id":2,"shapes.id":3})
MongoDB allows you to search using regular expressions.
Example Query - Find all Shapes with tag containing mitosis
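import re
from pymongo import Connection  # pymongo 1.x/2.x entry point; later releases use MongoClient

connection = Connection()
db = connection['databaseName']
collection = db['collectionName']
# A plain string would be matched literally, so pass a compiled, case-insensitive pattern.
collection.find({"shapes.tags.tag": re.compile("mitosis", re.IGNORECASE)})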
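To see which tag values a case-insensitive "mitosis" pattern is meant to select, the matching can be tried locally with Python's re module. The tag values below are made up for illustration, and the snippet only demonstrates the pattern, not MongoDB's own regular expression engine.

```python
import re

# The pattern the query above is aiming for: "mitosis" anywhere, any case.
# With search(), the leading and trailing ".*" are not needed.
pattern = re.compile("mitosis", re.IGNORECASE)

# Hypothetical tag values, for illustration only.
tags = ["Mitosis-early", "post-mitosis", "interphase", "MITOSIS"]

matching = [tag for tag in tags if pattern.search(tag)]
print(matching)  # ['Mitosis-early', 'post-mitosis', 'MITOSIS']
```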
CouchDB
PROS
- Allows storage, indexing, linking of any user data.
- Annotations are now very easy and efficient.
- Has mechanisms for schema upgrade.
- Is ACID.
CONS
- Schemaless, data integrity will need to be worked on.
- Graph structures not inherently supported.
- Does not support sharding
- No replication
- No Dynamic Queries
DEPLOYMENTS
- BBC http://www.erlang-factory.com/conference/London2009/speakers/endafarrell
- Assay Depot http://www.assaydepot.com/
Other viewpoints
http://nosql.mypopescu.com/post/298557551/couchdb-vs-mongodb
Graph Databases
Graph databases are another solution for storing structured data in a non-relational database. Commonly these databases use nodes to represent objects and user-defined relationships to link the nodes. So far we have looked at two of these databases, Neo4J and InfoGrid.
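The node-and-relationship model can be sketched in a few lines of plain Python. This is a toy in-memory illustration, not the API of Neo4J or InfoGrid; the class names, the traverse function and the image names are all assumptions.

```python
from collections import deque


class Node:
    def __init__(self, **properties):
        self.properties = properties
        self.outgoing = []  # Relationship objects that start at this node


class Relationship:
    def __init__(self, start, end, rel_type, **properties):
        self.start, self.end = start, end
        self.rel_type = rel_type
        self.properties = properties
        start.outgoing.append(self)  # register on the start node


def traverse(start, rel_type):
    """Breadth-first walk along outgoing relationships of one type.

    Yields (relationship, node) pairs and never yields the start node.
    """
    seen = {id(start)}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for rel in node.outgoing:
            if rel.rel_type == rel_type and id(rel.end) not in seen:
                seen.add(id(rel.end))
                queue.append(rel.end)
                yield rel, rel.end


image = Node(name="raw")
deconvolved = Node(name="raw_decon")
cropped = Node(name="raw_decon_crop")
Relationship(image, deconvolved, "DERIVE", type="Deconvolution")
Relationship(deconvolved, cropped, "DERIVE", type="ROI", operation="crop")
Relationship(image, cropped, "ASSOCIATE")  # ignored when following DERIVE

for rel, node in traverse(image, "DERIVE"):
    print(node.properties["name"], "via", rel.properties["type"])
```

Both nodes and relationships carry their own properties, so information about how an image was derived sits on the link rather than on either image.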
Neo4J
Neo4J is a graph database written in Java, with a native inference engine underneath. A good tutorial on Neo4J is http://www.slideshare.net/nguyenandan/basic-neo4j-code-examples-2008-05-08-2714356.
PROS
- Handles graph structures nicely
- Transactional
- Supported by Gremlin
- Native RDF http://components.neo4j.org/neo-rdf-sail/
CONS
- No C++ language binding.
- Not distributed.
- Tables are not so easily modelled.
DEPLOYMENTS
- The Swedish Defence forces http://www.mil.se
- Windh Technologies http://www.windh.com
- Flextoll http://www.flextoll.se
Define a set of relationships
public enum OMERORelations implements RelationshipType { ASSOCIATE, DERIVE, AGGREGATE, COMPOSE }
Creating an image-to-image link in Neo4J
Node image = neo.createNode();
image.setProperty("IObject",imageI);
image.setProperty("id",imageI.getId().getValue());
image.setProperty("name",imageI.getName().getValue());
Node derivedImage = neo.createNode();
derivedImage.setProperty("IObject",derivedImageI);
derivedImage.setProperty("id",derivedImageI.getId().getValue());
derivedImage.setProperty("name",derivedImageI.getName().getValue());
Relationship relationship = image.createRelationshipTo( derivedImage, OMERORelations.DERIVE );
relationship.setProperty("type","Deconvolution");
Creating an ROI-to-image link in Neo4J
Node image = neo.createNode();
image.setProperty("IObject",imageI);
image.setProperty("id",imageI.getId().getValue());
image.setProperty("name",imageI.getName().getValue());
Node derivedImage = neo.createNode();
derivedImage.setProperty("IObject",derivedImageI);
derivedImage.setProperty("id",derivedImageI.getId().getValue());
derivedImage.setProperty("name",derivedImageI.getName().getValue());
Relationship relationship = image.createRelationshipTo( derivedImage, OMERORelations.DERIVE );
relationship.setProperty("type","ROI");
relationship.setProperty("operation","crop");
relationship.setProperty("roi",cropRoiI);
Example of Graph Traversal
Traverser derivedTraverser = imageNode.traverse(
    Traverser.Order.BREADTH_FIRST,
    StopEvaluator.END_OF_NETWORK,
    ReturnableEvaluator.ALL_BUT_START_NODE,
    OMERORelations.DERIVE,
    Direction.OUTGOING );
System.out.println( "Derived Images:" );
for ( Node derived : derivedTraverser )
{
    // "type" and "operation" were set on the relationship, not on the node.
    Relationship rel = derivedTraverser.currentPosition().lastRelationshipTraversed();
    String type = (String) rel.getProperty( "type" );
    // Compare strings with equals(); == only compares object references.
    if ( "Deconvolution".equals( type ) )
        System.out.println( derived.getProperty( "name" ) + " was deconvolved from " + imageNode.getProperty( "name" ) );
    if ( "ROI".equals( type ) )
        System.out.println( derived.getProperty( "name" ) + " was " + rel.getProperty( "operation" ) + " from " + imageNode.getProperty( "name" ) );
}
BigTable Implementations
BigTable is a storage method popularised by Google, described as "a sparse, distributed multi-dimensional sorted map". It can be thought of as a variant of a ColumnDB. The storage solution allows the insertion of data where the columns can vary between rows.
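The "sparse, multi-dimensional sorted map" description can be illustrated with a toy single-machine model in Python. This sketch ignores distribution entirely and is not the API of HBase, Cassandra or Hypertable; the class, its method names and the row and column keys are assumptions.

```python
import bisect


class SparseTable:
    """Toy single-machine model of a (row, column, timestamp) -> value map."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, timestamp) tuples
        self._values = {}  # key tuple -> value

    def put(self, row, column, timestamp, value):
        key = (row, column, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep the keys sorted
        self._values[key] = value

    def row(self, row):
        """Return {column: latest value} for one row; unset columns are absent."""
        result = {}
        start = bisect.bisect_left(self._keys, (row,))
        for key in self._keys[start:]:
            if key[0] != row:
                break
            # Keys are sorted by timestamp within a column, so later wins.
            result[key[1]] = self._values[key]
        return result


table = SparseTable()
table.put("roi:565", "shape:type", 1, "Ellipse")
table.put("roi:565", "shape:rx", 1, 17)
table.put("roi:565", "shape:rx", 2, 18)  # newer version of the same cell
table.put("image:1", "name", 1, "raw")   # a row with entirely different columns

print(table.row("roi:565"))
print(table.row("image:1"))
```

Each row stores only the columns that were actually written, which is what allows the columns to vary between rows.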
Currently we have only investigated HBase, Cassandra and Hypertable, and for this investigation have focused on Cassandra, a project submitted to Apache by Facebook.
As the table solutions of BigTable are effectively complex key/value stores, a sophisticated toolset is needed to get the most out of them. For instance, Google has created Sawzall to query its system, and Digg has created LazyBoy to work with Cassandra.
BigTable implementations, like the document-oriented databases, do not allow joins. GQL, Google's variant of SQL for BigTable, does not support joins either.
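Without joins in the store, combining related records becomes the application's job. A minimal sketch of such an application-side join, using plain Python dictionaries as stand-ins for two key/value tables, might look like this. The table contents and the function name are hypothetical.

```python
# Two hypothetical "tables" held as plain key/value maps.
images = {1: {"name": "raw"}, 2: {"name": "raw_decon"}}
rois = {
    565: {"image_id": 1, "shapes": 2},
    566: {"image_id": 2, "shapes": 1},
    567: {"image_id": 1, "shapes": 4},
}


def rois_with_image_name(rois, images):
    """Application-side join: one key lookup into `images` per ROI."""
    joined = []
    for roi_id, roi in sorted(rois.items()):
        image = images[roi["image_id"]]
        joined.append({
            "roi_id": roi_id,
            "image_name": image["name"],
            "shapes": roi["shapes"],
        })
    return joined


for entry in rois_with_image_name(rois, images):
    print(entry["roi_id"], entry["image_name"], entry["shapes"])
```

The alternative is to denormalise, storing the image name alongside each ROI so that a single read returns everything, at the cost of keeping the copies consistent.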