Monday, November 15, 2010

Notes on Devoxx 1st day

Hadoop by Tom White
It was an interesting presentation on Hadoop, MapReduce, Hive, and Pig. Attending it was pretty much a last-minute change, because I found the Language in Action Lab less interesting than Hadoop. I was not disappointed.

The first half of his presentation was about Hadoop and MapReduce:
  • Hadoop is based on HDFS (Hadoop Distributed File System) and the MapReduce programming model.
  • HDFS is a 100% Java-based file system that handles replication, distribution, and access to the files.
  • HDFS has metadata nodes (name nodes) that store information about the files, and data nodes, the nodes that hold the actual data.
  • So, data in Hadoop are distributed among several nodes, and they are stored in blocks of 64 MB or 128 MB (?).
  • MapReduce is basically a way to derive a set of values from a large dataset. The data are first mapped (and possibly filtered) to a set of key/value pairs on every node. The results are then aggregated and reduced by the reducer.
  • The output of the map phase is sent to the reducers through the sort and shuffle mechanism.
  • A combiner reduces the values produced by the map computation, but instead of doing the reduction on other nodes, it runs on the same node as the mapper. Of course, not every reduction can be done on one node, since each node works independently and must not assume anything about the other nodes.
  • Then Tom showed some Hadoop code: the driver code and the map and reduce code. Interesting stuff (a minimal word-count sketch in the same spirit follows this list).
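Since I'd like to try this myself, here is a minimal word-count job in the spirit of what Tom showed. This is my own sketch against the Hadoop 0.20 "new" API, not his code; the class names and paths are mine:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.length() > 0) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: after the sort/shuffle phase, sum the counts of each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        // Driver: wires the mapper, combiner, and reducer together.
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // same reduction, run on the mapper's node
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The setCombinerClass line is exactly the combiner point above: the same reduction logic runs locally on each mapper's node before the sort and shuffle.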
The second half was about Hive and Pig:
  • Hive is actually an SQL-like query language that runs on top of Hadoop MapReduce.
  • Hive supports some MapReduce-specific extensions to SQL, and it also allows user-defined functions in the query language.
  • MapReduce is considered complex, and Tom showed an example of a system with a complex MapReduce architecture (I'm not sure I really understood the complexity, though).
  • Hive queries are compiled into MapReduce jobs.
  • Hive has a construct called SerDe (serializer/deserializer).
  • Hive has a managed storage area.
  • Hive has interesting support for multi-table inserts (several inserts that share the same FROM clause); see the sketch after this list.
  • Hive is used at Facebook, which launches ~7000 Hive jobs a day (?).
  • Pig is like Hive, but it is not intended to look like SQL.
  • Unlike Hive, Pig has no schema.
  • Pig reminds me of LINQ to Entities in Entity Framework (this is my feeling, not the presenter's).
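To make the multi-table insert point concrete, here is a sketch of what it could look like from Java through the Hive JDBC driver. I haven't run this; the table and column names are made up, and the target tables are assumed to already exist:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveMultiInsert {
        public static void main(String[] args) throws Exception {
            // The HiveServer JDBC driver and its default connection URL.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();

            // One scan of pageviews feeds two target tables: a multi-table
            // insert with a single shared FROM clause.
            stmt.execute(
                  "FROM pageviews pv "
                + "INSERT OVERWRITE TABLE views_by_user SELECT pv.userid, COUNT(1) GROUP BY pv.userid "
                + "INSERT OVERWRITE TABLE views_by_page SELECT pv.pageid, COUNT(1) GROUP BY pv.pageid");

            stmt.close();
            con.close();
        }
    }

The point is that the pageviews table is scanned only once, and the single FROM clause feeds both inserts.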
Overall, it was an interesting presentation. It was scheduled to last three hours but finished after around two and a half. The underrun had a couple of reasons: the audience was not very responsive, maybe because it was the very first session of the conference, and the presenter did not have that much material to present. When the audience is quiet, though, the presenter should probably have a couple of extra things to show.
The demo was not that appealing. To be honest, there wasn't much of a demo, and Tom struggled to run some of his scripts because he had trouble reading the screen.
Other than that, I really think it was a nice presentation to start Devoxx with. I'd like to play around with Hadoop, Hive, and Pig some day.

MongoDB by Alvin Richards

Unlike Hadoop, which I discovered at Devoxx, I had already played around with MongoDB, a document-oriented database. The presentation, however, gave me a rather different view of MongoDB.

Here are some notes:
  • MongoDB is a document-oriented database implemented in C++, with drivers in many languages, including Java, Erlang, Scala, ..., Smalltalk.
  • The presentation started with the role of relational databases and argued that the separation between data and logic was one of their main contributions to database technology. MongoDB continues that philosophy.
  • Using "blog system" as examples, MongoDB -- or document-oriented database in general introduces instead what is called "pre joining the data" or embedding and linking.
  • The Tintin Destination Moon example was used all along the presentation; pretty fun.
  • MongoDB supports automatic id generation that guarantees the uniqueness and immutability of the id.
  • Indexing is supported.
  • Some operations: db.posts.save, db.posts.find, db.posts.ensureIndex({author: 1}), db.system.indexes.find (see the Java sketches after this list).
  • MapReduce is built into MongoDB.
  • GroupBy support.
  • Support of "class" inheritance, example: rectangle, circle. Null is used for attributes that are not applicalble to a class (like radius for rectangle).
  • Quick explanations of OneToMany and ManyToMany relationships. I didn't get them completely in detail, though.
  • Replication support. Alvin explained the algorithms, the voting, and those things.
  • Sharding.
  • Sharding can be done by range and is configurable: the assumption is that the owners of the data know the best sharding strategy.
  • Config servers hold the metadata.
  • mongos does the switching: it routes requests to the right servers.
  • To use MongoDB from Java, one can use the raw MongoDB driver or Morphia (both are sketched after this list).
  • The raw Mongo driver works with a lower-level representation of MongoDB data (JSON-like documents), while Morphia works with classes.
  • Morphia is an annotation-oriented system, just like JPA (hmm........, I find that quite strange).
  • @Id, @Entity, @Transient, ... are examples of Morphia annotations.
  • Relationships are also configurable using annotations.
  • The durability strategy is interesting. You can configure the write to be acknowledged as soon as the master's memory is updated, or only once the disks of the master and the slaves are updated. Strategies in between are possible (memory of the slaves, disk on the master, disk on the slaves, ...).
  • Reads under replication are also configurable. You can read from the master only, or read only the data that has been consistently replicated to all slaves.
  • Sharding is transparent to Java code.
  • Backup strategy: mongodump / mongorestore, fsync + lock.
  • fsync flushes buffers to disk.
  • A slave delay is the recommended strategy to cope with database administration errors.
  • Recommended configuration: memory (should be huge), file system, I/O, ...
  • MongoDB does not support transactions; the only support is atomic writes: either a write happens or it does not.
  • No view support yet, but maybe it's coming.
  • Limited security support (I wonder whether encryption is supported).
  • No attempt at standardisation yet. For me, this can be a showstopper for bigger companies adopting MongoDB, though.
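Since I've played with MongoDB before, here is roughly what the operations above look like with the raw Java driver. A minimal sketch, assuming a 2.x-era driver; the database name, the document fields, and the FSYNC_SAFE choice are mine:

    import java.util.Arrays;

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.DBObject;
    import com.mongodb.Mongo;
    import com.mongodb.WriteConcern;

    public class RawDriverExample {
        public static void main(String[] args) throws Exception {
            Mongo mongo = new Mongo("localhost", 27017);
            DB db = mongo.getDB("blog");

            // Durability is configurable: FSYNC_SAFE waits until the write
            // has been flushed to disk before acknowledging.
            db.setWriteConcern(WriteConcern.FSYNC_SAFE);

            DBCollection posts = db.getCollection("posts");

            // "Pre-joined" data: the comments are embedded in the post document.
            DBObject post = new BasicDBObject("author", "herge")
                    .append("title", "Destination Moon")
                    .append("comments", Arrays.asList(
                            new BasicDBObject("who", "haddock")
                                    .append("text", "Blistering barnacles!")));
            posts.save(post); // the driver generates a unique, immutable _id

            // Index on author, then query -- the shell equivalents are
            // db.posts.ensureIndex({author: 1}) and db.posts.find({author: "herge"}).
            posts.ensureIndex(new BasicDBObject("author", 1));
            DBCursor cursor = posts.find(new BasicDBObject("author", "herge"));
            while (cursor.hasNext()) {
                System.out.println(cursor.next());
            }

            mongo.close();
        }
    }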
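And the Morphia side, which indeed looks a lot like JPA. Again my own sketch, assuming the original Google Code Morphia (com.google.code.morphia); the entity and its fields are made up:

    import org.bson.types.ObjectId;

    import com.google.code.morphia.Datastore;
    import com.google.code.morphia.Morphia;
    import com.google.code.morphia.annotations.Entity;
    import com.google.code.morphia.annotations.Id;
    import com.google.code.morphia.annotations.Transient;
    import com.mongodb.Mongo;

    // A blog post mapped to the "posts" collection.
    @Entity("posts")
    class BlogPost {
        @Id ObjectId id;          // generated by MongoDB, unique and immutable
        String author;
        String title;
        @Transient String cached; // not persisted, just like JPA's @Transient

        BlogPost() {}             // Morphia needs a no-arg constructor

        BlogPost(String author, String title) {
            this.author = author;
            this.title = title;
        }
    }

    public class MorphiaExample {
        public static void main(String[] args) throws Exception {
            Mongo mongo = new Mongo("localhost", 27017);
            Datastore ds = new Morphia().map(BlogPost.class).createDatastore(mongo, "blog");

            ds.save(new BlogPost("herge", "Destination Moon"));

            // Query through the typed API instead of raw DBObjects.
            for (BlogPost post : ds.find(BlogPost.class).field("author").equal("herge").asList()) {
                System.out.println(post.title);
            }
        }
    }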
For me, this was the best presentation of the day. I loved the rich content the presenter delivered. The audience was very responsive, many interesting questions were asked, and the answers were pretty good. Again, it was a nice presentation, and it confirms my intention to play around with MongoDB.

Spring Tools by Christian Dupuis
Compared to the other two, this presentation was much shorter: 30 minutes. What can you do in 30 minutes? Well, not that much. That was the problem with the presentation: the presenter threw in so much content that each item was mentioned for only 2-3 minutes. The demo was not that convincing; well, it was a little bit disappointing.
  • The idea of the presentation was that the Spring tools support frameworks and languages (including Maven and AspectJ).
  • It was nice to see aspect visualisation though.
  • Spring Roo is part of the distribution of STS.
  • STS is actually a specific distribution of Eclipse.
  • SpringInsight is a powerful profiling tool for enterprise applications.
Indeed, the quick SpringInsight presentation was interesting, and from my point of view it saved the session a little.

Apache Mahout by Isabel Drost
Apache Mahout is a tool for machine learning. It is built on Apache Hadoop and came out of Lucene. The presentation was more about machine learning than about code; even the relation to Hadoop was not clear from the presentation. But what was presented was very clear, and that was enough: it is always better to be clear than complete but unclear.

  • For successful machine learning -- especially on text -- you must first remove the noise, then convert the text to vectors.
  • A model is then developed.
  • Next, the prediction model is trained.
  • Finally, the predictions are evaluated.
  • Some applications: recommendation building, spam filtering, ... (see the recommender sketch after this list).
  • Current release: 0.4
  • To get started, one can use Amazon EC2 or just a simple desktop.
  • Berlin Buzzwords 2011, May/June.
  • FOSDEM: data analysis forum.
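The session stayed away from code, so here is my own sketch of the recommendation-building case using Mahout's Taste API for collaborative filtering. The data file name and the user id are made up:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderExample {
        public static void main(String[] args) throws Exception {
            // ratings.csv: lines of "userId,itemId,preference".
            DataModel model = new FileDataModel(new File("ratings.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 5 recommendations for user 42.
            List<RecommendedItem> items = recommender.recommend(42, 5);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " -> " + item.getValue());
            }
        }
    }

This runs fine on a simple desktop; the Hadoop side only comes in when the same kind of computation is distributed over a cluster.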

Groovy BOF
To be honest, I was not sure what I was looking for in this BOF. The idea was just to listen to discussions on a language that runs on the JVM -- just like Scala. And yes, I didn't understand much of it. Oh, yes, I did get the impression that GWT is something quite different and pretty much the opposite of Groovy. Pretty interesting. There were also discussions on Maven support versus the Groovy build system.
Yeah, I should probably have a look at Groovy before all of those discussions can make sense.

jPatterns
Finally, I preferred to leave after the Groovy BOF; I was just too hungry to stay longer. So, no jPatterns.

-

Tomorrow should be exciting. I'll try to get into the Scala Lab, Quickies (Programming in Pain), HBase (yes, that's again a schedule change), J2EE Tooling, Excel in JVM, the Scala BOF, Hazelcast, and possibly Asynchronity in Java Web Services.

