Wednesday, October 14, 2015

data serialization framework misc



https://www.igvita.com/2011/08/01/protocol-buffers-avro-thrift-messagepack/
Protocol Buffers (PB) is the "language of data" at Google. Put simply, Protocol Buffers are used for serialization, RPC, and about everything in between.
Initially developed in the early 2000s as an optimized server request/response protocol (hence the name), they have become the de facto data persistence format and RPC protocol.
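For a feel of what a PB schema looks like, here is a minimal .proto definition (the message and field names are just illustrative); the protoc compiler generates the serialization classes for each target language from it:

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}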

By comparison, Thrift, which was open-sourced by Facebook in late 2007, looks and feels very similar to Protocol Buffers - in all likelihood, there was some design influence from PB there. However, unlike PB, Thrift makes RPC a first-class citizen: the Thrift compiler provides a variety of transport options (network, file, memory), and it also tries to target many more languages.
While Thrift and PB differ primarily in their scope, Avro and MessagePack should really be compared in light of more recent trends: the rising popularity of dynamic languages, and of JSON over XML. As almost every web developer knows, JSON is now ubiquitous, and easy to parse, generate, and read, which explains its popularity. JSON also requires no schema, provides no type checking, and it is a UTF-8 based protocol - in other words, easy to work with, but not very efficient when put on the wire.
MessagePack is effectively JSON, but with an efficient binary encoding. Like JSON, there is no type checking and there are no schemas, which depending on your application can be either a pro or a con. But if you are already streaming JSON via an API or using it for storage, then MessagePack can be a drop-in replacement.
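To make "efficient binary encoding" concrete, here is how the 8-byte JSON document {"id":1} maps to 5 MessagePack bytes (encoding per the MessagePack format spec):

0x81            fixmap with 1 entry
0xa2 0x69 0x64  fixstr of length 2: "id"
0x01            positive fixint: 1

Same data, five bytes instead of eight - and the savings grow with field count and numeric values, since numbers are stored in binary rather than as decimal text.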

Avro, on the other hand, is somewhat of a hybrid. In its scope and functionality it is close to PB and Thrift, but it was designed with dynamic languages in mind. Unlike PB and Thrift, the Avro schema is embedded directly in the header of the messages, which eliminates the need for the extra compile stage. Additionally, the schema itself is just a JSON blob - no custom parser required! By enforcing a schema, Avro allows us to do data projections (read individual fields out of each record), perform type checking, and enforce the overall message structure.
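Since an Avro schema is plain JSON, a minimal record schema might look like this (the record and field names are made up for illustration):

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"}
  ]
}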

kryo serialization
https://github.com/EsotericSoftware/kryo#quickstart
http://techstickynotes.blogspot.com/2012/02/simple-examples-using-kryo-for.html
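Based on the quickstart above, a minimal Kryo round trip translated to Scala might look like this (the class and file names follow the quickstart's example):

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}
import java.io.{FileInputStream, FileOutputStream}

class SomeClass { var value: String = _ }

val kryo = new Kryo()
kryo.register(classOf[SomeClass])   // registration avoids writing full class names

val obj = new SomeClass
obj.value = "Hello Kryo!"

val output = new Output(new FileOutputStream("file.bin"))
kryo.writeObject(output, obj)       // writeObject: the type is known at read time
output.close()

val input = new Input(new FileInputStream("file.bin"))
val copy = kryo.readObject(input, classOf[SomeClass])
input.close()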

https://ogirardot.wordpress.com/2015/01/09/changing-sparks-default-java-serialization-to-kryo/
Apache Spark’s default serialization relies on Java, with the default readObject(…) and writeObject(…) methods for all Serializable classes. This is a perfectly fine default behavior as long as you don’t rely on it too much…
Why? Because Java’s serialization framework is notoriously inefficient, consuming too much CPU and RAM and producing output too large for it to be a suitable large-scale serialization format.
Apache Spark as a system relies on it a lot:
  • Every task sent from the driver to a worker gets serialized: closure serialization
  • Every result from every task gets serialized at some point: result serialization
And what’s implied is that during closure serialization, all the values used inside the closure get serialized as well. For the record, this is also one of the main reasons to use broadcast variables when closures might otherwise get serialized with big values (see the sketch below).
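A hedged sketch of that broadcast pattern (the rdd, the SparkContext sc, and the lookup-table loader are assumed for illustration):

// BAD: bigLookup is captured by the closure and reserialized with every task
val bigLookup: Map[String, Int] = loadLookupTable()   // hypothetical loader
rdd.map(word => bigLookup.getOrElse(word, 0))

// BETTER: ship bigLookup to each executor once as a broadcast variable
val bigLookupBc = sc.broadcast(bigLookup)
rdd.map(word => bigLookupBc.value.getOrElse(word, 0))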
All you need to do is specify which serializer you want to use when you define your SparkContext, using SparkConf like this:
val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // 24 MB of serialization buffer instead of the 0.064 MB default
      .set("spark.kryoserializer.buffer.mb", "24")

One of the advantages of Protocol Buffers is that it can exchange data among C, C++, Python, and Java.
Unfortunately, you cannot really use it for arbitrary POJOs, because you (a) need schemas for every type and (b) must code-generate the objects you use.
Protocol Buffers also aren't designed for possibly-circular references or for object sharing.
Hadoop Data Serialization Battle: Avro, Protocol Buffers, Thrift
http://vanillajava.blogspot.co.uk/2011/10/serialization-using-bytebuffer-and.html
https://github.com/eishay/jvm-serializers/wiki


https://open.bekk.no/efficient-java-serialization
Standard Java serialization is inefficient both in terms of speed and size.
http://dataworld.blog.com/2013/05/06/hadoop-serialization-framework-ref/

  • Hadoop’s Writable-based serialization framework provides more efficient and customizable serialization and representation of data for MapReduce programs than Java’s general-purpose native serialization framework.
  • As opposed to Java serialization, Hadoop’s Writable framework does not write the type name with each object; it expects all clients of the serialized data to know the types used in it. Omitting the type names makes serialization faster and results in compact, randomly accessible serialized data formats that can easily be interpreted by non-Java clients.
  • Hadoop’s Writable-based serialization can also reduce object-creation overhead by reusing Writable objects, which is not possible with Java’s native serialization framework (see the sketch after this list).
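A minimal custom Writable sketch in Scala (the class and field names are invented for illustration); note that write()/readFields() emit only the raw field values - no type name - and that the same instance can be refilled via readFields() instead of allocating a new object per record:

import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.Writable

class PointWritable extends Writable {
  var x: Int = 0
  var y: Int = 0

  // Serialize raw field values only -- no class name is written
  override def write(out: DataOutput): Unit = {
    out.writeInt(x)
    out.writeInt(y)
  }

  // Deserialize into this (possibly reused) instance
  override def readFields(in: DataInput): Unit = {
    x = in.readInt()
    y = in.readInt()
  }
}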
http://stackoverflow.com/questions/14257505/data-serialization-framework
Why can't we use the Java serialization APIs instead of these separate frameworks? Are there any flaws in the Java serialization APIs?
I would assume you can use Java Serialization unless you know otherwise.
The main reasons not to use it are
  • you know there is a performance problem.
  • you need to exchange data across languages. Java Serialization is only for Java.
In short, unless you know you can't, try Java Serialization first.
1. The problem with Java serialization is that it is not agnostic of your code, meaning it is tightly coupled to the structure of your classes. Other serialization frameworks give you some flexibility/control that is useful for getting around this kind of situation. Even though the standard Java mechanism offers a way to control serialization through the writeObject/readObject methods, it is a problem that other frameworks have addressed in a more elegant way.
Second, you cannot exchange the output of Java serialization with other languages and platforms.
Last, but not least: Java serialization does not produce the most compact result possible, which can lead to performance degradation if you do things like transfer data over a network. Other protocols (like Oracle's POF or Protocol Buffers) are more optimized and produce smaller output.
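To see the size overhead for yourself, a small Scala sketch (exact byte counts vary by JVM and class layout, so treat the printed number as indicative):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

case class User(id: Long, name: String)   // case classes are Serializable by default

val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(User(1L, "ada"))
oos.close()
// Prints far more than the ~11 bytes of raw payload (8-byte long + 3 chars),
// because the stream header and full class descriptor travel with the data.
println(s"Java serialization size: ${bos.size()} bytes")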
2. Regarding your second question, I guess what that means is that you don't need to run a precompile job that generates code whenever the structure of your serialized classes changes. I personally hate frameworks that force some kind of compile-time code generation; I hate the hassle of even having to look at generated code, but that is just me and my OCD.

