Saturday, September 19, 2015

Thrift Misc




https://bravenewgeek.com/thrift-on-steroids-a-tale-of-scale-and-abstraction/

Apache Thrift is an RPC framework developed at Facebook for building “scalable cross-language services.” It consists of an interface definition language (IDL), communication protocol, API libraries, and a code generator that allows you to build and evolve services independently and in a polyglot fashion across a wide range of languages. 


By and large, Facebook’s engineering culture has tended towards choosing the best tools and implementations available over standardizing on any one programming language and begrudgingly accepting its inherent limitations.


Since the number of service calls grows rapidly, it is necessary to maintain a well-defined interface for every call. We knew we wanted to use an IDL for managing this interface, and we ultimately decided on Thrift. Thrift forces service owners to publish strict interface definitions, which streamlines the process of integrating with services. Calls that do not abide by the interface are rejected at the Thrift level instead of leaking into a service and failing deeper within the code. This strategy of publicly declaring your interface emphasizes the importance of backwards compatibility, since multiple versions of a service’s Thrift interface could be in use at any given time. The service author must not make breaking changes, and instead must only make non-breaking additions to the interface definition until all consumers are ready for deprecation.
We settled on Thrift due to its maturity and wide use in production, its performance, its architecture (it separates out the transports, protocols, and RPC layer, with the first two being pluggable), its rich feature set, and its wide range of language support.

In addition to RPC, we wanted to promote a more asynchronous, message-passing style of communication with pub/sub. This would allow for greater flexibility in messaging patterns like fan-out and fan-in, interest-based messaging, and reduced coupling and fragility of services. This enables things like the worker pattern, where we can distribute work to a pool of workers and scale that pool independently.


Unfortunately, Thrift doesn’t provide any kind of support for pub/sub, and we wanted the same guarantees for it that we had with RPC, like type safety and versioning with code-generated APIs and service contracts. Aside from this, Thrift has a number of other, more glaring problems:
  • Head-of-line blocking: a single, slow request will block any subsequent requests for a client.
  • Out-of-order responses: an out-of-order response puts a Thrift transport in a bad state, requiring it to be torn down and reestablished, e.g. if a slow request times out at the client, the client issues a subsequent request, and a response comes back for the first request, the client blows up.
  • Concurrency: a Thrift client cannot be shared between multiple threads of execution, requiring each thread to have its own client issuing requests sequentially. This, combined with head-of-line blocking, is a major performance killer. This problem is compounded when each transport has its own resources, such as a socket.
  • RPC timeouts: Thrift does not provide good facilities for per-request timeouts, instead opting for a global transport read timeout.
  • Request headers: Thrift does not provide support for request metadata, making it difficult to implement things like authentication/authorization and distributed tracing. Instead, you are required to bake these things into your IDL or in a wrapped transport. The problem with this is it puts the onus on service providers rather than allowing an API gateway or middleware to perform these functions in a centralized way.
  • Middleware: Thrift does not have any support for client or server middleware. This means clients must be wrapped to implement interceptor logic and middleware code must be duplicated within handler functions. This makes it impossible to implement AOP-style logic in a clean, DRY way (a wrapper sketch follows below).
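Working around that middleware gap usually means hand-rolling a wrapper around the generated client. A rough Java sketch, using the CrawlingService/Item types generated in the example later in this post (the wrapper class and the timing log are illustrative only, not from the article):

import java.util.List;

import org.apache.thrift.TException;

public class LoggingCrawlingClient implements CrawlingService.Iface {
    private final CrawlingService.Iface delegate;

    public LoggingCrawlingClient(CrawlingService.Iface delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(List<Item> items) throws TException {
        long start = System.nanoTime();
        try {
            delegate.write(items); // the actual RPC call
        } finally {
            System.out.printf("write(%d items) took %d us%n",
                    items.size(), (System.nanoTime() - start) / 1000);
        }
    }
}

Every interceptor concern (auth, tracing, metrics) needs the same kind of hand-written delegation for every method of every service, which is exactly the duplication the article complains about.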
RPC is much more mature, and the notion of a “service mesh” has taken the container world by storm with things like Istio, Linkerd, and Envoy.
http://stackoverflow.com/questions/9732381/why-thrift-why-not-http-rpcjsongzip
  1. Thrift generates the client and server code completely, including the data structures you are passing, so you don't have to deal with anything other than writing the handlers and invoking the client. Everything, including parameters and return values, is automatically validated and parsed, so you get sanity checks on your data for free.
  2. Thrift is more compact than HTTP, and can easily be extended to support things like encryption, compression, non-blocking I/O, etc.
  3. Thrift can be set up to use HTTP and JSON pretty easily if you want it (say if your client is somewhere on the internet and needs to pass firewalls)
  4. Thrift supports persistent connections and avoids the repeated TCP (and HTTP) handshakes that plain HTTP incurs (see the sketch after this list).
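A sketch of point 4, assuming a generated client for a hypothetical ExampleService with a doWork method (both names are made up for illustration): the transport is opened once and reused for many calls.

import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class PersistentConnectionExample {
    public static void main(String[] args) throws TException {
        // Open the connection once and reuse it for many calls,
        // instead of paying a fresh TCP handshake per request.
        TTransport transport = new TSocket("localhost", 9090);
        transport.open();
        ExampleService.Client client =
                new ExampleService.Client(new TBinaryProtocol(transport));
        for (int i = 0; i < 100; i++) {
            client.doWork(i); // hypothetical RPC defined in the hypothetical IDL
        }
        transport.close();
    }
}

Note that, because of the head-of-line blocking and concurrency issues listed earlier, such a shared connection still serves one request at a time per client.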
Personally, I use thrift for internal LAN RPC and HTTP when I need connections from outside.
The strongest Thrift advantages are convenient, interoperable RPC invocations and convenient handling of binary data.
https://www.quora.com/What-is-the-advantage-of-using-Thrift-as-opposed-to-exposing-an-HTTP-REST-API
Advantages of Thrift:

  • Thrift generates both the server and client interfaces for a given service, and in a consistent manner.  Client calls will be more consistent (and have rudimentary type/structure checking), and generally be less error prone.
  • Related to above: Thrift's RPC-like behavior means that you get type safety, exceptions are passed and handled in a sane manner, etc - you're not reinventing the wheel.
  • Thrift supports various protocols, not just HTTP.  If you are dealing with large volumes of service calls, or have bandwidth requirements, the client/server can transparently switch to more efficient transports (such as one of the binary transports).
  • Thrift is a mature piece of software; well tested and used.

Disadvantages:

  • Thrift is poorly documented.
  • It is more work to get started on the client side, when the clients are directly building the calling code.  It's less work for the service owner if they are building libraries for clients.
  • Yet another dependency.
If you are providing a simple service & API, Thrift is probably not the right tool.

Advantage: Thrift's encoding is binary.  If you're sending a little bit of data, it doesn't matter.  If you're doing a data-heavy API, you're wasting a lot of cycles and bandwidth to, for example, convert numbers from binary into ASCII strings and then parsing them back into binary.
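A rough sketch of that size difference using libthrift's in-memory transport (TMemoryBuffer plus TBinaryProtocol; the specific number is just an example):

import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TMemoryBuffer;

public class EncodingSizeDemo {
    public static void main(String[] args) throws TException {
        TMemoryBuffer buffer = new TMemoryBuffer(32);
        TBinaryProtocol protocol = new TBinaryProtocol(buffer);

        long value = 1234567890123L;
        protocol.writeI64(value); // binary protocol: a fixed 8 bytes on the wire

        System.out.println("binary encoding: " + buffer.length() + " bytes");
        System.out.println("ASCII encoding:  " + Long.toString(value).length() + " bytes");
    }
}

The binary form is 8 bytes regardless of magnitude, while the decimal string here is 13 characters plus the parsing work on both ends.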

Disadvantage: Hard to debug
https://en.wikipedia.org/wiki/Apache_Thrift
  • Cross-language serialization with lower overhead than alternatives such as SOAP due to use of binary format
  • A lean and clean library. No framework to code. No XML configuration files.
  • The language bindings feel natural. For example, Java uses ArrayList<String>, C++ uses std::vector<std::string>.
  • The application-level wire format and the serialization-level wire format are cleanly separated. They can be modified independently.
  • The predefined serialization styles include: binary, HTTP-friendly and compact binary.
  • Doubles as cross-language file serialization.
  • Soft versioning of the protocol. Thrift does not require a centralized and explicit mechanism like major-version/minor-version. Loosely coupled teams can freely evolve RPC calls.
  • No build dependencies or non-standard software. No mix of incompatible software licenses.

https://www.quora.com/How-does-Facebook-use-Apache-Thrift
I don't work on the Graph API, but I presume one of the biggest reasons why it doesn't use Thrift is just to keep the API as simple/flexible/lightweight as possible. HTTP+JSON is arguably much simpler to debug than most Thrift protocols.

Additionally, it makes it very easy for most developers to understand what's happening when they interact with the API. One important aspect of Thrift is that it attempts to abstract away almost everything that happens below the surface of the client layer; so, when something goes wrong, it can be much harder to figure out where the problem is and how to fix it.

Finally, HTTP and JSON libraries are pretty commonly found in many different languages and frameworks out there already (and they're widely used and well known/supported). You won't find Thrift libraries included with most core/standard libraries.
What is an RPC framework, and what is Apache Thrift?
An RPC framework in general is a set of tools that enable the programmer to call a piece of code in a remote process, be it on a different machine or just another process on the same machine.
In the particular case of Apache Thrift, we talk about a framework designed to be efficient, and available across both OS platforms and programming languages. Additionally, you have some flexibility regarding transports (such as sockets, pipes, etc) and protocols (binary, JSON, even compressed), plus some more options like SSL or SASL support.
The code for both server and client is generated from a Thrift IDL file. To get it running, you basically have to add only the intended program logic and put all the pieces together.

http://blog.zhengdong.me/2012/05/10/hello-world-by-thrift-using-java
brew install thrift
namespace java me.zhengdong.thrift

struct Item {
  1: i64 id,
  2: string content,
}

service CrawlingService {
  void write(1:list<Item> items),
}
thrift -out . --gen java item.thrift
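The server code below references a CrawlingHandler, which is the hand-written implementation of the generated CrawlingService.Iface. A minimal sketch of what it might look like (the getter names follow what the Java generator typically produces; the body is illustrative only):

import java.util.List;

import org.apache.thrift.TException;

public class CrawlingHandler implements CrawlingService.Iface {
    @Override
    public void write(List<Item> items) throws TException {
        // The real program logic goes here; just print what arrived for illustration.
        for (Item item : items) {
            System.out.println(item.getId() + ": " + item.getContent());
        }
    }
}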
import org.apache.thrift.server.TServer;
import org.apache.thrift.server.TThreadPoolServer;
import org.apache.thrift.transport.TServerSocket;
import org.apache.thrift.transport.TTransportException;

public class Server {
    private void start() {
        try {
            // Set port
            TServerSocket serverTransport = new TServerSocket(7911);
            // Set the CrawlingHandler we defined before on the processor,
            // which handles RPC calls. Remember, one service per server.
            CrawlingHandler handler = new CrawlingHandler();
            CrawlingService.Processor<CrawlingService.Iface> processor =
                    new CrawlingService.Processor<CrawlingService.Iface>(handler);
            TServer server = new TThreadPoolServer(
                    new TThreadPoolServer.Args(serverTransport).processor(processor));
            System.out.println("Starting server on port 7911 ...");
            server.serve();
        } catch (TTransportException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        Server server = new Server();
        server.start();
    }
}

import java.util.List;

import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;

public class Client {
    public void write(List<Item> items) {
        TTransport transport;
        try {
            transport = new TSocket("localhost", 7911);
            transport.open();
            TProtocol protocol = new TBinaryProtocol(transport);
            CrawlingService.Client client = new CrawlingService.Client(protocol);
            client.write(items);
            transport.close();
        } catch (TTransportException e) {
            e.printStackTrace();
        } catch (TException e) {
            e.printStackTrace();
        }
    }
}

http://thrift.apache.org/tutorial/
Writing a .thrift file
After the Thrift compiler is installed you will need to create a .thrift file. This file is an interface definition made up of Thrift types and services. The services you define in this file are implemented by the server and are called by any clients.

Generate Thrift file to source code
thrift --gen <language> <Thrift filename>
To recursively generate source code from a Thrift file and all other Thrift files included by it, run

thrift -r --gen <language> <Thrift filename>

   /**
    * This method has a oneway modifier. That means the client only makes
    * a request and does not listen for any response at all. Oneway methods
    * must be void.
    */
   oneway void zip()
   i32 calculate(1:i32 logid, 2:Work w) throws (1:InvalidOperation ouch),
exception InvalidOperation {
  1: i32 whatOp,
  2: string why
}
typedef i32 MyInteger
const i32 INT32CONSTANT = 9853
enum Operation {
  ADD = 1,
  SUBTRACT = 2,
  MULTIPLY = 3,
  DIVIDE = 4
}
struct Work {
  1: i32 num1 = 0,
  2: i32 num2,
  3: Operation op,
  4: optional string comment,
}
thrift -r --gen js:node tutorial.thrift
http://thrift.apache.org/tutorial/java
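For comparison, a rough sketch of a handler for the calculate method defined above, along the lines of the official Java tutorial (only calculate and zip are shown; the tutorial's generated Calculator.Iface declares additional methods that a real handler must also implement):

import org.apache.thrift.TException;

public class CalculatorHandler implements Calculator.Iface {
    // Only calculate and zip are shown; the tutorial's Iface has more methods.

    @Override
    public int calculate(int logid, Work work) throws InvalidOperation, TException {
        switch (work.op) {
            case ADD:
                return work.num1 + work.num2;
            case SUBTRACT:
                return work.num1 - work.num2;
            case MULTIPLY:
                return work.num1 * work.num2;
            case DIVIDE:
                if (work.num2 == 0) {
                    InvalidOperation io = new InvalidOperation();
                    io.whatOp = work.op.getValue();
                    io.why = "Cannot divide by 0";
                    throw io;
                }
                return work.num1 / work.num2;
            default:
                InvalidOperation io = new InvalidOperation();
                io.whatOp = work.op.getValue();
                io.why = "Unknown operation";
                throw io;
        }
    }

    @Override
    public void zip() {
        // oneway: the client does not wait for (or receive) any response
    }
}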

http://thrift-tutorial.readthedocs.org/en/latest/thrift-stack.html
thrift -version
include "shared.thrift"

Runtime Library
The protocol and transport layer are part of the runtime library. This means that it is possible to define a service and change the protocol and transport without recompiling the code.
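For example, switching the CrawlingService client from the earlier example to the compact protocol only changes the wiring, not the generated code. A sketch (the server would need a matching TCompactProtocol.Factory; not tested):

import org.apache.thrift.TException;
import org.apache.thrift.protocol.TCompactProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class CompactClient {
    public static void main(String[] args) throws TException {
        // Same generated client as before; only the protocol object changes.
        TTransport transport = new TSocket("localhost", 7911);
        transport.open();
        TProtocol protocol = new TCompactProtocol(transport); // was TBinaryProtocol
        CrawlingService.Client client = new CrawlingService.Client(protocol);
        client.write(java.util.Collections.<Item>emptyList());
        transport.close();
    }
}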

Protocol Layer
The protocol layer provides serialization and deserialization. Thrift supports the following protocols:

TBinaryProtocol - A straight-forward binary format encoding numeric values as binary, rather than converting to text.
TCompactProtocol - Very efficient, dense encoding of data (See details below).
TDenseProtocol - Similar to TCompactProtocol but strips off the meta information from what is transmitted, and adds it back in at the receiver. TDenseProtocol is still experimental and not yet available in the Java implementation.
TJSONProtocol - Uses JSON for encoding of data.
TSimpleJSONProtocol - A write-only protocol using JSON. Suitable for parsing by scripting languages
TDebugProtocol - Uses a human-readable text format to aid in debugging.
Transport Layer
The transport layer is responsible for reading from and writing to the wire. Thrift supports the following:

TSocket - Uses blocking socket I/O for transport.
TFramedTransport - Sends data in frames, where each frame is preceded by a length. This transport is required when using a non-blocking server.
TFileTransport - This transport writes to a file. While this transport is not included with the Java implementation, it should be simple enough to implement.
TMemoryTransport - Uses memory for I/O. The Java implementation uses a simple ByteArrayOutputStream internally.
TZlibTransport - Performs compression using zlib. Used in conjunction with another transport. Not available in the Java implementation.
Processor
The processor takes an input and an output protocol as arguments. It reads data from the input, processes it through the Handler specified by the user, and then writes the result to the output.

Supported Servers
A server listens for connections on a port and hands the data it receives to the Processor to handle.

TSimpleServer - A single-threaded server using standard blocking I/O. Useful for testing.
TThreadPoolServer - A multi-threaded server using standard blocking I/O.
TNonblockingServer - A multi-threaded server using non-blocking I/O (the Java implementation uses NIO channels). TFramedTransport must be used with this server.
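A sketch of that last point, reusing the CrawlingService/CrawlingHandler from the earlier example (clients of this server must wrap their TSocket in a TFramedTransport; class locations follow the libthrift versions of this era):

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.server.TNonblockingServer;
import org.apache.thrift.server.TServer;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TNonblockingServerSocket;
import org.apache.thrift.transport.TNonblockingServerTransport;
import org.apache.thrift.transport.TTransportException;

public class NonblockingServerExample {
    public static void main(String[] args) throws TTransportException {
        TNonblockingServerTransport serverTransport = new TNonblockingServerSocket(7911);
        CrawlingService.Processor<CrawlingService.Iface> processor =
                new CrawlingService.Processor<CrawlingService.Iface>(new CrawlingHandler());

        TServer server = new TNonblockingServer(
                new TNonblockingServer.Args(serverTransport)
                        .processor(processor)
                        .transportFactory(new TFramedTransport.Factory())
                        .protocolFactory(new TBinaryProtocol.Factory()));

        // Clients must use TFramedTransport: the frame length header is how the
        // non-blocking server knows when a complete request has arrived.
        server.serve();
    }
}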


http://thrift-tutorial.readthedocs.org/en/latest/usage-example.html
https://thrift.apache.org/tutorial/java
https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

http://blog.csdn.net/proudmore/article/details/46413743
1) Asked: what is RPC, and how is it implemented? Solution:
Remote procedure call flow, client/server mode:
client: local call -> pack args -> transmit call packet to server
server: receive packet -> unpack -> call, do the work, return -> pack result -> transmit packet back to client
client: receive packet -> unpack -> local return

http://blog.panic.so/share/2015/05/08/Thrift%E7%9A%84%E7%90%86%E8%A7%A3/

Thrift's logical layers

Thrift provides everything needed for RPC: a language for defining interface prototypes, serialization implementations for various protocols, socket-based communication implementations for various transports, and generated processing logic (the Processor) in each target language derived from the interface definition. All we have to write is the interface implementation (the Handler).
Thrift abstracts four logical layers, which look a lot like the TCP/IP stack. The Transport layer represents data transfer and performs I/O, which can be the socket I/O mentioned above or disk-file I/O; the Protocol layer abstracts serialization and deserialization; the Processor layer abstracts the function-dispatch logic mentioned above; and finally the Server layer is responsible for overall scheduling and control. The function logic we implement ourselves, called the Handler here, also belongs to the Processor layer.
So how does it all run? The client, the side that initiates the request, serializes the request through the Protocol layer and sends the data out through the Transport layer. The server reads data from the Transport layer, deserializes it at the Protocol layer, and hands it to the Processor, which figures out which function is being called and dispatches it to the Handler we implemented; the return value is then serialized and written back to the Transport.

Transport layer:

TStreamTransport and TSocket are transport implementations built directly on file I/O and network I/O. TBufferedTransport wraps them with read/write buffering, improving performance by cutting down the number of file or network I/O operations. TFramedTransport wraps a simple frame-based protocol: besides buffering, it prepends 4 bytes to every transmission that record the length of the payload. Buffered and Framed differ little in performance; the difference is that non-blocking server implementations (such as Java's TNonblockingServer) can only use Framed, because a non-blocking server needs to know whether a complete request is ready.
Protocol layer:
TBinaryProtocol converts data into byte arrays, e.g. an i32 becomes 4 bytes; TCompactProtocol is an upgraded version of TBinaryProtocol whose serialized output is more compact and faster to process; TJSONProtocol converts data into JSON.

Server layer:

The Server is responsible for overall scheduling. Taking the Python library as an example: the simplest SimpleServer handles requests serially; ThreadedServer achieves concurrency by handing each request to its own thread; ThreadPoolServer uses a thread pool to eliminate the cost of creating threads; ForkingServer and ProcessPoolServer are the corresponding process-based implementations; NonblockingServer uses I/O multiplexing to reduce I/O blocking and raise throughput. The Go server implementation is very simple: each request is handed to a goroutine, which gives you I/O multiplexing essentially for free.

