Massive Technical Interviews Tips: A Distributed Systems Reading List

http://dancres.github.io/Pages/

Thought Provokers

Ramblings that make you think about the way you design. Not everything can be solved with big servers, databases and transactions.

Harvest, Yield and Scalable Tolerant Systems - Real world applications of CAP from Brewer et al

On Designing and Deploying Internet Scale Services - James Hamilton
Latency Exists, Cope! - Commentary on coping with latency and its architectural impacts
Latency - the new web performance bottleneck - not at all new (see Patterson), but noteworthy
The Perils of Good Abstractions - Building the perfect API/interface is difficult
Chaotic Perspectives - Large scale systems are everything developers dislike - unpredictable, unordered and parallel
Website Architecture - A collection of scalable architecture papers from various large websites
Data on the Outside versus Data on the Inside - Pat Helland
Memories, Guesses and Apologies - Pat Helland
SOA and Newton's Universe - Pat Helland
Building on Quicksand - Pat Helland
Why Distributed Computing? - Jim Waldo
A Note on Distributed Computing - Waldo, Wollrath et al
Stevey's Google Platforms Rant - Yegge's SOA platform experience

Amazon

Somewhat about the technology but more interesting is the culture and organization they've created to work with it.

A Conversation with Werner Vogels - Coverage of Amazon's transition to a service-based architecture
Discipline and Focus - Additional coverage of Amazon's transition to a service-based architecture
Vogels on Scalability
SOA creates order out of chaos @ Amazon

Google

Current "rocket science" in distributed systems.

MapReduce
Chubby Lock Manager
Google File System
BigTable
Data Management for Internet-Scale Single-Sign-On
Dremel: Interactive Analysis of Web-Scale Datasets
Large-scale Incremental Processing Using Distributed Transactions and Notifications
Megastore: Providing Scalable, Highly Available Storage for Interactive Services - Smart design for low latency Paxos implementation across datacentres.
Spanner - Google's scalable, multi-version, globally-distributed, and synchronously-replicated database.
Photon - Fault-tolerant and Scalable Joining of Continuous Data Streams. Joins are tough especially with time-skew, high availability and distribution.
Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing - Data warehousing system that stores critical measurement data related to Google's Internet advertising business.

eBay

Interesting that they dumped most of J2EE and use a lot of database partitioning. Check out their site upgrade tool as well.

SD Forum 2006

Consistency Models

Key to building systems that suit their environments is finding the right tradeoff between consistency and availability.

CAP Conjecture - Consistency, Availability, Partition Tolerance cannot all be satisfied at once
Consistency, Availability, and Convergence - Proves the upper bound for consistency possible in a typical system
CAP Twelve Years Later: How the "Rules" Have Changed - Eric Brewer expands on the original tradeoff description
Consistency and Availability - Vogels
Eventual Consistency - Vogels
Avoiding Two-Phase Commit - Two phase commit avoidance approaches
2PC or not 2PC, Wherefore Art Thou XA? - Two phase commit isn't a silver bullet
Life Beyond Distributed Transactions - Helland
If you have too much data, then 'good enough' is good enough - NoSQL, Future of data theory - Pat Helland
Starbucks doesn't do two phase commit - Asynchronous mechanisms at work
You Can't Sacrifice Partition Tolerance - Additional CAP commentary
Optimistic Replication - Relaxed consistency approaches for data replication
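
One concrete way to see the consistency/availability tradeoff is the replica-quorum arithmetic that Vogels discusses in his writing on eventual consistency: with N replicas, W acknowledgements per write and R replicas consulted per read, reads are guaranteed to see the latest write only when R + W > N. The sketch below is only an illustration of that arithmetic; the function names `quorums_overlap` and `tolerated_failures` are made up for this example and do not come from any of the papers.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> int:
    """Replicas that may be down while both reads and writes can still gather a quorum."""
    return n - max(r, w)


# (N, R, W) settings for a hypothetical three-replica store.
for n, r, w in [(3, 2, 2), (3, 1, 1), (3, 1, 3)]:
    print(
        f"N={n} R={r} W={w}: "
        f"overlap={quorums_overlap(n, r, w)}, "
        f"tolerates {tolerated_failures(n, r, w)} replica failure(s)"
    )
```

Lowering R and W keeps the store available through more failures but gives up the overlap guarantee, which is the tradeoff the papers above explore in far more depth.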

Theory

Papers that describe various important elements of distributed systems design.

Distributed Computing Economics - Jim Gray
Rules of Thumb in Data Engineering - Jim Gray and Prashant Shenoy
Fallacies of Distributed Computing - Peter Deutsch
Impossibility of distributed consensus with one faulty process - also known as FLP [access requires an account and/or payment; a free version can be found here]
Unreliable Failure Detectors for Reliable Distributed Systems - A method for handling the challenges of FLP
Lamport Clocks - How do you establish a global view of time when each computer's clock is independent
The Byzantine Generals Problem
Lazy Replication: Exploiting the Semantics of Distributed Services
Scalable Agreement - Towards Ordering as a Service
Scalable Eventually Consistent Counters over Unreliable Networks - Scalable counting is tough in an unreliable world
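
To make the Lamport Clocks entry above a little more concrete, here is a minimal sketch of a logical clock, assuming a single-threaded process and using illustrative names (`LamportClock`, `tick`, `receive`) that are not taken from the paper. Each process keeps a counter, bumps it on every local event or send, and on receipt jumps past the timestamp carried by the message.

```python
class LamportClock:
    """Logical clock: a counter that orders events without wall-clock time."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event or message send: advance the counter and return the timestamp."""
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        """Message receipt: move past the sender's timestamp, then count the event."""
        self.time = max(self.time, sender_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()                      # a: local event
stamp = a.tick()              # a: sends a message carrying this timestamp
b.tick()                      # b: unrelated local event
received = b.receive(stamp)   # b: receipt is ordered after the send
print(stamp, received)
```

The guarantee is one-directional: if one event happened before another, its timestamp is smaller, but a smaller timestamp alone does not prove that one event caused the other.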

Languages and Tools

Issues of distributed systems construction with specific technologies.

Programming Distributed Erlang Applications: Pitfalls and Recipes - Building reliable distributed applications isn't as simple as merely choosing Erlang and OTP.

Infrastructure

Principles of Robust Timing over the Internet - Managing clocks is essential for even basics such as debugging

Storage

Paxos Consensus

Understanding this algorithm is the challenge. I would suggest reading "Paxos Made Simple" before the other papers and again afterward.

The Part-Time Parliament - Leslie Lamport
Paxos Made Simple - Leslie Lamport
Paxos Made Live - An Engineering Perspective - Chandra et al
Revisiting the Paxos Algorithm - Lynch et al
How to build a highly available system with consensus - Butler Lampson
Reconfiguring a State Machine - Lamport et al - changing cluster membership
Implementing Fault-Tolerant Services Using the State Machine Approach: a Tutorial - Fred Schneider
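
As a companion to "Paxos Made Simple", the following is a rough sketch of single-decree Paxos with everything difficult removed: calls are synchronous and in-memory, no messages are lost, nothing is persisted, and proposal numbers are plain integers chosen by the caller. The names `Acceptor` and `propose` are hypothetical. Treat it as a reading aid for the two phases, not as a usable implementation.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0      # highest proposal number promised so far
        self.accepted = None   # (number, value) of the last accepted proposal, if any

    def prepare(self, n):
        """Phase 1b: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Phase 2b: accept unless a higher-numbered prepare has been promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run one proposal round; return the chosen value, or None if the round fails."""
    majority = len(acceptors) // 2 + 1

    # Phase 1a: ask every acceptor for a promise.
    replies = [a.prepare(n) for a in acceptors]
    granted = [accepted for ok, accepted in replies if ok]
    if len(granted) < majority:
        return None

    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prior = [accepted for accepted in granted if accepted is not None]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]

    # Phase 2a: ask the acceptors to accept the (possibly adopted) value.
    votes = sum(a.accept(n, value) for a in acceptors)
    return value if votes >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "red")     # nothing chosen yet, so "red" is chosen
second = propose(acceptors, 2, "blue")   # must adopt the already chosen "red"
stale = propose(acceptors, 1, "green")   # number too low, round is rejected
print(first, second, stale)
```

The second call is the interesting one: a later proposer cannot overwrite a chosen value, it can only re-propose it, which is the safety property the papers above spend most of their effort proving.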

Other Consensus Papers

Mencius: Building Efficient Replicated State Machines for WANs - consensus algorithm for wide-area network

Gossip Protocols (Epidemic Behaviours)

P2P

Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications
Kademlia: A Peer-to-peer Information System Based on the XOR Metric
Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems
PAST: A large-scale, persistent peer-to-peer storage utility - storage system atop Pastry
SCRIBE: A large-scale and decentralised application-level multicast infrastructure - wide area messaging atop Pastry
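
The XOR metric in the Kademlia entry above is simple enough to sketch. The paper uses 160-bit identifiers; the 4-bit integers and the helper names below (`xor_distance`, `closest_nodes`, `bucket_index`) are assumptions made purely for illustration.

```python
def xor_distance(a: int, b: int) -> int:
    """Distance between two node IDs is their bitwise XOR, read as an integer."""
    return a ^ b


def closest_nodes(target: int, node_ids: list[int], k: int = 2) -> list[int]:
    """Return the k known node IDs nearest to target under the XOR metric."""
    return sorted(node_ids, key=lambda node: xor_distance(node, target))[:k]


def bucket_index(own_id: int, other_id: int) -> int:
    """Position of the highest differing bit, which selects the routing bucket.

    Returns -1 when both IDs are equal (a node does not bucket itself).
    """
    return xor_distance(own_id, other_id).bit_length() - 1


nodes = [0b0001, 0b0100, 0b0111, 0b1100]
target = 0b0101
nearest = closest_nodes(target, nodes, k=2)
print([format(node, "04b") for node in nearest])
print(bucket_index(target, 0b0100), bucket_index(target, 0b1100))
```

Because XOR is symmetric, a node that considers another node close is considered close in return, so lookups that pass through a node also refresh that node's routing information.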

http://blog.jobbole.com/84575/

Thought Provokers

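Ramblings that make you think about the way you design. Not everything can be solved with big servers, databases and transactions.

Harvest, Yield and Scalable Tolerant Systems – Real-world applications of CAP from Brewer et al
On Designing and Deploying Internet Scale Services – James Hamilton
Latency Exists, Cope! – Commentary on coping with latency and its architectural impacts
Latency – the new web performance bottleneck – not exactly new, but worth noting
The Perils of Good Abstractions – Building the perfect API/interface is difficult
Chaotic Perspectives – Large-scale systems are everything developers dislike: unpredictable, unordered and parallel
Website Architecture – A collection of scalable architecture papers from various large websites
Data on the Outside versus Data on the Inside – Pat Helland
Memories, Guesses and Apologies – Pat Helland
SOA and Newton’s Universe – Pat Helland
Building on Quicksand – Pat Helland
Why Distributed Computing – Jim Waldo
A Note on Distributed Computing – Waldo, Wollrath et al
Stevey’s Google Platforms Rant – Yegge’s SOA platform experience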

Amazon

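Somewhat about the technology, but more interesting is the culture and organization they’ve created to work with it.

A Conversation with Werner Vogels – Coverage of Amazon’s transition to a service-based architecture
Discipline and Focus – Additional coverage of Amazon’s transition to a service-based architecture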
Vogels on Scalability
SOA creates order out of chaos @ Amazon

Google

Current “rocket science” in distributed systems.

MapReduce
Chubby Lock Manager
Google File System
BigTable
Data Management for Internet-Scale Single-Sign-On
Dremel: Interactive Analysis of Web-Scale Datasets
Large-scale Incremental Processing Using Distributed Transactions and Notifications
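Megastore: Providing Scalable, Highly Available Storage for Interactive Services – Smart design for a low-latency Paxos implementation across datacentres.
Spanner – Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database.
Photon – Fault-tolerant and scalable joining of continuous data streams. Joins are tough, especially with time skew, high availability and distribution.
Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing – Data warehousing system that stores critical measurement data related to Google’s Internet advertising business.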

eBay

Interesting that they dumped most of J2EE and use a lot of database partitioning. Check out their site upgrade tool as well.

SD Forum 2006

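Consistency Models

Key to building systems that suit their environments is finding the right tradeoff between consistency and availability.

CAP Conjecture – Consistency, Availability, Partition Tolerance cannot all be satisfied at once
Consistency, Availability, and Convergence – Proves the upper bound for consistency possible in a typical system
CAP Twelve Years Later: How the “Rules” Have Changed – Eric Brewer expands on the original tradeoff description
Consistency and Availability – Vogels
Eventual Consistency – Vogels
Avoiding Two-Phase Commit – Two-phase commit avoidance approaches
2PC or not 2PC, Wherefore Art Thou XA – Two-phase commit isn’t a silver bullet
Life Beyond Distributed Transactions – Helland
If you have too much data, then ‘good enough’ is good enough – NoSQL, Future of data theory – Pat Helland
Starbucks doesn’t do two phase commit – Asynchronous mechanisms at work
You Can’t Sacrifice Partition Tolerance – Additional CAP commentary
Optimistic Replication – Relaxed consistency approaches for data replication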

Theory

Papers that describe various important elements of distributed systems design.

Distributed Computing Economics – Jim Gray
Rules of Thumb in Data Engineering – Jim Gray and Prashant Shenoy
Fallacies of Distributed Computing – Peter Deutsch
Impossibility of distributed consensus with one faulty process – also known as FLP [access requires an account and/or payment; a free version can be found here]
Unreliable Failure Detectors for Reliable Distributed Systems – A method for handling the challenges of FLP
Lamport Clocks – How do you establish a global view of time when each computer’s clock is independent
The Byzantine Generals Problem
Lazy Replication: Exploiting the Semantics of Distributed Services
Scalable Agreement – Towards Ordering as a Service
Scalable Eventually Consistent Counters over Unreliable Networks – Scalable counting is tough in an unreliable world

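Languages and Tools

Issues of distributed systems construction with specific technologies.

Programming Distributed Erlang Applications: Pitfalls and Recipes – Building reliable distributed applications isn’t as simple as merely choosing Erlang and OTP.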

Infrastructure

Principles of Robust Timing over the Internet – Managing clocks is essential, even for basics such as debugging

Storage

Paxos Consensus

Understanding this algorithm is the challenge. I would suggest reading “Paxos Made Simple” before the other papers and again afterward.

The Part-Time Parliament – Leslie Lamport
Paxos Made Simple – Leslie Lamport
Paxos Made Live – An Engineering Perspective – Chandra et al
Revisiting the Paxos Algorithm – Lynch et al
How to build a highly available system with consensus – Butler Lampson
Reconfiguring a State Machine – Lamport et al – changing cluster membership
Implementing Fault-Tolerant Services Using the State Machine Approach: a Tutorial – Fred Schneider

Other Consensus Papers

Mencius: Building Efficient Replicated State Machines for WANs – consensus algorithm for wide-area networks

Gossip Protocols (Epidemic Behaviours)

P2P

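Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications
Kademlia: A Peer-to-peer Information System Based on the XOR Metric
Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems
PAST: A large-scale, persistent peer-to-peer storage utility – storage system atop Pastry
SCRIBE: A large-scale and decentralised application-level multicast infrastructure – wide area messaging atop Pastry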

Wednesday, December 2, 2015
