Sunday, June 28, 2015

Flickr Architecture - High Scalability



http://blog.gainlo.co/index.php/2016/03/01/system-design-interview-question-create-a-photo-sharing-app/
Platform


  • PHP and MySQL
  • Shards
  • Memcached for a caching layer.
  • Squid in reverse-proxy for html and images.
  • Smarty for templating
  • PEAR for XML and Email parsing
  • ImageMagick, for image processing
  • SystemImager for deployment
  • Ganglia for distributed system monitoring
  • Subcon stores essential system configuration files in a subversion repository for easy deployment to machines in a cluster.
  • Cvsup for distributing and updating collections of files across a network.





  • Use dedicated servers for static content.
  • Talks about how to support Unicode.
  • Use a share nothing architecture.
  • Everything (except photos) is stored in the database.
  • Statelessness means they can bounce people around servers and it's easier to make their APIs.
  • Scaled at first by replication, but that only helps with reads.
  • Create a search farm by replicating the portion of the database they want to search.
  • Use horizontal scaling so they just need to add more machines.
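
The point that replication "only helps with reads" can be illustrated with a small routing sketch. This is not Flickr's code: the class name, the server labels, and the rule "anything starting with SELECT is a read" are all assumptions made for illustration.

```python
import itertools


class ReplicatedRouter:
    """Hypothetical router: reads rotate over replicas, writes go to the master."""

    def __init__(self, master, replicas):
        self.master = master
        # Round-robin iterator over the replicas, or None if there are none.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql):
        # Simplifying assumption: a statement is a read iff it starts with SELECT.
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        # Every write lands on the single master, however many replicas exist.
        return self.master


router = ReplicatedRouter("master", ["replica1", "replica2"])
print(router.route("SELECT * FROM photos"))        # replica1
print(router.route("UPDATE photos SET title='x'"))  # master
```

Adding replicas spreads the SELECT traffic, but the write path never changes, which is why a write-heavy workload eventually needs more than replication.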



Lessons Learned

  • Think of your application as more than just a web application
  • Go stateless. Statelessness makes for a simpler, more robust system that can handle upgrades without flinching.
  • Re-architecting your database sucks.
  • Capacity plan
  • Start slow
  • Measure reality. Capacity planning math should be based on real things, not abstract ones.
  • Build in logging and metrics. Usage stats are just as important as server stats. Build in custom metrics that relate real-world usage to server-based stats.
  • Cache. Caching and RAM are the answer to everything.
  • Abstract. Create clear levels of abstraction between database work, business logic, page logic, page mark-up and the presentation layer. This supports quick-turnaround, iterative development.
  • Layer. Layering allows developers to create page level logic which designers can use to build the user experience. Designers can ask for page logic as needed. It's a negotiation between the two parties.
  • Release frequently
  • Forget about small efficiencies
  • Test in production. Build into the architecture mechanisms (config flags, load balancing, etc.) with which you can deploy new hardware easily into (and out of) production.
  • Forget benchmarks. Benchmarks are fine for getting a general idea of capabilities, but not for planning. Artificial tests give artificial results, and the time is better used with testing for real.
  • Find ceilings.
    - What is the maximum something that every server can do?
    - How close are you to that maximum, and how is it trending?
    - MySQL (disk IO?)
    - SQUID (disk IO? or CPU?)
    - memcached (CPU? or network?)
  • Be sensitive to the usage patterns for your type of application.
  • Be sensitive to the demands of exponential growth
  • Plan for peaks
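
The "find ceilings" questions above reduce to simple arithmetic once a per-server maximum has been measured. A minimal sketch follows; the function names, the sample figures, and the linear-growth assumption are all illustrative and not from the source (real growth may be exponential, as the list warns).

```python
def headroom(current: float, ceiling: float) -> float:
    """Fraction of the measured ceiling that is still unused."""
    if ceiling <= 0:
        raise ValueError("ceiling must be positive")
    return 1.0 - current / ceiling


def days_until_ceiling(current: float, ceiling: float, growth_per_day: float) -> float:
    """Days until the ceiling is reached, assuming linear growth (an assumption)."""
    if growth_per_day <= 0:
        return float("inf")  # flat or shrinking usage never reaches the ceiling
    return max(0.0, (ceiling - current) / growth_per_day)


# Hypothetical figures: a MySQL host measured to saturate at 1200 disk IOPS,
# currently at 900 IOPS and growing by 15 IOPS per day.
print(headroom(900, 1200))                # 0.25
print(days_until_ceiling(900, 1200, 15))  # 20.0
```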

  • http://s.niallkennedy.com/blog/uploads/flickr_php.pdf
    JOINs are slow
    • Normalised data is for sissies
    • Keep multiple copies of data around
    • Makes searching faster
    • Have to ensure consistency in the application logic
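
As a rough illustration of "keep multiple copies of data around" with consistency enforced in application logic, the sketch below stores an owner's name both in a users table and, denormalised, on each photo row. The schema and the `rename_user` helper are hypothetical, and SQLite stands in for MySQL only so that the example is self-contained.

```python
import sqlite3


def rename_user(db, user_id, new_name):
    # Application code, not the database, keeps both copies in step.
    with db:  # one transaction: both updates commit, or neither does
        db.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))
        db.execute(
            "UPDATE photos SET owner_name = ? WHERE owner_id = ?",
            (new_name, user_id),
        )


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute(
    "CREATE TABLE photos (id INTEGER PRIMARY KEY, owner_id INTEGER, "
    "owner_name TEXT, title TEXT)"
)
db.execute("INSERT INTO users VALUES (1, 'alice')")
db.execute("INSERT INTO photos VALUES (10, 1, 'alice', 'sunset')")
db.commit()

rename_user(db, 1, "alice_b")

# The photo page can show the owner's name without a JOIN.
row = db.execute("SELECT title, owner_name FROM photos WHERE id = 10").fetchone()
print(row)  # ('sunset', 'alice_b')
```

The read path gets cheaper because no JOIN is needed; the cost is that every write touching the name must update each copy.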

    http://www.scribd.com/doc/2592098/DVPmysqlucFederation-at-Flickr-Doing-Billions-of-Queries-Per-Day
    What Problems Was Flickr Having?
    Master-Slave Topology
    • Slave lag
    • Multiple SPOFs
    • Unable to keep up with demand
    • Unable to serve search traffic
    • Multi-second page load times

    Design to Attain the Goal
    • Since the workload is write intensive, more than one master is needed: many write points.
    • To get rid of SPOFs, be redundant.
    • To allow real-time maintenance, traffic needs to stick to servers, and a single server needs to be able to handle all the traffic.
    • To serve pages fast with many queries, the data needs to be small enough to fit in memory.

    Federation Key Components
    • Shards
    • Global Ring
    • PHP logic to connect to the shards and keep the data consistent

    Shards
    • Shards are a slice of a main database
    • Shards are set up in Active Master-Master Ring Replication
      – Done by sticking a user to a server in a shard
      – Shard assignments are from a random number for new accounts
      – Migration is done from time to time
      – Can run on any hardware grade

    Global Ring
    • a.k.a. Lookup Ring
      – For stuff that can't be federated
      – Like where stuff is: Owner_id
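
The shard and lookup-ring notes above can be sketched together. This is a toy, in-memory stand-in and not Flickr's implementation: the `LookupRing` class and its method names are hypothetical, and the mapping from an owner ID to a shard ID is an assumption, since the source line describing the lookup is truncated.

```python
import random


class LookupRing:
    """Hypothetical global lookup: owner_id -> shard_id, held in memory."""

    def __init__(self, shard_ids, rng=None):
        self.shard_ids = list(shard_ids)
        self.owner_to_shard = {}
        self.rng = rng or random.Random()

    def assign_new_account(self, owner_id):
        # New accounts get a shard picked by a random number.
        if owner_id in self.owner_to_shard:
            raise ValueError(f"owner {owner_id} already has a shard")
        shard = self.rng.choice(self.shard_ids)
        self.owner_to_shard[owner_id] = shard
        return shard

    def shard_for(self, owner_id):
        # Every request consults the lookup first, so a user sticks to one shard.
        return self.owner_to_shard[owner_id]

    def migrate(self, owner_id, new_shard):
        # Occasional rebalancing: repoint the lookup after the data has moved.
        if new_shard not in self.shard_ids:
            raise ValueError(f"unknown shard {new_shard}")
        self.owner_to_shard[owner_id] = new_shard


ring = LookupRing(shard_ids=[1, 2, 3])
shard = ring.assign_new_account(owner_id=42)
assert ring.shard_for(42) == shard
```

Random assignment keeps new accounts spread evenly without a hash function to re-balance later; the price is that the lookup itself cannot be sharded and must be served centrally.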

    Allow for Maintenance
    • Each server in a shard is 50% loaded
      – i.e. one server in a shard can take the full load if another server of that shard is down or in maintenance mode
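
The 50% rule is a capacity invariant that can be checked mechanically. A minimal sketch, generalised to shards of any size; the function name and the load units are assumptions, not from the slides.

```python
def survives_single_failure(server_loads, capacity_per_server):
    """True if the remaining servers could absorb the shard's whole load
    after any one server goes down or enters maintenance."""
    n = len(server_loads)
    if n < 2:
        return False  # a lone server has nothing to fail over to
    return sum(server_loads) <= (n - 1) * capacity_per_server


# Two-server shard, each server at 50% of a capacity of 100 units.
print(survives_single_failure([50, 50], 100))  # True
# One server creeps up to 60%: the survivor would need to carry 110 units.
print(survives_single_failure([60, 50], 100))  # False
```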

    http://laughingmeme.org/2009/09/29/try-coding-dear-boy/


    “We generally try do the dumbest thing that will work first. And that’s usually as far as we get. Almost everything we do is pretty straightforward, and as such is well documented around the Web, sometimes by us, generally by others. And when we do get fiendishly clever, as we do now and again, it’s usually a highly tuned (read idiosyncratic) solution for the problems we’re trying to solve.”
    http://www.slideshare.net/techdude/scalable-web-architectures-common-patterns-and-approaches/138

    - Scalable Web Architecture and Distributed Systems (Flickr)
    English: http://www.aosabook.org/en/distsys.html
    Chinese: http://goo.gl/u3MoqU

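    - A Study of the Flickr Website Architecture
    (1) Overview of the Flickr website architecture
    (2) Initial database scaling: Replication
    (3) Shard: the ultimate weapon for scaling large-site databases?
    (4) Considerations in designing a massive file system
    (5) Exploring the "Flickr File System" (Part 1)
    (6) Exploring the "Flickr File System" (Part 2)
    (7) System monitoring and fault management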

    - Building Scalable Web Sites (by Cal Henderson, from Flickr)
    - Database War Stories #3: Flickr (by Tim O'Reilly)





    https://mp.weixin.qq.com/s/Vc2RYh1evaMetwkca0lP-w

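    Lossy Service


    • Give up absolute consistency in pursuit of maximum speed

    For an internet service, or more specifically for a photo album service, when a user likes something we want them to see the like succeed immediately. Whether the total is 1000 or 1001, or in what order different people liked at around the same time, is not something users care much about. In other words, we can trade absolute consistency for a better user experience.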

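    For photo uploads, the spike at midnight on New Year's Day is more than 4 times the usual peak. Provisioning enough capacity to absorb that spike would be enormously expensive. So what does our album service do?
    Maximum speed: as soon as a user posts a photo, the backend caches the data in memory and returns success, and the local client generates a fake feed, so the photo's owner can already see the post as published. When the data is persisted, success is also returned as soon as the index write succeeds, even though the data has not yet been synchronized to multiple regions. The album backend generates a second-level fake feed to smooth over the user experience until synchronization completes.
    Best effort: to improve the user experience, we set up cache servers; once the data is cached, success is returned to the user, and the backend persists it to storage as quickly as possible. Requests that fail at midnight are sent to a retry message queue, and the backend delivers the failed messages as quickly as it can.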
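
The "cache, acknowledge, persist later, retry on failure" flow described above can be sketched as a small write-behind queue. This is an illustrative toy, not the album service's code: the `WriteBehindUploader` class, its method names, and the use of `OSError` to signal a failed write are all assumptions.

```python
from collections import deque


class WriteBehindUploader:
    """Hypothetical sketch: acknowledge once cached, persist in the background."""

    def __init__(self, persist):
        self.persist = persist   # callable(photo_id, data); may raise OSError
        self.cache = {}          # in-memory copy that backs the early "success"
        self.pending = deque()   # photo ids not yet durably stored

    def upload(self, photo_id, data):
        self.cache[photo_id] = data
        self.pending.append(photo_id)
        return True  # acknowledged before the data reaches durable storage

    def drain(self):
        """Try each pending photo once; failures are re-queued for a later pass.
        Returns the number of photos persisted in this pass."""
        persisted = 0
        for _ in range(len(self.pending)):
            photo_id = self.pending.popleft()
            try:
                self.persist(photo_id, self.cache[photo_id])
                persisted += 1
            except OSError:
                self.pending.append(photo_id)  # retry queue
        return persisted
```

A caller would invoke `upload` on the request path and run `drain` from a background worker. The trade-off is the one the text names: the user sees success early, and the system accepts a window in which acknowledged data exists only in memory.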

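    On the client side, an unstable network connection can interrupt uploads and downloads. Besides backend support for resumable transfers, the client also needs a retry strategy and, when necessary, a pop-up to notify the user.
    At the logic layer, state shared with the client must be recorded, and storage must be protected during peaks. For reads, cache data as much as possible; for writes, maintain a request queue to shave the peak.
    At the storage layer, for reads, fetch as much as possible in one pass and cache it in memory to reduce disk I/O; for writes, record a binlog and get the data onto disk as quickly as possible.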

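    During evening peaks and holidays, when peak bandwidth for image requests gets too high, the large-image overlay stops prefetching images, or prefetches fewer, to shave the peak.
    When traffic surges, for example on New Year's Day, on Chinese New Year's Eve, or during a sudden hot event, we serve medium-size images in place of large ones for specific large-image download scenarios.
    In extreme cases, we allow access only to small images and thumbnails. At failure levels beyond that, we can also tell users that the album feature is temporarily unavailable.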
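
The escalating measures above form a ladder of degradation levels. A minimal sketch, assuming four hypothetical levels and image size names of my own choosing; serving "small" for any larger request at the most restrictive level is also an assumption.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical degradation levels, mildest first."""
    NORMAL = 0
    NO_PREFETCH = 1       # stop prefetching in the large-image overlay
    MEDIUM_FOR_LARGE = 2  # answer large-image requests with medium images
    SMALL_ONLY = 3        # only small images and thumbnails are served
    UNAVAILABLE = 4       # album feature reported as temporarily unavailable


def should_prefetch(level):
    return level < Level.NO_PREFETCH


def image_to_serve(requested, level):
    """Return the size actually served, or None if the album is unavailable."""
    if level >= Level.UNAVAILABLE:
        return None
    if level >= Level.SMALL_ONLY:
        return requested if requested in ("small", "thumbnail") else "small"
    if level >= Level.MEDIUM_FOR_LARGE and requested == "large":
        return "medium"
    return requested


print(image_to_serve("large", Level.MEDIUM_FOR_LARGE))  # medium
print(image_to_serve("large", Level.SMALL_ONLY))        # small
```

Because the levels are ordered, each check uses `>=`, so a higher level automatically includes every milder measure.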
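    The album service consists mainly of three kinds of requests: listing (album lists and photo lists), upload, and download. Normal peaks are 130,000 requests/s for listing, 50,000/s for upload, and 300,000/s for download. An initial estimate put the shortfall at more than 1,000 machines; reserves were limited, and there was no way to obtain that many machines on short notice. With requests still rising fast and machine load climbing, the pressure on us kept growing: without emergency measures, the service could have collapsed at any moment.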

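    • Reject promptly to prevent an avalanche: enable the overload-protection plugin; in the logic modules, machines above 90% CPU reject requests to prevent a cascading failure.
    • Fetch once, protect the backend: for list requests, fetching 40 photos or 100 puts the same load on the index, so the logic module fetches more photos per request and caches them on the local machine.
    • Users first, adjust as needed: in the current product, album lists and photos are fetched in reverse chronological order. For this event, users all needed to reach old data, so the owner view was switched to chronological order.
    • Best effort, lossy service: enable the disaster-recovery flag; lists return only 120 photos, and pull-down-for-more requests are blocked.
    • The more the better: urgently coordinated every available machine and urgently scaled out the index and compression modules.
    • Scale flexibly, degrade service: for large-image scenarios, we uniformly return small images so that users can still access them.
    • Separate heavy from light, favor what matters: in the thumbnail display logic, super-resolution applies special handling to photos of faces, and face detection runs in an asynchronous module that hits the index heavily. We stopped this processing to protect the main path first and lowered the priority of other services.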
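
Two of the emergency measures above, rejecting requests on machines above 90% CPU and capping lists at 120 photos, are simple enough to sketch. The function names and the offset/limit paging model are assumptions; only the 90% threshold and the 120-photo cap come from the text.

```python
CPU_REJECT_THRESHOLD = 0.90  # from the text: machines above 90% CPU reject requests
DEGRADED_LIST_CAP = 120      # from the text: lists return only 120 photos


def admit(cpu_load):
    """Overload protection: refuse work early instead of queueing into collapse."""
    return cpu_load <= CPU_REJECT_THRESHOLD


def list_photos(photo_ids, offset, limit, degraded):
    """Return one page of photo ids; in degraded mode nothing past the cap is served."""
    if degraded:
        if offset >= DEGRADED_LIST_CAP:
            return []  # "pull down for more" is blocked
        limit = min(limit, DEGRADED_LIST_CAP - offset)
    return photo_ids[offset:offset + limit]


ids = list(range(500))
print(admit(0.95))                                   # False
print(len(list_photos(ids, 0, 100, degraded=True)))  # 100
print(list_photos(ids, 120, 40, degraded=True))      # []
```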
