AppFormix Blog

How PayPal Runs the World's Largest Private OpenStack Cloud

By Chris Hamilton on December 08, 2015

Challenges of PayPal creating the largest OpenStack Cloud

The PayPal/eBay divorce became official on July 17th of this year.  Seven months prior, PayPal operators were commissioned to start the incredible task of surgically removing the PayPal enterprise cloud and transplanting it to a brand new rack body.  Not only did their great cloud migration succeed, but PayPal’s OpenStack enterprise cloud is now the largest OpenStack private cloud in the world.  Here’s how they did it.

At over 400,000 cores, the PayPal Enterprise Cloud is the largest OpenStack private cloud in the world.  After their split from eBay, they managed to build this cloud and still be ready for all your holiday online purchasing needs!  What follows is the set of priorities and principles that AppFormix distilled from a talk given by Anand Palanisamy and Kalin Nikolov of PayPal at this past OpenStack Summit.  It covers their technical challenges, the lessons they learned, the tools they built and used, and what they hope to accomplish in the near future.

The Build

So, you are a PayPal operator and you have less than 6 months to build your new enterprise cloud and less than 10 weeks to separate and migrate it from the source cloud.  What are your priorities? What is your advice?

Build Priorities:

  • Cloud availability.  PayPal sought to achieve this by using multi-tenancy and multiple cells.  (See the discussion of Nova Cells below.)
  • Abandon old tooling that no longer matches priorities, even if you spent 4 years building it.  Empowering vendors to build plugins is one of PayPal's top priorities.  Ops from PayPal and eBay spent the past 4 years building an in-house Load Balancer as a Service (LBaaS) tool, but on moving, PayPal switched to the community LBaaS so that vendors will have an easier time building plugins and other features for PayPal's cloud.
  • Live cloud updating and live cloud testing.  Live updates and live cloud testing are among the most important capabilities for making your cloud robust and quickly scalable.  “If you are in a live upgrade mode, you free yourself up to be in the community,” says Anand.
  • Real-time capacity scaling.  PayPal wanted to achieve a 1-2 day turnaround from the moment a new rack arrives at the data center until it is running apps.

The primary objectives of the new PayPal cloud were availability, scalability, live updating and live cloud testing.  Security as a priority also goes without saying for PayPal, the pioneer of web payments.  As part of their commitment to scalability and availability, PayPal set a long-term goal of constructing an availability zone in a week.  At the time of this talk (only a couple of months after completion of the new cloud), PayPal had already cut availability zone construction from 6 months to 4 weeks.

[Image: PayPal cloud characteristics]

Cloud Migration: PayPal's Great Data Pilgrimage

Once PayPal had their cloud set up, how did they actually go about migrating to it?
PayPal's cloud build-out itself was completed at the end of April 2015.  The PayPal operations and IT teams then had 10 weeks to migrate.  At the split from eBay, the PayPal operations team faced the daunting task of migrating well over 8,000 instances.  They had a team of only 15 people handling VM support for the pre- and post-migration issues of developers and applications.  There were also about 2 petabytes of volume data to be copied, in addition to the local data owned by each and every VM and owner.  "We are not talking about simple copying either, but copying disk files from the 3 different geographical regions of what had been the eBay/PayPal Availability Zones," says Anand.  Whoa now.  Take a breath, because that’s not all...
 
[Image: PayPal data center and VM migration]
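
To put the volume data in perspective, here is a quick back-of-envelope calculation of the average copy rate needed to move about 2 petabytes in a 10-week window.  This is our own arithmetic, not a figure from the talk, and it assumes decimal units (1 PB = 10^15 bytes) and copying around the clock for the full window.

```python
# Back-of-envelope: sustained throughput needed to copy ~2 PB in 10 weeks.
# Assumptions (ours, not from the talk): decimal units (1 PB = 10**15 bytes)
# and copying around the clock for the whole migration window.

PETABYTE = 10**15
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def required_throughput(total_bytes: int, weeks: float) -> float:
    """Return the average bytes per second needed to finish within `weeks`."""
    return total_bytes / (weeks * SECONDS_PER_WEEK)


rate = required_throughput(2 * PETABYTE, 10)
print(f"{rate / 10**6:.0f} MB/s")        # megabytes per second
print(f"{rate * 8 / 10**9:.2f} Gbit/s")  # gigabits per second
```

That works out to roughly 330 MB/s sustained, before counting the local data on each VM or any time lost to retries and maintenance windows.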
 
PayPal wanted to use the migration from eBay as an opportunity to do away with the old Nova Availability Zones and to use the new and improved Nova Cells service for the first time to increase availability and scalability.  For the uninitiated, the Nova Cells service adds scaling and geographic distribution capabilities without the complicated database or message queue clustering of Zones.  Cells also allow operators to separate host scheduling from cell scheduling.  Cells are similar to the AWS EC2 Zone concept, except that Cells are designed for private enterprise OpenStack clouds.
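
To make the idea of separating cell scheduling from host scheduling concrete, here is a toy two-level scheduler in Python.  It is only a sketch of the concept: the `Cell` and `Host` classes, the function names, and the "most free vCPUs wins" policy are all our own assumptions and do not reflect Nova's actual scheduler code or PayPal's configuration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Host:
    name: str
    free_vcpus: int


@dataclass
class Cell:
    name: str
    hosts: List[Host] = field(default_factory=list)

    @property
    def free_vcpus(self) -> int:
        return sum(h.free_vcpus for h in self.hosts)


def pick_cell(cells: List[Cell], vcpus: int) -> Cell:
    """Cell scheduling: pick the emptiest cell that has a host able to fit the VM."""
    candidates = [c for c in cells if any(h.free_vcpus >= vcpus for h in c.hosts)]
    if not candidates:
        raise RuntimeError("no cell can fit the request")
    return max(candidates, key=lambda c: c.free_vcpus)


def pick_host(cell: Cell, vcpus: int) -> Host:
    """Host scheduling: inside the chosen cell, pick the host with the most free vCPUs."""
    fitting = [h for h in cell.hosts if h.free_vcpus >= vcpus]
    return max(fitting, key=lambda h: h.free_vcpus)


def schedule(cells: List[Cell], vcpus: int) -> Tuple[str, str]:
    """Place one VM and return (cell name, host name)."""
    cell = pick_cell(cells, vcpus)
    host = pick_host(cell, vcpus)
    host.free_vcpus -= vcpus  # reserve the capacity
    return cell.name, host.name


cells = [
    Cell("cell-a", [Host("a1", 8), Host("a2", 4)]),
    Cell("cell-b", [Host("b1", 16)]),
]
first = schedule(cells, 4)
second = schedule(cells, 4)
print(first, second)
```

The point of the split is that the top level only needs a summary of each cell, while the detailed host decision stays local to the cell that was chosen.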
 
What advice would PayPal operators give to other operators faced with daunting migration tasks?  Here are our takeaways from the talk...
 

Lessons from Migration

  • Infrastructure visibility is the best way to prevent configuration drift.  Anand says a complete view of the cloud deployment is vital, especially in the infrastructure layers.  "You deploy code and if your hypervisor is down, once it comes back up, it's going to have a different code base and configuration."
  • Don’t wait for all the apps and developers to migrate.  “It is NOT true that if you set up a cloud, people will move to it.”  Sure, you may lose a couple, but “in the cloud world, no one cares about 2 VMs,” Anand concludes with a chuckle.
  • Don't auto-deploy configurations from the get-go.  Anand recommended turning off all automatic configuration enabled by tools like Puppet and Salt.  "We deploy to 5% - 10%.  We make sure everything is good, then we control it out everywhere else."  This manual deployment is then orchestrated over time.
  • When migrating API services, be CAREFUL!  “You can’t just take down APIs because a lot of automation services are built on that.”  Anand suggests that you find windows of time between your critical operations cycles to migrate APIs and the services that use them.  From experience, he also suggests that you never try to migrate all the services in one window.  When migration is part of an upgrade, also make sure that all upgrades are complete and not impacting other layers.
  • Keep in mind what you are migrating.  Don't do anything risky or try to save time at the expense of taking down your production APIs or VIP instances.
  • Remember, "If there is an issue, you can't always just roll the configuration change back."  While it may be tedious, Anand suggests you disable auto configuration for migration, especially when migrating APIs.
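
The "deploy to 5% - 10%, check, then go everywhere else" advice can be sketched as a small staged rollout in Python.  This is a minimal illustration under our own assumptions: `apply_change` and `is_healthy` are hypothetical callbacks you would supply, and nothing here is PayPal's actual tooling or a Puppet or Salt API.

```python
import math
from typing import Callable, Iterable, List


def staged_rollout(
    hosts: Iterable[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    canary_percent: int = 5,
) -> List[str]:
    """Apply a change to a small canary slice first, then to the rest.

    Returns the hosts that were changed. If any canary is unhealthy after
    the change, the rollout stops and only the canaries are returned.
    """
    hosts = list(hosts)
    if not hosts:
        return []
    # Always change at least one host, rounding the canary slice up.
    canary_count = max(1, math.ceil(len(hosts) * canary_percent / 100))
    canaries, rest = hosts[:canary_count], hosts[canary_count:]

    for host in canaries:
        apply_change(host)
    if not all(is_healthy(host) for host in canaries):
        return canaries  # stop here; a human decides what happens next

    for host in rest:
        apply_change(host)
    return hosts


# Hypothetical usage: record which hosts were touched instead of deploying.
touched: List[str] = []
fleet = [f"hv{i}" for i in range(100)]
changed = staged_rollout(fleet, touched.append, lambda host: True)
print(len(changed))
```

Stopping after the canaries, rather than rolling back automatically, matches the warning above that you can't always just roll a configuration change back.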

 The Tooling

We now know that PayPal is not opposed to severing emotional (and technical) connections to tooling, so the tools they do elect to keep and use are noteworthy.  PayPal used a combination of open source and in-house tooling on their journey to a better cloud.

Open Source Tooling

  • Graphite for graphing
  • Puppet & Salt for configuration management
  • Zabbix for monitoring
  • Cobbler for bare-metal provisioning

 In-House Tooling

The in-house hero of the migration was a tool called Flyway.  Flyway was PayPal's solution for migrating cloud resources, and it was responsible for migrating thousands of VMs from eBay to PayPal.  Flyway can copy:

  • Nova VMs
  • Users, Tenants, Roles, Keypairs, Quotas
  • Images and Snapshots
  • Cinder volumes and data
  • Trove Database instances
  • LBaaS VIP instances and certs

PayPal is going to open source the awesome Flyway tool on GitHub in the near future.  Check back at the PayPal GitHub to see when Flyway is available!

In addition to Flyway, PayPal used the following in-house tools, some of which have not yet made it onto the PayPal GitHub repo.
 

  • Stackwatch/Stackmetrics for cloud health and metrics
  • Reparo for server remediation and provisioning
  • Cloudinfo for cloud view and visibility
  • CloudMinion for capacity reclamation
  • CMS
 

Near Future Operations Projects

Curious about the immediate future projects for the PayPal cloud?  Anand says they will be working on upgrading to Kilo, Infra-AZ, and configuration drift management.  Perhaps at the next summit, we will hear about how they moved to a controlled, masterless Puppet setup that runs deployment.
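
For readers new to configuration drift, here is one simple way to spot it, sketched in Python: fingerprint each host's configuration and flag the hosts that differ from the most common fingerprint.  This is our own illustration, not PayPal's approach; the host names and config strings are made up, and treating the majority configuration as the intended one is an assumption that will not hold in every fleet.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def fingerprint(config_text: str) -> str:
    """Return a SHA-256 hex digest of a configuration file's contents."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def find_drift(configs: Dict[str, str]) -> List[str]:
    """Return hosts whose config differs from the most common config.

    Assumption: the configuration shared by the most hosts is the intended one.
    """
    if not configs:
        return []
    prints = {host: fingerprint(text) for host, text in configs.items()}
    baseline, _count = Counter(prints.values()).most_common(1)[0]
    return sorted(host for host, p in prints.items() if p != baseline)


# Hypothetical example: hv3 was down during a deploy and kept an old setting.
configs = {
    "hv1": "workers = 8\ndebug = false\n",
    "hv2": "workers = 8\ndebug = false\n",
    "hv3": "workers = 4\ndebug = false\n",
}
drifted = find_drift(configs)
print(drifted)
```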
 
BTW, what was the best question in this talk?  At the end of the talk, an audience member asked how many dollars PayPal saved by making their key deployment changes.  The answer: A LOT.  Production cloud is managed differently, but in the Cell Service alone they found that 40% of VMs were unused.  Fix that... now that is real savings!
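
How might you find unused VMs in your own cloud?  The talk did not say how PayPal measured "unused", so the sketch below is purely our assumption: it calls a VM idle if its peak CPU utilization over the sampled period stayed under a threshold.  The sample data and the 5% threshold are hypothetical.

```python
from typing import Dict, List


def find_idle_vms(
    cpu_samples: Dict[str, List[float]],
    threshold_pct: float = 5.0,
) -> List[str]:
    """Return VMs whose peak CPU utilization stayed below threshold_pct.

    VMs with no samples are skipped rather than reported as idle.
    """
    return sorted(
        vm
        for vm, samples in cpu_samples.items()
        if samples and max(samples) < threshold_pct
    )


# Hypothetical utilization samples (percent CPU) per VM.
samples = {
    "vm-1": [0.5, 1.2, 0.8],
    "vm-2": [35.0, 60.1, 42.7],
    "vm-3": [2.0, 1.1, 0.3],
    "vm-4": [12.4, 9.8, 15.0],
    "vm-5": [4.0, 55.0, 3.0],
}
idle = find_idle_vms(samples)
idle_share = len(idle) / len(samples)
print(idle, f"{idle_share:.0%}")
```

Using the peak rather than the average keeps a VM that is busy only occasionally, like vm-5 here, off the reclamation list.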
 
That's all for now! You can watch the talk video to get more details or go to the OpenStack Summit Tokyo site to check out other things PayPal is up to. 
While you're there, check out AppFormix's conversation with operators from PayPal and other large enterprise clouds about the Challenges in Planning, Building, and Operating Large Scale Infrastructure.
 

Deploying the World's Largest Private OpenStack Cloud

Speakers: Anand Palanisamy and Kalin Nikolov (PayPal)   

 
 
 

Questions or comments? Comment below or tweet us @AppFormix


About AppFormix

AppFormix is the leading provider of infrastructure performance optimization for cloud-based datacenters. AppFormix increases the ROI of existing enterprise infrastructure through software that enables consistent performance of applications running in Virtual Machines or containers, either on-premise or in the public cloud.
