The PayPal/eBay divorce became official on July 17th of this year. Seven months earlier, PayPal Operators were handed the formidable task of surgically removing the PayPal Enterprise cloud and transplanting it into a brand new rack body. Not only did their great cloud migration succeed, but PayPal’s OpenStack Enterprise Cloud is now the largest OpenStack cloud in the world. Here’s how they did it.
At over 400,000 cores, the PayPal Enterprise Cloud is the largest OpenStack private cloud in the world. After their split from eBay, they managed to build this cloud and still be ready for all your holiday online purchasing needs! The following is the set of priorities and principles that AppFormix distilled from a talk given by Anand Palanisamy and Kalin Nikolov of PayPal at this past OpenStack Summit. It covers their technical challenges, the lessons they learned, the tools they built and used, and what they hope to accomplish in the near future.
So, you are the PayPal Operator and you have less than 6 months to build your new enterprise cloud and less than 10 weeks to separate and migrate it from the source cloud. What are your priorities? What is your advice?
- Cloud Availability. PayPal sought to achieve this by using multi-tenancy and multi-cells. (See Nova Cell use below).
- Abandon old tooling that no longer matches priorities, even if you spent 4 years building it. Making it easy for vendors to build plugins is one of PayPal's most important priorities. Ops from PayPal and eBay spent the past 4 years building an in-house Load Balancer as a Service (LBaaS) tool, but on moving, PayPal switched to Community LBaaS so that vendors will have an easier time building plugins and other features that integrate with PayPal.
- Live cloud updating and live cloud testing. Live updates and live cloud testing are among the most important capabilities for making your cloud robust and quickly scalable. “If you are in a live upgrade mode, you free yourself up to be in the community,” says Anand.
- Real-time capacity scaling. PayPal wanted to achieve a 1-2 day turnaround from the moment a new rack arrives at the data center until it is running apps.
The primary objectives of the new PayPal cloud were availability, scalability, live updating and live cloud testing. Security as a priority also goes without saying for PayPal, the pioneer of web payments. As part of their commitment to scalability and availability, PayPal set a long-term goal of constructing availability zones in a week. At the time of this talk (which was only a couple of months after completion of the new cloud), PayPal had managed to cut availability zone construction from 6 months to 4 weeks.
Cloud Migration: PayPal's Great Data Pilgrimage
Once PayPal had their cloud set up, how did they actually go about migrating to it?
Lessons from Migration
- Infrastructure visibility is the best preventer of configuration drift. Anand says a complete view of cloud deployment is vital, especially in the infrastructure layers. "You deploy code and if your hypervisor is down, once it comes back up, it's going to have a different code base and configuration."
- Don’t wait for all the apps and developers to migrate. “It is NOT true that if you set up a cloud, people will move to it.” Sure, you may lose a couple, but “In the cloud world, no one cares about 2 VMs,” Anand concludes with a chuckle.
- Don't auto-deploy configurations from the get-go. Anand recommended turning off all automatic configuration enabled by tools like Puppet and Salt. "We deploy to 5% - 10%. We make sure everything is good, then we control it out everywhere else." This manual deployment is then orchestrated over time.
- When migrating API services, be CAREFUL! “You can’t just take down APIs because a lot of automation services are built on that.” Anand suggests that you find windows of time between your critical operations cycles to migrate APIs and the services that use them. From experience, he also suggests that you never try to migrate all the services in one window. When migration is part of an upgrade, also make sure that all upgrades are complete and not impacting other layers.
- Keep in mind what you are migrating. Don't do anything risky or try to save time at the expense of taking down your production APIs or VIP instances.
- Remember, "If there is an issue, you can't always just roll the configuration change back." While it may be tedious, Anand suggests you disable auto configuration for migration, especially when migrating APIs.
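To make the "deploy to 5% - 10%, check, then go wider" advice concrete, here is a minimal Python sketch of a staged rollout combined with a simple configuration drift check. It is not PayPal's tooling: the function names, the batch size, and the `apply_change` and `is_healthy` callbacks are all hypothetical stand-ins for whatever your configuration management and monitoring actually provide.

```python
import hashlib
import math
from typing import Callable, Dict, List


def config_fingerprint(config_text: str) -> str:
    """Return a short, stable hash of a rendered configuration file."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()[:12]


def find_drift(expected: str, reported: Dict[str, str]) -> List[str]:
    """List hosts whose reported fingerprint differs from the expected one."""
    return sorted(host for host, fp in reported.items() if fp != expected)


def plan_rollout(hosts: List[str], canary_percent: int = 5,
                 batch_size: int = 50) -> List[List[str]]:
    """Split hosts into one canary batch followed by fixed-size batches."""
    if not hosts:
        return []
    canary_count = max(1, math.ceil(len(hosts) * canary_percent / 100))
    batches = [hosts[:canary_count]]
    rest = hosts[canary_count:]
    for start in range(0, len(rest), batch_size):
        batches.append(rest[start:start + batch_size])
    return batches


def staged_rollout(hosts: List[str],
                   apply_change: Callable[[str], None],
                   is_healthy: Callable[[str], bool],
                   canary_percent: int = 5,
                   batch_size: int = 50) -> List[str]:
    """Apply a change batch by batch and stop after the first unhealthy batch.

    Returns the hosts that were changed, so an operator can inspect them.
    """
    changed: List[str] = []
    for batch in plan_rollout(hosts, canary_percent, batch_size):
        for host in batch:
            apply_change(host)  # hypothetical hook into your config tool
            changed.append(host)
        if not all(is_healthy(host) for host in batch):
            break  # halt here; later batches are never touched
    return changed
```

A hypervisor that was down during a deployment would show up in `find_drift` once it reports a stale fingerprint, which is the situation Anand describes. The rollout deliberately stops rather than rolling back, since a clean rollback is not always possible.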
Open Source Tooling
- Graphite for graphing
- Puppet & Salt for configuration management
- Zabbix for monitoring
- Cobbler for bare-metal provisioning
The in-house hero of the migration was a tool called Flyway. Flyway was PayPal's solution for migrating cloud resources, and it was responsible for migrating thousands of VMs from eBay to PayPal. Flyway can copy:
- Nova VMs
- Users, Tenants, Roles, Keypairs, Quotas
- Images and Snapshots
- Cinder volumes and data
- Trove Database instances
- LBaaS VIP instances and certs
PayPal is going to be open sourcing the awesome Flyway tool on GitHub in the near future. Check back at the PayPal GitHub to see when Flyway is available!
In addition to Flyway, PayPal used the following in-house tools, some of which have not yet made it onto the PayPal GitHub repo.
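The talk did not describe how Flyway decides what to copy first, but any tool that moves these resource types has to respect their dependencies (a VM cannot be recreated before its tenant or image exists). The Python sketch below illustrates one way to derive a safe copy order with a topological sort. The dependency map is an assumption made for illustration only, not Flyway's actual design, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Hypothetical dependency map: each resource type lists the types that
# must already exist in the target cloud before it can be copied.
DEPENDS_ON: Dict[str, List[str]] = {
    "tenants": [],
    "users": ["tenants"],
    "roles": ["users", "tenants"],
    "quotas": ["tenants"],
    "keypairs": ["users"],
    "images": ["tenants"],
    "volumes": ["tenants"],
    "vms": ["images", "keypairs", "quotas", "volumes"],
    "snapshots": ["vms"],
    "databases": ["tenants", "volumes"],
    "lbaas_vips": ["vms"],
}


def migration_order(depends_on: Dict[str, List[str]]) -> List[str]:
    """Return resource types ordered so every dependency comes first.

    Raises graphlib.CycleError if the map contains a circular dependency.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, resource in enumerate(migration_order(DEPENDS_ON), start=1):
        print(f"{step:2d}. copy {resource}")
```

The exact order among independent resource types is not guaranteed; only the "dependencies first" property is.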
- Stackwatch/Stackmetrics for cloud health and metrics
- Reparo for server remediation and provisioning
- Cloudinfo for cloud view and visibility
- CloudMinion for capacity reclamation
Near Future Operations Projects
Speakers: Anand Palanisamy and Kalin Nikolov (PayPal)
Questions or comments? Comment below or tweet us @AppFormix