FAA Cloud Services Playbook

App owners expect to benefit from cloud on their own terms. By assessing tiers of functionality against ROI, the FAA cloud can transform the delivery of IT solutions without alienating stakeholders or breaking the bank.

Today, operations and development do not work well together, resulting in duplication, delays, and cost overruns. To increase the success rate of IT projects, the FAA needs a new approach. We’ve created a playbook of successful practices from the private sector and government that, if followed together, will help modernize apps and shift Dev and Ops towards DevOps.

Introduction to Tri-Modal Approach to Cloud Services


Unlike other cloud service delivery models, Infrastructure as a Service (IaaS) provides a number of alternate onramps to the cloud, ranging from shipping existing on-premises VMs with minimal changes to fully refactoring applications to take advantage of new cloud architectural patterns. As a result, this document proposes and aligns IaaS standards across a tri-modal approach to cloud services.

The first approach, called Low Gear, provides a minimum viable set of standards for legacy applications migrating to the cloud with little to no code or configuration changes. These applications move to the cloud by exporting existing virtual machine images from on-premises ESXi environments as VMDK or OVF files and attaching the files to service orders for CCSD 6.1.1 monthly-priced virtual machines using the process outlined in section C.4.4.2.4 of the FCS contract. Once in the cloud, a systems administrator boots the virtual machine, assigns new IP addresses from the deployment environment’s block of addresses, and installs agents and client software to integrate with the Agility Platform and other monitoring and control components in the cloud. This approach offers the quickest and easiest path to the cloud, but at the expense of carrying over complex images and forgoing the flexibility to pursue more advanced features such as horizontal scalability. Once in the cloud, the application is operationally managed just as it is today, including manual testing and manual deployment through the existing Dev/Test, Pre-Prod, Prod pipeline.

The second approach, called Medium Gear, provides an interim set of standards for legacy applications that benefit from refactoring onto standard configurations of images and automation assets orderable from the deployment environment’s service catalog. These applications move to the cloud by re-implementing the application architecture as an orchestration script that automates deployment of the solution stack. This re-implementation is an iterative development process of creating a new top-level orchestration script using an existing catalog of “stem cell” base images and automation assets. In the case of the FCS cloud environment, this is done by creating a top-level Agility Blueprint through the Agility Designer Tool using Agility blueprints, workloads, and packages. If architecturally appropriate, these Agility Blueprints would implement horizontal scaling through scaling plans that tie to CCSD 6.1.1 monthly-priced virtual machines with the operating system and elasticity supplemental features. Once the application architecture is in place and tested, a systems administrator manually migrates the application logic and data. Once in the cloud, the application logic and data are operationally managed just as they are today, but all updates to the underlying middleware and operating system are now handled as feature requests and updates to the version-controlled blueprint and automation assets that implement the environment the application runs on.

The third approach, called High Gear, provides standards for new and existing applications that seek full adoption of agile methodologies as outlined in the U.S. Digital Services Playbook. This extends the automated deployment infrastructure involved in Medium Gear to a full DevOps toolchain by adding continuous integration components and automated testing. In the case of the FCS cloud environment, this DevOps toolchain shall be managed by the Agility Release Manager. Depending on the complexity of the application’s architecture, a high gear application may elect to bypass the complexity of IaaS and implement a microservices approach by deploying containers or alternate higher-level deployment units using platform as a service (PaaS). Unlike traditional applications, the full operational management of the application is controlled by a strong product manager who controls the entire deployment pipeline and manages work that spans agile development teams and agile infrastructure teams. In order to execute on this new approach, the FAA will likely require a DevOps support team that can perform direct action as a digital services team and provide DevOps Dojo support to teach ADE and AIF personnel the cultural and technical practices needed to establish agile development teams and agile infrastructure teams with a common DevOps culture.

low gear: play 1

Migrate Apps off "Stacked" Infrastructure

The Tri-Modal Approach to Cloud Services describes the migration of low gear applications to the cloud as uploading virtual machine images from a source VMware ESXi environment to a destination FCS cloud domain. This “drag-and-drop” approach is clearly attractive, but it carries an implicit requirement: each virtual machine image migrating to the cloud must effectively serve as the deployment unit of a particular application.

In FAA hosting environments, this requirement is largely unmet, as individual applications run on top of common middleware platforms (such as .NET application servers or Oracle RAC clusters) that host components of multiple applications on top of a single virtual machine. More likely than not, these applications have different functional requirements, security controls, and system owners, leading to different outcomes during a Cloud Suitability Assessment. While it made sense to minimize the number of platforms when conducting manual systems administration in a traditional FAA data center, this model does not work well in cloud environments.

As a result, applications moving to the cloud must be migrated to virtual machines that provide middleware services for only one application. This prevents infrastructure components from crossing system boundaries and enables migration of existing VM images to the cloud. While highly discouraged, it may be permissible to migrate common infrastructure to the cloud if all systems that reside on that shared infrastructure are cloud suitable without code or configuration changes. However, this should be considered a temporary measure and only done alongside a near-term commitment to refactor the various applications onto logically isolated deployment units in the cloud.
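As a rough illustration of the one-application-per-VM requirement, the sketch below scans an inventory for "stacked" virtual machines that host components of more than one application. The inventory format, VM names, and application names are hypothetical; a real assessment would draw on whatever hosting records Operations maintains.

```python
from collections import defaultdict

# Hypothetical inventory rows: (vm_name, application_name).
# All names are made up for illustration.
INVENTORY = [
    ("vm-web-01", "AppA"),
    ("vm-mid-01", "AppA"),
    ("vm-mid-01", "AppB"),
    ("vm-db-01", "AppC"),
    ("vm-db-01", "AppA"),
]


def find_stacked_vms(inventory):
    """Return {vm_name: sorted application names} for VMs hosting more than one application."""
    apps_by_vm = defaultdict(set)
    for vm_name, app_name in inventory:
        apps_by_vm[vm_name].add(app_name)
    return {vm: sorted(apps) for vm, apps in apps_by_vm.items() if len(apps) > 1}


def blocked_applications(inventory):
    """Return applications that cannot be lifted as-is because they share a VM."""
    blocked = set()
    for apps in find_stacked_vms(inventory).values():
        blocked.update(apps)
    return sorted(blocked)
```

Any application returned by `blocked_applications` would need its middleware separated onto dedicated virtual machines before its images could serve as a deployment unit.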

Key Questions

Checklist

  1. Complete a detailed Cloud Suitability Assessment to determine if your application is a candidate for migration
  2. Work with Operations to understand how the application is currently hosted
  3. Work with Legal to understand if your middleware licenses can be migrated to the cloud
  4. Refactor applications within FAA data centers to prevent accidental data leakage of neighbor applications

low gear: play 2

Configure Apps to scale vertically by resizing VMs

Although NIST standards equate elasticity with the horizontal scaling of nodes, it is also possible to vertically scale a virtual machine by shutting down the instance, resizing it to a larger virtual machine size, and restarting it. Modern operating systems and properly configured middleware detect these larger resources and begin using them without additional manual intervention.

This practice is familiar to VMware administrators and likely occurs with FAA applications today, but unlike the FAA’s VMware environment, cloud providers do not support concepts such as “Memory Hot Add” or “CPU Hot Plug,” which allow the hypervisor to dynamically increase virtual machine resources without a reboot. A desire for these sorts of services likely influenced the inclusion of VM_VPUA_ADD and network I/O bandwidth supplemental features in the FCS contract, but the actual implementation of these services would require a reboot.

While this is disruptive and likely requires downtime, this approach gives existing applications whose architecture is not suited to horizontal scaling a quick and easy path to increased capacity. To the greatest degree permitted by the FCS contract and the Agility Platform, low gear applications should use this capability to resize virtual machines in anticipation of known spikes in demand or in response to degraded user experience.
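The stop, resize, and restart sequence can be sketched as follows. This is an in-memory stand-in only: the size ladder and the `VirtualMachine` class are hypothetical and do not represent the Agility Platform or any cloud provider API.

```python
from dataclasses import dataclass

# Hypothetical size ladder, smallest to largest: (name, vCPUs, memory in GiB).
SIZE_LADDER = [("small", 2, 4), ("medium", 4, 8), ("large", 8, 16)]


@dataclass
class VirtualMachine:
    """In-memory stand-in for a VM; a real provider API would replace this."""
    name: str
    size: str
    running: bool = True


def next_size_up(current):
    """Return the next larger size name, or None if already at the top."""
    names = [name for name, _, _ in SIZE_LADDER]
    index = names.index(current)
    return names[index + 1] if index + 1 < len(names) else None


def scale_up(vm):
    """Stop, resize, and restart the VM; return the ordered steps performed."""
    target = next_size_up(vm.size)
    if target is None:
        return []  # Nothing larger to move to.
    steps = []
    vm.running = False  # 1. Shut down (this is the downtime window).
    steps.append("stop")
    vm.size = target  # 2. Resize while stopped.
    steps.append(f"resize:{target}")
    vm.running = True  # 3. Restart so the OS can detect the new resources.
    steps.append("start")
    return steps
```

Because the VM is unavailable between the stop and start steps, a sketch like this would normally be run inside a planned maintenance window ahead of a known spike in demand.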

Key Questions

Checklist

  1. Collect historical performance data for the application if available, including all configuration changes to give the underlying compute infrastructure additional resources, such as “Memory Hot Add” or “CPU Hot Plug.”
  2. Develop Load Tests and UX Load Tests based on the typical usage characteristics of your application.
  3. Analyze the resulting performance metrics and determine if the application is CPU bound, memory bound, I/O bound, or cache bound (http://stackoverflow.com/questions/868568/what-do-the-terms-cpu-bound-and-i-o-bound-mean).
  4. Develop metrics and scaling criteria to address the performance limitations at scale and implement these criteria using an orchestration engine.
  5. Test to ensure scaling action works as expected.

low gear: play 3

Back up automatically via Snapshots

Less-critical applications should consider implementing a “backup and restore” model of disaster recovery by periodically backing up the disk volumes associated with an application’s VMs as snapshots in object storage. This approach can offer basic disaster recovery services for applications with less demanding recovery point objectives (RPOs).

In the FCS environments, the creation of snapshots can be automated by creating an Agility Platform Event Policy with a periodic event time based on the target RPO. Recovery of low gear applications occurs by reverting volumes to an existing snapshot. Applications shifting to medium gear may consider alternative recovery processes that use orchestration scripts to redeploy a recovery environment from scratch, either to preserve the existing environment for problem determination or to gracefully cut over to a fresh environment when the existing infrastructure is facing a “sick but not dead” scenario.

Note: AWS has backend restrictions that limit an account to five simultaneous snapshot actions.
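A minimal sketch of how an RPO could drive a snapshot schedule is shown below. It assumes snapshots are spaced exactly one RPO apart and complete instantly, which is a simplification. The function names are hypothetical, the batch size reuses the limit of five from the note above, and none of this represents the Agility Platform Event Policy syntax.

```python
from datetime import datetime, timedelta

# Taken from the note above; treat as an assumption to verify with the provider.
MAX_CONCURRENT_SNAPSHOTS = 5


def snapshot_schedule(start, rpo, count):
    """Return `count` snapshot times spaced one RPO apart, beginning at `start`."""
    return [start + i * rpo for i in range(count)]


def batch_volumes(volume_ids, batch_size=MAX_CONCURRENT_SNAPSHOTS):
    """Split volumes into batches so no more than `batch_size` snapshots run at once."""
    return [volume_ids[i:i + batch_size] for i in range(0, len(volume_ids), batch_size)]


def snapshots_to_archive(taken_at, now, archive_after):
    """Return snapshot timestamps old enough to move to lower-cost archival storage."""
    return [t for t in taken_at if now - t >= archive_after]
```

With a four-hour RPO, for example, `snapshot_schedule(datetime(2024, 1, 1), timedelta(hours=4), 6)` covers one day, and an application with twelve volumes would be snapshotted in three batches.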

Key Questions

Checklist

  1. Review existing recovery objectives prior to migrating to the cloud.
  2. Once in the cloud, configure an Agility Platform Event Policy with a periodic event time based on the target RPO that creates snapshots of all disk volumes.
  3. Adjust the Event Policy and policy for associated object storage buckets to ensure that proper security measures are in place for this data in flight and at rest, including access controls and encryption.
  4. If permissible, adjust the Agility Platform Event Policy to shift older snapshots to lower-cost archival object storage.
  5. Periodically test recovery actions for this application and adjust policy and procedures based on lessons learned.
  6. If orchestration scripts capable of deploying the application are eventually created, use these orchestration scripts during restore activities.

low gear: play 4

Replicate app to physically separated backup

When disaster recovery requirements exceed those provided by basic backup and recovery of snapshots, the components of a Low Gear application shall be replicated from a hot primary environment to a cold standby environment with mirrored VMs located in a separate availability zone, region, or data center. The FAA envisions outsourcing the planning and execution of this capability to our cloud integrator through the higher-level Disaster Recovery CLINs outlined in section 7.6 of the FCS Cloud Computing Services Description.

If possible, the management platform (Agility Platform or vCenter) shall be configured with automation that performs health checks and powers up and cuts over to the back-up environment when a failure is detected. Alternatively, the management platform shall notify a human operator for manual failover. This solution can be made highly available by having a warm standby running at all times.
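The health-check and cutover logic described above might look like the following sketch. The threshold of three consecutive failed checks is an assumption chosen for illustration, and the action names are hypothetical placeholders rather than Agility Platform or vCenter operations.

```python
def failover_decision(health_checks, failure_threshold=3):
    """Return the index of the check at which failover triggers, or None.

    `health_checks` is a sequence of booleans (True = primary healthy).
    Failover triggers after `failure_threshold` consecutive failures, so a
    single missed check does not cause an unnecessary cutover.
    """
    consecutive = 0
    for index, healthy in enumerate(health_checks):
        consecutive = 0 if healthy else consecutive + 1
        if consecutive >= failure_threshold:
            return index
    return None


def handle_failure(automated):
    """Return the ordered actions for automated failover or operator notification."""
    if automated:
        return ["power_up_standby", "cut_over_traffic"]
    return ["notify_operator"]
```

Requiring several consecutive failures trades a slightly longer detection time for fewer false cutovers; the right threshold depends on the application's recovery time objective.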

Key Questions

Checklist

  1. Review existing recovery objectives and align to either DR-TF1 or DR-TF2.
  2. In conjunction with migration planning, contact the cloud integrator to allow them to begin DR planning.
  3. Order DR-TF1 or DR-TF2 along with other FCS CLINs.
  4. Complete acceptance testing, including an initial DR test.
  5. Work with the cloud integrator to perform an annual DR test.

low gear: play 5

Administer by Command Line and Automate by Scripts

To the greatest degree possible, manual systems administration shall occur using a command line interface such as the Bash shell or PowerShell, and all activities shall be logged and retained.

PowerShell Training

Most Windows admins use the GUI, and if they know a command line, it’s probably the DOS prompt and scripting via *.bat and *.cmd files. Over the past few years, Microsoft has spent considerable time building out a richer, full-featured shell environment called PowerShell, and mastering this environment should be a top priority for all FAA personnel who regularly work with the Microsoft stack.

  1. PowerShell 4 Foundations at CBT Nuggets
  2. Windows PowerShell v2-v3-v4 Ultimate Training at CBT Nuggets
  3. Getting Started with PowerShell 3.0 Jump Start at Microsoft Virtual Academy
  4. Advanced Tools & Scripting with PowerShell 3.0 Jump Start at Microsoft Virtual Academy
  5. Getting Started with PowerShell Desired State Configuration (DSC) at Microsoft Virtual Academy
  6. Advanced PowerShell Desired State Configuration (DSC) and Custom Resources at Microsoft Virtual Academy

Linux/UNIX Bash Training

Most UNIX and Linux systems administrators are already comfortable working in the Bash shell. However, many personnel can benefit from training that goes beyond core shell scripting to more robust scripting languages such as Python. The materials below range from novice to experienced level.

  1. Introduction to Linux at edX
  2. Advanced Bash-Scripting Guide at Linux Documentation Project
  3. Python for Unix and Linux System Administration at Linux Tone

Key Questions

Checklist

  1. Identify any existing scripts and assess for suitability to formalize as an automation asset under version control.
  2. Work to ensure FAA and contractor personnel are trained in CLI and scripting techniques.
  3. In conjunction with migration to the cloud, update Windows Server instances to the latest FAA-approved version of the Windows Management Framework.
  4. Perform administration via command line and judiciously generate and place new automation assets under formal version control.

medium gear: play 6

Separate app tiers into separate VM pools and subnets

Except for the most trivial applications, application tiers shall be refactored into separate pools of one or more virtual machines running in separate subnets. This achieves greater functional decoupling of the application tiers and enables greater flexibility in implementing security controls between those tiers using Network ACLs or other security models provided by the deployment environment.
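To make the tier separation concrete, the sketch below carves one subnet per tier out of an environment's address block and applies a default-deny check between tiers. The address block, tier names, and port numbers are illustrative assumptions, not FAA or FCS values.

```python
import ipaddress

# Hypothetical address block and tier names; adjust to the deployment environment.
ENVIRONMENT_BLOCK = ipaddress.ip_network("10.20.0.0/22")
TIERS = ["web", "app", "data"]

# Documented flows between tiers: (source tier, destination tier, TCP port).
# Port numbers are illustrative assumptions.
ALLOWED_FLOWS = {("web", "app", 8443), ("app", "data", 1521)}


def allocate_tier_subnets(block, tiers, prefix=24):
    """Assign each tier its own subnet carved from the environment block."""
    subnets = block.subnets(new_prefix=prefix)
    return {tier: next(subnets) for tier in tiers}


def tier_of(address, tier_subnets):
    """Return the tier whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for tier, subnet in tier_subnets.items():
        if ip in subnet:
            return tier
    return None


def is_allowed(source, destination, port, tier_subnets):
    """Default-deny check: allow only documented flows between tiers."""
    flow = (tier_of(source, tier_subnets), tier_of(destination, tier_subnets), port)
    return flow in ALLOWED_FLOWS
```

In this model a web-tier host can reach the application tier on its documented port but cannot reach the data tier directly, which is the kind of control that separate subnets make enforceable.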

Key Questions

Checklist

  1. Separate out tiers into separate VMs located in separate subnets.
  2. Tighten down the security controls for each subnet, allowing through only the identified and documented protocols and ports.

medium gear: play 7

Automate deployment of Operating Systems using Stem Cell Images

Avoid golden images and build virtual machines on base Stem Cell Images. These images eschew specialization, middleware, and server roles, and provide only the minimal subset of operating system components and management agents (from either the FAA or external partners such as CSGov) needed for a pre-hardened base operating system targeting a particular deployment environment (on-premises FAA versus cloud).

In the case of FCS, these stem cell images are orderable in the Agility Designer tool as Agility Workloads. In order to ensure commonality between FCS and on-premises FAA environments, the FAA should formally define and maintain standard server images that apply across all environments and work with CSGov to have the images behind the FCS Operating System supplemental features implement these base standards.

Key Questions

Checklist

  1. In preparation for this play, AIF must adopt standards and change control for server images similar to client workstations.
  2. In preparation for this play, AIF must create standard “stem cell images” based on our core server standards and work with our cloud integrator to ensure functional commonality.
  3. Assess golden image state against the new FAA standard “stem cell image” and document state changes for inclusion in automation to be deployed by the automated configuration management system.

medium gear: play 8

Automate deployment of Platforms and Middleware stacks using an Automated Configuration Management tool

In concert with the base Stem Cell images discussed above, automate the installation and configuration of middleware using automation assets developed on a common automated configuration management platform that spans all FAA data centers and cloud environments. These automation assets shall deploy the standard configurations of software stacks, such as those outlined in AIT Business Plan Item 15C.119B1, Standard Configurations and Platforms. Examples of automated configuration management tools include Puppet, Chef, and Ansible.
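Configuration management tools of this kind work from a declared desired state rather than a list of manual steps. The sketch below illustrates that idea in plain Python; it is not the syntax of Puppet, Chef, or Ansible, and the package names and versions in the usage example are made up.

```python
def plan_changes(current, desired):
    """Compare current and desired package versions and return ordered actions.

    Both arguments map package name -> version. Only packages named in
    `desired` are managed; anything else on the system is left alone.
    """
    actions = []
    for package, version in sorted(desired.items()):
        installed = current.get(package)
        if installed is None:
            actions.append(("install", package, version))
        elif installed != version:
            actions.append(("upgrade", package, version))
    return actions


def apply_changes(current, actions):
    """Return a new state with the actions applied (in-memory stand-in for a real tool)."""
    new_state = dict(current)
    for _, package, version in actions:
        new_state[package] = version
    return new_state
```

The useful property is idempotence: once a system has converged, planning again against the same desired state yields no further actions, so the same automation asset can be run repeatedly across many systems.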

Key Questions

Checklist

  1. Prior to this play, AIT must create a standard for automated configuration management and work with our cloud integrator to use this solution.
  2. Use the gap analysis results of comparing current golden image and new FAA-wide “stem cell” image standard to write a specification for the desired state needed to achieve functional parity with your pre-existing golden image.
  3. Identify additional manual deployment steps and add to specification.
  4. Implement automation to achieve this desired state using the standard automated configuration management tool.
  5. Test to ensure automation works properly.
  6. Maintain automation under version control and manage as a software asset.

medium gear: play 9

Automate deployment of Applications using an Orchestration Engine.

A medium gear application is most likely a complex multi-tiered application with isolated tiers implemented on pools of virtual machines. In the previous two plays, the deployment of operating systems, platforms, and middleware was automated through the combination of “stem cell” images and automated configuration management assets. However, in order to fully automate the deployment of a complex multi-tiered application, an orchestration engine must deploy these components according to an application architecture. In the case of Agility Platform, this is performed by creating a top-level Agility Blueprint through the Agility Designer Tool using embedded blueprints, workloads, and packages. If architecturally appropriate, these Agility Blueprints would implement horizontal scaling through scaling plans that tie to CCSD 6.1.1 monthly-priced virtual machines with the operating system and elasticity supplemental features.
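One job of an orchestration engine is to deploy tiers in an order that respects their dependencies. The sketch below shows that idea using Python's standard `graphlib` module; the architecture dictionary and tier names are hypothetical, and this is not how an Agility Blueprint is expressed.

```python
from graphlib import TopologicalSorter

# Hypothetical application architecture: each tier lists the tiers it depends on.
ARCHITECTURE = {
    "database": [],
    "app": ["database"],
    "web": ["app"],
    "load_balancer": ["web"],
}


def deployment_order(architecture):
    """Return tiers in an order where every dependency is deployed first."""
    return list(TopologicalSorter(architecture).static_order())


def deployment_plan(architecture, instance_counts):
    """Pair each tier, in deployment order, with its instance count (default 1)."""
    return [(tier, instance_counts.get(tier, 1)) for tier in deployment_order(architecture)]
```

If the declared architecture contains a circular dependency, `static_order` raises `graphlib.CycleError`, which surfaces a design problem before anything is deployed.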

Key Questions

Checklist

  1. Create a master orchestration template for the application that uses images and automation assets to implement the application’s architecture.
  2. Place this master orchestration template under version control and work with configuration management and security to make this the primary unit of analysis for their respective processes.

medium gear: play 10

Develop Images, Automation, and Orchestration using agile techniques and SCM tools.

In an Infrastructure as Code world, traditional operations teams need to form agile infrastructure teams capable of building and managing a service catalog of automation assets using the same agile techniques and professional software configuration management (SCM) tools used by application developers. In a DevOps setting, these agile infrastructure teams work closely with agile application / digital services teams to smooth out the traditional division between development and operations. The assets produced by these agile infrastructure teams populate a service catalog and are consumed by application teams on an as-needed basis. The application teams are both consumers of these infrastructure automation assets and stakeholders who make feature and function requests and potentially fork automation code and initiate pull requests. The agile infrastructure team works with configuration management and security to get each asset pre-approved and pre-authorized in order to minimize the authorization footprint for individual FAA applications.

Key Questions

Checklist

  1. AIT must form agile infrastructure teams focused on delivering automation assets
  2. Leverage existing SCM tools used for applications for the creation of automation assets
  3. Create and execute a development lifecycle for automation assets

medium gear: play 11

Deploy Apps through dev, test, and prod using a master orchestration script

Applications combine stem cell images, automation assets, and nested orchestration scripts into a single master orchestration script that acts as the major deployment unit for applications as they move through dev, test, and prod throughout the software development lifecycle. These orchestration scripts shall programmatically define scale units and scaling criteria for deployment environments that provide elasticity.

The master orchestration script acts as the deployment unit across the DevOps toolchain, and in the case of the FCS cloud, this process of deploying applications across this toolchain should be managed by the Agility Release Manager.
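The idea of one master orchestration script moving unchanged through dev, test, and prod, with only scaling criteria varying by environment, can be sketched as below. The template fields, image name, node counts, and CPU thresholds are all hypothetical values chosen for illustration, not Agility Platform settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingCriteria:
    """Programmatic definition of a scale unit's bounds and trigger."""
    min_nodes: int
    max_nodes: int
    scale_out_cpu_percent: int  # Add a node when average CPU exceeds this value.


# One shared template; only the scaling criteria differ per environment.
MASTER_TEMPLATE = {"tiers": ["web", "app", "data"], "base_image": "stem-cell-linux"}

ENVIRONMENTS = {
    "dev": ScalingCriteria(1, 1, 90),
    "test": ScalingCriteria(1, 2, 80),
    "prod": ScalingCriteria(2, 6, 70),
}


def render(environment):
    """Combine the shared template with environment-specific scaling criteria."""
    criteria = ENVIRONMENTS[environment]
    return {**MASTER_TEMPLATE, "environment": environment, "scaling": criteria}


def desired_nodes(current_nodes, average_cpu_percent, criteria):
    """Return the node count after applying the scale-out rule, within bounds."""
    if average_cpu_percent > criteria.scale_out_cpu_percent:
        current_nodes += 1
    return max(criteria.min_nodes, min(criteria.max_nodes, current_nodes))
```

Keeping the tiers and base image identical across environments means that what is tested in dev and test is structurally the same deployment unit that reaches prod.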

Checklist

  1. Develop a standard FAA DevOps Toolchain.
  2. Configure this FAA DevOps Toolchain in the Agility Release Manager.

medium gear: play 12

Push software updates through an automated configuration management system

Updates to all software installed and configured by automation assets shall not occur via manual systems administration, but shall occur across the FAA’s portfolio of applications by updating the automation asset and having an automated configuration management system push out the update. In situations where manual systems administration is required, this shall occur using a command line interface such as the Bash shell or PowerShell, and all activities shall be logged and retained.
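A staged push with rollback could be sketched as follows. The `apply_update` callback is a hypothetical stand-in for whatever the configuration management tool does on each system, and the sample size of two is an arbitrary assumption.

```python
def roll_out(systems, desired_version, apply_update, sample_size=2):
    """Push an update to a sample first, then to the rest; roll back on failure.

    `systems` maps system name -> current version and is modified in place.
    `apply_update(name, version)` returns True on success (hypothetical callback
    standing in for the configuration management tool).
    Returns "completed" or "rolled_back".
    """
    previous = dict(systems)  # Remember versions so the change can be undone.
    names = sorted(systems)
    stages = [names[:sample_size], names[sample_size:]]
    for stage in stages:
        for name in stage:
            if apply_update(name, desired_version):
                systems[name] = desired_version
            else:
                systems.update(previous)  # Restore every system to its prior version.
                return "rolled_back"
    return "completed"
```

Updating a representative sample before the remainder limits how many systems are touched if the update turns out to be faulty.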

Checklist

  1. Identify updates or hotfixes that span multiple systems.
  2. Develop an automated configuration management unit of work that can be pushed out to multiple systems, including a plan to roll back the change if needed.
  3. Test the unit of work against a representative sample of systems.
  4. Work with security and configuration change management to get blanket approval to push this change across multiple systems.
  5. Validate successful deployment of change and close out change with security and configuration change management teams.

high gear: play 13

Reuse U.S. Digital Services Assets and Default to Open

Members of the U.S. Digital Services community have spent considerable time developing a general framework for designing cloud-native digital services, and they strongly encourage reuse of their assets. In fact, this site uses one of their templates. In accordance with OMB wishes, high gear applications shall be designed using the U.S. Digital Services Playbook and implemented using the U.S. Web Design Standards as a starting point.

Checklist

  1. Search USDS and 18F assets when investigating the development of new capabilities.
  2. Consider using 18F RFP / contract ghostwriting services when pursuing new capabilities.
  3. With the exception of privileged or protected information, make FAA assets openly available and share with DOT and other agencies.

high gear: play 14

Design for cloud by default... even if you can't deploy there yet.

Cloud-native apps use highly distributed architectures that decompose functionality into load-balanced pools of loosely coupled stateless nodes that communicate over well-defined standard interfaces, such as message queues. This architecture enables horizontal scaling and designs for failure using automated health checks that look for node failure and automatically restore service.
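The pattern of a load-balanced pool that heals itself can be illustrated with a small in-memory sketch. The `NodePool` class and node naming are hypothetical; a real deployment would rely on the load balancer and health checks provided by the deployment environment.

```python
class NodePool:
    """Load-balanced pool of interchangeable stateless nodes."""

    def __init__(self, size, health_check):
        self._health_check = health_check  # Callable: node name -> True if healthy.
        self._target_size = size
        self._next_id = 0
        self._cursor = 0
        self.nodes = []
        self._replenish()

    def _replenish(self):
        """Start new nodes until the pool is back at its target size."""
        while len(self.nodes) < self._target_size:
            self.nodes.append(f"node-{self._next_id}")
            self._next_id += 1

    def heal(self):
        """Drop nodes that fail the health check and start replacements."""
        self.nodes = [node for node in self.nodes if self._health_check(node)]
        self._replenish()

    def route(self):
        """Pick the next node round-robin; any node can serve any request."""
        node = self.nodes[self._cursor % len(self.nodes)]
        self._cursor += 1
        return node
```

Because the nodes hold no state, a failed node can simply be discarded and replaced, and scaling horizontally amounts to raising the target size.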

Even in situations where latency or coupling with legacy systems of record might seem to make a particular app unsuitable for cloud in the near term, development should target these architectural principles anyway. Federal cloud is rapidly evolving to include support for FISMA high and DHS-compliant TIC cloud services, so there is a high likelihood that new applications that are not initially cloud suitable will become suitable during their lifespan. Until that becomes reality, the FAA operations team should partner with our vendors to judiciously bring cloud innovations back into FAA data centers. In a world where Node.js and MongoDB apps run on Ubuntu on mainframe and VMware offers drop-in modules for OpenStack and Docker containers, there is little excuse not to design with cloud in mind.

Key Questions

Checklist

  1. Complete a detailed cloud suitability assessment
  2. Identify major cloud inhibitors
  3. Work with the FCS program office to compensate for those inhibitors
  4. Work with your current hosting provider to provide API-driven provisioning, container hosting, and other core cloud features

high gear: play 15

Maximize use of vendor-agnostic services capable of deployment to alternative CSPs

The FAA Cloud offers a hybrid cloud brokerage model that allows applications to migrate between AWS, Azure, and private clouds via the Agility Platform. That may not seem like a killer feature to developers, but it most certainly does to operations, enterprise architects, and the IT executive team. As such, new high gear applications should maximize use of the vendor-neutral subset of cloud capabilities offered by the FCS contract via the Agility Store cloud service catalog. All applications seeking specialty high-value proprietary cloud services should contact the FCS program office early in their requirements gathering process to allow for cost-benefit analysis weighing functionality against lock-in, portfolio management, and integration work.

For example, a system owner may consider the Watson Visual Recognition service to be a perfect fit for a new Google Glass safety inspector app. The FCS program office may not be able to provide this particular service, but it will work with our strategic cloud partner to bring an architectural alternative into the cloud service catalog to fulfill your requirements, such as Google’s open-source TensorFlow technology provisioned on top of virtual machine contract line items. Alternatively, the FCS program management team may approve the use of a specialty proprietary cloud service contingent on the implementation of compensating controls, such as the abstraction of APIs into modularized connectors that enable development of a drop-in replacement for the service should the need arise.
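The connector abstraction mentioned as a compensating control might be structured like the sketch below. Both connectors are stubs that call no real service, and the class and function names are hypothetical; the point is only that application logic depends on the interface rather than on a provider.

```python
from typing import Protocol


class ImageClassifier(Protocol):
    """Interface the application codes against, independent of any provider."""

    def classify(self, image_bytes: bytes) -> str: ...


class ProprietaryServiceConnector:
    """Connector for a proprietary cloud service (stubbed; no real API is called)."""

    def classify(self, image_bytes: bytes) -> str:
        return "proprietary:" + ("empty" if not image_bytes else "object")


class SelfHostedModelConnector:
    """Drop-in replacement backed by a self-hosted model (also stubbed)."""

    def classify(self, image_bytes: bytes) -> str:
        return "self-hosted:" + ("empty" if not image_bytes else "object")


def inspect_image(classifier: ImageClassifier, image_bytes: bytes) -> str:
    """Application logic depends only on the ImageClassifier interface."""
    return classifier.classify(image_bytes).split(":", 1)[1]
```

Swapping `ProprietaryServiceConnector` for `SelfHostedModelConnector` requires no change to `inspect_image`, which is what keeps the lock-in risk contained to a single module.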

Key Questions

Checklist

  1. Map functional requirements against services in the Agility Store
  2. Work with the FCS program office to perform functional gap analysis and identify new services worth adding to the FCS contract.