App owners expect to benefit from cloud on their own terms. By assessing tiers of functionality against ROI, the FAA cloud can transform the delivery of IT solutions without alienating stakeholders or breaking the bank.
Today, operations and development do not work well together, resulting in duplication, delays, and cost overruns. To increase the success rate of IT projects, the FAA needs a new approach. We’ve created a playbook of successful practices from the private sector and government that, if followed together, will help modernize apps and shift Dev and Ops towards DevOps.
Unlike other cloud service delivery models, Infrastructure as a Service provides a number of alternate onramps to the cloud that span from minimizing changes and shipping existing on-premise VMs to full refactoring of applications to take advantage of new cloud architectural patterns. As a result, this document proposes and aligns IaaS standards across a tri-modal approach to cloud services.
The first approach, called Low Gear, provides a minimum viable set of standards for legacy applications migrating to the cloud with little to no code or configuration changes. These applications are moving to the cloud by simply exporting existing virtual machine images from on-premise ESXi environments as VMDK or OVF files and attaching the files to service orders for CCSD 6.1.1 monthly-priced virtual machines using the process outlined by section C.22.214.171.124 of the FCS contract. Once in the cloud, a systems administrator boots the virtual machine, assigns new IP addresses from the deployment environment’s block of addresses, and installs agents and client software to integrate with the Agility Platform and other monitoring and control components in the cloud. This approach offers the quickest and easiest path to migrating systems to the cloud, but at the expense of carrying over complex images and forgoing the flexibility to pursue more advanced features such as horizontal scalability. Once in the cloud, the application is operationally managed just as it is today, including manual testing and manual deployment through the existing Dev/Test, Pre-Prod, Prod pipeline.
The second approach, called Medium Gear, provides an interim set of standards for legacy applications that benefit from refactoring onto standard configurations of images and automation assets orderable from the deployment environment’s service catalog. These applications are moving to the cloud through the process of re-implementing the application architecture as an orchestration script that provides automated deployment of the solution stack. This re-implementation is an iterative development process of creating a new top-level orchestration script using an existing catalog of “stem cell” base images and automation assets. In the case of the FCS cloud environment, this is done by creating a top-level Agility Blueprint through the Agility Designer Tool using Agility blueprints, workloads, and packages. If architecturally appropriate, these Agility Blueprints would implement horizontal scaling through scaling plans that tie to CCSD 6.1.1 monthly-priced virtual machines with the operating system and elasticity supplemental features. Once the application architecture is in place and tested, a systems administrator manually migrates over the application logic and data. Once in the cloud, the application logic and data are operationally managed just as they are today, but all updates to the underlying middleware and operating system are now handled as feature requests and updates to the version-controlled blueprint and automation assets that implement the environment that the application runs on.
The third approach, called High Gear, provides standards for new and existing applications that seek full adoption of agile methodologies as outlined in the U.S. Digital Services Playbook. This extends the automated deployment infrastructure involved in Medium Gear to a full DevOps toolchain by adding continuous integration components and automated testing. In the case of the FCS cloud environment, this DevOps toolchain shall be managed by the Agility Release Manager. Depending on the complexity of the application’s architecture, a high gear application may elect to bypass the complexity of IaaS and implement a microservices approach by deploying containers or alternate higher-level deployment units using platform as a service (PaaS). Unlike traditional applications, the full operational management of the application is controlled by a strong product manager who controls the entire deployment pipeline and manages work that spans agile development teams and agile infrastructure teams. In order to execute on this new approach, the FAA shall likely require a DevOps support team that can perform direct action as a digital services team and provide DevOps Dojo support to teach ADE and AIF personnel the cultural and technical practices needed to establish agile development teams and agile infrastructure teams with a common DevOps culture.
The Tri-Modal Approach to Cloud Services described the migration of low gear applications to the cloud as uploading virtual machine images from a source VMware ESXi environment to a destination FCS cloud domain. This “drag-and-drop” approach is clearly attractive, but it carries an implicit requirement: each virtual machine image migrating to the cloud must effectively serve as a deployment unit of a particular application.
In FAA hosting environments, the requirement is largely unmet, as individual applications run on top of common middleware platforms (such as .NET application servers or Oracle RAC clusters) that host components of multiple applications on top of a single virtual machine. More likely than not, these applications have different functional requirements, security controls, and system owners, leading to different outcomes during a Cloud Suitability Assessment. While it made rational sense to minimize the number of platforms when conducting manual systems administration in a traditional FAA data center, this model does not work well in cloud environments.
As a result, applications moving to the cloud must be migrated to virtual machines that provide middleware services for only one application. This prevents infrastructure components from crossing system boundaries and enables migration of existing VM images to the cloud. While highly discouraged, it may be permissible to migrate common infrastructure to the cloud if all systems that reside on that shared infrastructure are cloud suitable without code or configuration changes. However, this should be considered a temporary measure and only done alongside a near-term commitment to refactor the various applications onto logically isolated deployment units in the cloud.
Although NIST standards equate elasticity to the horizontal scaling of nodes, it is also possible to vertically scale virtual machines by shutting down a virtual machine instance, resizing the virtual machine instance using a larger virtual machine size, and restarting the virtual machine. Modern operating systems and properly-configured middleware detect these larger resources and begin using them without additional manual intervention.
This practice is familiar to VMware administrators and likely occurs with FAA applications today, but unlike the FAA’s VMware environment, cloud providers do not support concepts such as “Memory Hot Add” or “CPU Hot Plug”, which allow the hypervisor to dynamically increase virtual machine resources without having to reboot. A desire for these sorts of services likely influences the inclusion of VM_VPUA_ADD and network I/O bandwidth supplemental features in the FCS contract, but the actual implementation of these services would require a reboot.
While this is disruptive and likely requires downtime, this approach gives existing applications whose architecture is not suited to horizontal scaling a quick and easy path to increased capacity. To the greatest degree permitted by the FCS contract and the Agility Platform, low gear applications should use this capability to resize applications in anticipation of known spikes in demand or in response to degraded user experience.
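The shut down, resize, restart sequence described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Agility Platform or any provider's API: the `VirtualMachine` class, its size names, and the `resize_vm` helper are hypothetical stand-ins for whatever calls the deployment environment actually exposes.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualMachine:
    """Hypothetical in-memory stand-in for a cloud VM (not a real provider API)."""
    name: str
    size: str
    running: bool = True
    history: list = field(default_factory=list)

    def stop(self):
        self.running = False
        self.history.append("stop")

    def start(self):
        self.running = True
        self.history.append("start")

    def set_size(self, new_size):
        # Assumption taken from the text: no hot add, so a resize needs a stopped VM.
        if self.running:
            raise RuntimeError("VM must be stopped before it can be resized")
        self.size = new_size
        self.history.append(f"resize:{new_size}")


def resize_vm(vm, new_size):
    """Vertically scale a VM: shut down, change size, restart."""
    was_running = vm.running
    if was_running:
        vm.stop()
    try:
        vm.set_size(new_size)
    finally:
        # Restart even if the resize failed, so the outage is not prolonged.
        if was_running:
            vm.start()
    return vm


vm = resize_vm(VirtualMachine(name="legacy-app-01", size="medium"), "large")
print(vm.size, vm.running, vm.history)
```

The `history` list makes the downtime window explicit: everything between the recorded stop and start is time the application is unavailable, which is why the resize is best scheduled ahead of a known spike in demand.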
Less-critical applications should consider implementing a “backup and restore” model of disaster recovery by periodically backing up the disk volumes associated with an application’s VMs as snapshots in object storage. This approach potentially offers basic disaster recovery services for applications with a less stringent recovery point objective (RPO).
In the FCS environments, the creation of snapshots can be automated by creating an Agility Platform Event Policy with a periodic event time based on the target RPO. Recovery of low gear applications occurs by reverting volumes to an existing snapshot, but applications shifting to medium gear may consider alternative recovery processes that use orchestration scripts to redeploy a recovery environment from scratch, either to maintain the existing environment for problem determination or to gracefully cut over to a fresh environment when the existing infrastructure is facing a “sick but not dead” scenario.
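The relationship between the target RPO, the snapshot period, and the choice of restore point can be sketched in a few lines. This is an illustration only and does not use the Agility Platform Event Policy interface: the function names and the `safety_factor` parameter are assumptions introduced for the example.

```python
from datetime import datetime, timedelta


def snapshot_interval(rpo, safety_factor=0.5):
    """Pick a snapshot period no longer than the RPO (safety_factor is an assumption)."""
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    return rpo * safety_factor


def meets_rpo(snapshot_times, now, rpo):
    """True if the newest snapshot is recent enough to satisfy the RPO."""
    if not snapshot_times:
        return False
    return now - max(snapshot_times) <= rpo


def pick_restore_point(snapshot_times, failure_time):
    """Return the newest snapshot taken at or before the failure, or None."""
    candidates = [t for t in snapshot_times if t <= failure_time]
    return max(candidates) if candidates else None


rpo = timedelta(hours=4)
now = datetime(2016, 1, 1, 12, 0)
snapshots = [now - timedelta(hours=h) for h in (10, 6, 2)]

print(snapshot_interval(rpo))
print(meets_rpo(snapshots, now, rpo))
print(pick_restore_point(snapshots, now - timedelta(hours=3)))
```

Scheduling snapshots more often than the RPO strictly requires leaves headroom for a snapshot that fails or runs long.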
Note: AWS has backend restrictions that limit an account to five simultaneous snapshot actions.
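Taking the five-action limit in the note above as given, an automation script can cap its own concurrency rather than rely on the provider to reject excess requests. The sketch below uses a thread pool with five workers; `fake_snapshot` is a hypothetical stand-in for the real snapshot request and only records how many calls overlap.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_SNAPSHOTS = 5  # limit stated in the note above


def snapshot_all(volume_ids, take_snapshot, limit=MAX_CONCURRENT_SNAPSHOTS):
    """Run take_snapshot for each volume, never more than `limit` at once."""
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return list(pool.map(take_snapshot, volume_ids))


# Counters used only to observe concurrency in this demonstration.
lock = threading.Lock()
stats = {"in_flight": 0, "peak": 0}


def fake_snapshot(volume_id):
    """Hypothetical snapshot call; a real one would invoke the provider."""
    with lock:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
    time.sleep(0.01)  # stands in for the time the snapshot request takes
    with lock:
        stats["in_flight"] -= 1
    return f"snap-of-{volume_id}"


results = snapshot_all([f"vol-{i}" for i in range(12)], fake_snapshot)
print(len(results), "snapshots taken; peak concurrency", stats["peak"])
```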
When disaster recovery requirements exceed those provided by basic backup and recovery of snapshots, the components of a Low Gear application shall be replicated from a hot primary environment to a cold standby environment with mirrored VMs located in a separate availability zone, region, or data center. The FAA envisions outsourcing the planning and execution of this capability to our cloud integrator through the higher-level Disaster Recovery CLINs outlined in section 7.6 of the FCS Cloud Computing Services Description.
If possible, the management platform (Agility Platform or vCenter) shall be configured with automation that performs health checks and, when a failure is detected, powers up and cuts over to the backup environment. Alternatively, the management platform shall notify a human operator for manual failover. This solution can be made highly available by having a warm standby running at all times.
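The decision logic behind automated versus manual failover can be sketched as a simple polling loop. This is not Agility Platform or vCenter configuration: the `monitor` function, its callbacks, and the three-failure threshold are assumptions chosen for illustration, and a production loop would also wait between probes.

```python
def monitor(check_health, fail_over, notify_operator,
            max_checks, failure_threshold=3, automatic=True):
    """Poll the primary; after N consecutive failures, fail over or page a human.

    Returns "failed_over", "operator_notified", or "healthy".
    """
    consecutive_failures = 0
    for _ in range(max_checks):
        if check_health():
            consecutive_failures = 0   # a good probe resets the count
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            if automatic:
                fail_over()            # power up standby and redirect traffic
                return "failed_over"
            notify_operator()          # manual failover path
            return "operator_notified"
    return "healthy"


# Hypothetical probe results: one blip, a recovery, then a sustained outage.
probe_results = iter([True, False, True, False, False, False, True])
events = []

outcome = monitor(
    check_health=lambda: next(probe_results),
    fail_over=lambda: events.append("standby powered up, traffic cut over"),
    notify_operator=lambda: events.append("operator paged"),
    max_checks=7,
)
print(outcome, events)
```

Requiring several consecutive failures keeps a single dropped probe from triggering a cutover to a cold standby, which is itself a disruptive event.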
To the greatest degree possible, manual systems administration shall occur using a command line interface such as the Bash shell or PowerShell, and all activities shall be logged and retained.
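One way to picture the logging requirement is a thin wrapper that records every command before and after it runs. The sketch below is an assumption-laden illustration, not an FAA tool: the `run_logged` helper, the `admin.audit` logger name, and the operator name are hypothetical, and a real deployment would ship the records to a retained log store rather than to the console.

```python
import logging
import subprocess
import sys

# Console logging keeps the example self-contained; retention is out of scope here.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
audit_log = logging.getLogger("admin.audit")


def run_logged(argv, operator):
    """Run a command without a shell and record who ran what, and the result."""
    audit_log.info("operator=%s command=%r", operator, argv)
    result = subprocess.run(argv, capture_output=True, text=True)
    audit_log.info("operator=%s exit_code=%d", operator, result.returncode)
    return result


# Portable demo command: ask the current Python interpreter to print a line.
result = run_logged([sys.executable, "-c", "print('patched')"], operator="jdoe")
print(result.stdout.strip())
```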
Most Windows admins use the GUI, and if they know a command line, it’s probably the DOS Prompt and scripting via *.bat and *.cmd scripts. Over the past few years, Microsoft has spent considerable time building out a richer, full-featured shell environment called PowerShell, and mastering this environment should be a top priority for all FAA personnel who regularly work with the Microsoft stack.
Most UNIX and Linux systems administrators are culturally experienced with working in the Bash shell. However, many personnel can benefit from training beyond core shell scripting to the use of more robust scripting languages such as Python. Here is a range of materials to cover novice to experienced administrators.
Excepting the most trivial of applications, application tiers shall be refactored into separate pools of one or more virtual machines running in separate subnets. This achieves greater functional decoupling of the application tiers and enables greater flexibility in implementing security controls between those tiers using Network ACLs or other security models provided by the deployment environment.
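The effect of placing each tier in its own subnet can be sketched with Python's standard `ipaddress` module. The address block, tier names, and port numbers below are illustrative assumptions, and the allow-list is a simplified model of a network ACL rather than any deployment environment's actual rule syntax.

```python
import ipaddress

# Hypothetical address block for one deployment environment.
environment_block = ipaddress.ip_network("10.20.0.0/22")
tiers = ["web", "app", "db"]

# Carve one /24 per tier out of the environment's block.
subnets = dict(zip(tiers, environment_block.subnets(new_prefix=24)))

# Allow-list of tier-to-tier flows (ports are illustrative); all else is denied.
allowed_flows = {
    ("web", "app"): 8443,
    ("app", "db"): 1521,
}


def tier_of(address):
    """Return the tier whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for tier, subnet in subnets.items():
        if ip in subnet:
            return tier
    return None


def is_allowed(src, dst, port):
    """Evaluate the ACL: permit only explicitly listed tier-to-tier flows."""
    return allowed_flows.get((tier_of(src), tier_of(dst))) == port


print(subnets)
print(is_allowed("10.20.0.5", "10.20.1.7", 8443))   # web -> app
print(is_allowed("10.20.0.5", "10.20.2.9", 1521))   # web -> db, not listed
```

Because the rules are written against tiers rather than individual hosts, nodes can be added to or removed from a pool without touching the security controls.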
Avoid golden images and implement virtual machines on base Stem Cell Images, which eschew specialization, middleware, or server roles and only provide the absolute minimal subset of operating system components and management agents (either FAA or external partners such as CSGov) to provide a pre-hardened base operating system targeting a particular deployment environment (on-premise FAA versus cloud).
In the case of FCS, these stem cell images are orderable in the Agility Designer tool as Agility Workloads. In order to ensure commonality between FCS and on-premise FAA environments, the FAA should formally define and maintain standard server images that apply across all environments and work with CSGov to have the images behind the FCS Operating System supplemental features implement these base standards.
In concert with the base Stem Cell images discussed above, automate the installation and configuration of middleware using automation assets developed on a common automated configuration management platform that spans all FAA data centers and cloud environments. These automation assets shall deploy the standard configurations of software stacks, such as those outlined in AIT Business Plan Item 15C.119B1, Standard Configurations and Platforms. Examples of automated configuration management tools include Puppet, Chef, and Ansible.
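Configuration management tools of this kind generally work by comparing a declared desired state with the actual state of a server and applying only the difference. The sketch below illustrates that idea in plain Python; it is not Puppet, Chef, or Ansible syntax, and the package names, versions, and helper functions are hypothetical.

```python
def plan_changes(desired, actual):
    """Compare desired package versions with what is installed; list the actions."""
    actions = []
    for package, version in sorted(desired.items()):
        if actual.get(package) != version:
            actions.append(("install", package, version))
    return actions


def converge(desired, actual):
    """Apply the plan to the (simulated) server and return the actions taken."""
    actions = plan_changes(desired, actual)
    for _, package, version in actions:
        actual[package] = version
    return actions


# Hypothetical standard configuration for a web tier; versions are illustrative.
standard_web_stack = {"httpd": "2.4", "openssl": "1.0.2"}
server = {"openssl": "1.0.1"}          # state of a freshly booted base image

first_run = converge(standard_web_stack, server)
second_run = converge(standard_web_stack, server)
print(first_run)
print(second_run)
```

The second run produces no actions because the server already matches the declared state. That idempotence is what allows the same automation asset to be applied repeatedly across many servers without drift.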
A medium gear application most likely is a complex multi-tiered application with isolated tiers implemented on pools of virtual machines. In the previous two plays, the deployment of operating systems, platforms, and middleware was automated through the combination of “stem cell” images and automated configuration management assets. However, in order to fully automate the deployment of a complex multi-tiered application, an orchestration engine must deploy these components according to an application architecture. In the case of Agility Platform, this is performed by creating a top-level Agility Blueprint through the Agility Designer Tool using embedded blueprints, workloads, and packages. If architecturally appropriate, these Agility Blueprints would implement horizontal scaling through scaling plans that tie to CCSD 6.1.1 monthly-priced virtual machines with the operating system and elasticity supplemental features.
In an Infrastructure as Code world, traditional operations teams need to form agile infrastructure teams capable of building and managing a service catalog of automation assets using the same agile techniques and professional software configuration management (SCM) tools used by application developers. In a DevOps setting, these agile infrastructure teams work closely with agile application / digital services teams to smooth out the traditional division between development and operations. The assets produced by these agile infrastructure teams populate a service catalog and are consumed by application teams on an as-needed basis. The application teams are consumers of these infrastructure automation assets and stakeholders that make feature and function requests and potentially fork automation code and initiate pull requests. The agile infrastructure team works with configuration management and security to get each asset pre-approved and pre-authorized in order to minimize the authorization footprint for individual FAA applications.
Applications combine stem cell images, automation assets, and nested orchestration scripts into a single master orchestration script that acts as the major deployment unit for applications as they move through dev, test, and prod throughout the software development lifecycle. These orchestration scripts shall programmatically define scale units and scaling criteria for deployment environments that provide elasticity.
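What it means for an orchestration script to programmatically define scale units and scaling criteria can be sketched as a small declarative model. This is not an Agility Blueprint or scaling plan: the `Tier` class, the image and asset names, and the CPU thresholds are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    """One scale unit: a pool of identical VMs built from a base image."""
    name: str
    base_image: str
    automation_assets: tuple
    min_nodes: int
    max_nodes: int
    scale_out_cpu: float   # add a node above this average CPU fraction
    scale_in_cpu: float    # remove a node below this average CPU fraction

    def desired_nodes(self, current_nodes, avg_cpu):
        """Apply the scaling criteria, clamped to the tier's node limits."""
        if avg_cpu > self.scale_out_cpu:
            current_nodes += 1
        elif avg_cpu < self.scale_in_cpu:
            current_nodes -= 1
        return max(self.min_nodes, min(self.max_nodes, current_nodes))


# Hypothetical master definition; image and asset names are illustrative.
application = [
    Tier("web", "stem-cell-linux", ("httpd",), 2, 6, 0.70, 0.25),
    Tier("app", "stem-cell-linux", ("jvm", "app-server"), 2, 4, 0.75, 0.30),
    Tier("db", "stem-cell-linux", ("database",), 1, 1, 1.00, 0.00),
]

web = application[0]
print(web.desired_nodes(current_nodes=2, avg_cpu=0.85))   # scales out
print(web.desired_nodes(current_nodes=6, avg_cpu=0.85))   # already at maximum
print(web.desired_nodes(current_nodes=2, avg_cpu=0.10))   # already at minimum
```

Keeping the limits and thresholds in the same version-controlled definition as the tiers themselves means the identical description can be deployed to dev, test, and prod, with only the environment differing.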
The master orchestration script acts as the deployment unit across the DevOps toolchain, and in the case of the FCS cloud, this process of deploying applications across this toolchain should be managed by the Agility Release Manager.
Updates to all software installed and configured by automation assets shall not occur via manual systems administration, but shall occur across the FAA’s portfolio of applications by updating the automation asset and having an automated configuration management system push out the update. In situations where manual systems administration is required, it shall occur using a command line interface such as the Bash shell or PowerShell, and all activities shall be logged and retained.
Members of the U.S. Digital Services community have spent considerable time developing a general framework for designing cloud-native digital services and they strongly encourage reuse of their assets. In fact, this site uses one of their templates. In accordance with OMB wishes, high gear applications shall be designed using the U.S. Digital Services Playbook and implemented using the U.S. Web Design Standards as a starting point.
Cloud-native apps use highly distributed architectures that decompose functionality into load-balanced pools of loosely-coupled stateless nodes that communicate over well-defined standard interfaces, such as message queues. This architecture enables horizontal scaling and designs for failure using automated health checks that look for node failure and automatically restore service.
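A pool of stateless nodes communicating over a message queue can be sketched in-process with Python's standard `queue` and `threading` modules. This is a toy model under stated assumptions: threads stand in for virtual machines or containers, the in-memory queues stand in for a real message broker, and health checking is omitted.

```python
import queue
import threading


def handle(request):
    """Stateless handler: the result depends only on the message itself."""
    return request["a"] + request["b"]


def worker(inbox, outbox):
    """One node in the pool: pull a message, process it, publish the result."""
    while True:
        message = inbox.get()
        if message is None:          # shutdown signal
            break
        outbox.put((message["id"], handle(message)))


inbox, outbox = queue.Queue(), queue.Queue()
pool_size = 3                        # scale horizontally by changing this number
pool = [threading.Thread(target=worker, args=(inbox, outbox)) for _ in range(pool_size)]
for node in pool:
    node.start()

for i in range(10):
    inbox.put({"id": i, "a": i, "b": 1})
for _ in pool:
    inbox.put(None)                  # one shutdown signal per node
for node in pool:
    node.join()

results = dict(outbox.get() for _ in range(10))
print(results)
```

Because no node keeps state between messages, any node can process any request, so a failed node can be replaced or the pool resized without coordinating with the others.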
Even in situations where latency or coupling with legacy systems of record might seem to make a particular app unsuitable for cloud in the near term, development should target these architectural principles anyway. Federal cloud is rapidly evolving to include support for FISMA high and DHS-compliant TIC cloud services, so there is a high likelihood that new applications that are not initially cloud suitable shall become suitable for cloud services during their lifespan. Until that becomes reality, the FAA operations team should partner with our vendors to judiciously bring cloudy innovations back into FAA data centers. In a world where Node.js and MongoDB apps run on Ubuntu on mainframe and VMware offers drop-in modules for OpenStack and Docker containers, there is little excuse not to design with cloud in mind.
The FAA Cloud offers a hybrid cloud brokerage model that allows applications to migrate between AWS, Azure, and private clouds via the Agility Platform. That may not seem like a killer feature to developers, but it most certainly does to operations, enterprise architects, and the IT executive team. As such, new high gear applications should maximize use of the vendor-neutral subset of cloud capabilities offered by the FCS contract via the Agility Store cloud service catalog. All applications seeking specialty high-value proprietary cloud services should contact the FCS program office early during their requirements gathering process to allow for cost-benefit analysis weighing functionality against lock-in, portfolio management, and integration work.
For example, a system owner may consider the Watson Visual Recognition service to be a perfect fit for a new Google Glass safety inspector app. The FCS program office may not be able to provide this particular service, but it will work with our strategic cloud partner to bring an architectural alternative into the cloud service catalog to fulfill your requirements, such as Google’s open-source TensorFlow technology provisioned on top of virtual machine contract line items. Alternately, the FCS program management team may approve the use of a specialty proprietary cloud service contingent on the implementation of compensating controls, such as the abstraction of APIs into modularized connectors that enable development of a drop-in replacement for the service should the need arise.
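The compensating control of abstracting a proprietary API behind a modular connector can be sketched as follows. None of these classes call a real service: `ImageClassifier`, both connector classes, and the `inspect` function are hypothetical, and the string matching merely stands in for an actual recognition request.

```python
from typing import Protocol


class ImageClassifier(Protocol):
    """Vendor-neutral interface the application codes against."""
    def classify(self, image_bytes: bytes) -> str: ...


class ProprietaryServiceConnector:
    """Hypothetical wrapper around a proprietary recognition service."""
    def classify(self, image_bytes: bytes) -> str:
        # A real connector would call the vendor's API here.
        return "proprietary:" + ("defect" if b"crack" in image_bytes else "ok")


class SelfHostedModelConnector:
    """Hypothetical drop-in replacement backed by a model run on ordered VMs."""
    def classify(self, image_bytes: bytes) -> str:
        return "self-hosted:" + ("defect" if b"crack" in image_bytes else "ok")


def inspect(classifier: ImageClassifier, image_bytes: bytes) -> bool:
    """Application logic depends only on the interface, not on the vendor."""
    return classifier.classify(image_bytes).endswith("defect")


photo = b"...crack in panel..."
for connector in (ProprietaryServiceConnector(), SelfHostedModelConnector()):
    print(type(connector).__name__, inspect(connector, photo))
```

Since `inspect` never names a vendor, replacing the proprietary service is a matter of supplying a different connector rather than rewriting the application.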