Design for Failure!
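Worth a read: Netflix's "5 Lessons We’ve Learned Using AWS". If you don’t have a Chaos Monkey (a tool that randomly kills production instances to prove your services can survive the loss), then get one fast, and maybe even a gorilla. The rest of Netflix's Simian Army: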
- Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately.
- Conformity Monkey finds instances that don’t comply with best practices and shuts them down.
- Doctor Monkey taps into health checks that run on each instance and monitors other external signs of health (e.g. CPU load) to detect unhealthy instances.
- Security Monkey searches for security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. It also makes sure all our SSL and DRM certificates are valid and are not coming up for renewal.
- 10-18 Monkey (short for Localization-Internationalization, or l10n-i18n) detects configuration problems in instances serving customers in multiple geographic regions, using different languages and character sets.
- Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone.
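
To make the difference between Chaos Monkey and Chaos Gorilla concrete, here is a minimal toy simulation in Python. It is only a sketch: it does not call any AWS API or Netflix's actual tools, and every name in it (`Instance`, `build_fleet`, `chaos_monkey`, `chaos_gorilla`, `service_is_up`, the zone labels) is made up for illustration. The monkey terminates one random instance; the gorilla takes out a whole availability zone. The point is to check that the service stays up either way.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical stand-in for a cloud instance; not an AWS object.
    instance_id: str
    zone: str
    alive: bool = True


def build_fleet(zones, per_zone):
    """Create a toy fleet with `per_zone` instances in each zone."""
    return [
        Instance(f"{zone}-{n}", zone)
        for zone in zones
        for n in range(per_zone)
    ]


def chaos_monkey(fleet, rng):
    """Terminate one randomly chosen live instance and return it."""
    live = [inst for inst in fleet if inst.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def chaos_gorilla(fleet, zone):
    """Simulate a zone outage: terminate every live instance in `zone`."""
    killed = [inst for inst in fleet if inst.zone == zone and inst.alive]
    for inst in killed:
        inst.alive = False
    return killed


def service_is_up(fleet, min_live=1):
    """The toy service is up while at least `min_live` instances are alive."""
    return sum(inst.alive for inst in fleet) >= min_live


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a failing run can be replayed
    fleet = build_fleet(["zone-a", "zone-b", "zone-c"], per_zone=2)

    victim = chaos_monkey(fleet, rng)
    print("monkey killed", victim.instance_id, "| up:", service_is_up(fleet))

    killed = chaos_gorilla(fleet, "zone-a")
    print("gorilla killed", [inst.instance_id for inst in killed],
          "| up:", service_is_up(fleet))
```

A fleet confined to a single zone would survive the monkey but not the gorilla, which is why the deployment in the sketch is spread across three zones.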