Gremlin's Status Checks
- Gremlin, where Ana Medina works as a Senior Chaos Engineer, has launched status checks, a feature that verifies a system's health before a chaos experiment runs. 2m46s
- Status checks can be integrated with tools like Datadog, New Relic, and PagerDuty, and users can also create their own using API endpoints. 3m50s
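
The idea of gating an experiment on a health endpoint can be sketched roughly as follows. This is not Gremlin's API: the function names (`http_probe`, `status_check_passes`, `run_if_healthy`), the latency threshold, and the URL in the usage note are all hypothetical, and the sketch assumes that "healthy" means a 2xx response returned within a latency budget.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple

# A probe takes a URL and returns (HTTP status code, seconds elapsed).
Probe = Callable[[str], Tuple[int, float]]


def http_probe(url: str, timeout: float = 5.0) -> Tuple[int, float]:
    """Issue a GET request and report the status code and elapsed time."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        status = 0  # endpoint unreachable or timed out
    return status, time.monotonic() - start


def status_check_passes(
    url: str, probe: Probe = http_probe, max_latency_s: float = 1.0
) -> bool:
    """Assumed health rule: a 2xx status within the latency budget."""
    status, elapsed = probe(url)
    return 200 <= status < 300 and elapsed <= max_latency_s


def run_if_healthy(
    url: str, experiment: Callable[[], None], probe: Probe = http_probe
) -> bool:
    """Run the experiment only when the status check passes."""
    if not status_check_passes(url, probe):
        return False  # system is already unhealthy, so skip the experiment
    experiment()
    return True
```

A call such as `run_if_healthy("https://example.com/health", start_experiment)` would then start the experiment only when the endpoint looks healthy. The probe is passed in as a parameter so the gate can be exercised without network access.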
Impact of Complex Systems
- Complex systems are impacted by many factors, including world events like pandemics. 6m17s
- The pandemic highlighted the difference between organizations that were prepared for high traffic and those that were not. 7m30s
Game Days and Chaos Engineering Workshops
- Gremlin has resources for running game days, but a fully developed remote game day runbook has not yet been created. 10m33s
- Successful virtual game days can be run with proper planning, communication, and collaboration tools like Zoom and Google Docs. 11m40s
- Assigning specific roles, such as commander, note-taker, observer, and tester, helps participants focus on their tasks and contributes to a more successful game day experience. 12m30s
- Gremlin's chaos engineering workshops incorporate hands-on experiments in a cloud infrastructure environment, using Kubernetes, monitoring tools, and a microservice demo environment, to provide practical experience. 13m14s
Benefits of Chaos Engineering
- Chaos engineering can reveal inaccuracies in architecture diagrams by demonstrating how an entire application can break down when traffic to a single service or container is blocked. 16m17s
- When implementing chaos engineering, it is recommended to prioritize testing critical, high-impact services (tier zero and tier one) to maximize the return on investment. 18m26s
- Past incidents, documented in a blameless postmortem format, provide valuable insights for chaos engineering experiments by highlighting system vulnerabilities and areas for improvement. 19m9s
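
As a rough illustration of the first two points, the sketch below models a small service map and computes which services fail when one dependency is blocked, then lists tier zero and tier one services as the first experiment targets. The service names, the dependency map, and the tier assignments are invented for the example, and the model assumes every dependency is a hard one with no fallback. Comparing such a prediction with what a real experiment shows is one way a diagram's inaccuracies might surface.

```python
from typing import Dict, List, Set

# Hypothetical service map: each service lists the services it calls.
DEPENDS_ON: Dict[str, List[str]] = {
    "frontend": ["cart", "catalog"],
    "cart": ["redis"],
    "catalog": [],
    "redis": [],
    "ads": [],
}

# Hypothetical criticality tiers (0 is the most critical).
TIER: Dict[str, int] = {"frontend": 0, "cart": 0, "redis": 1, "catalog": 1, "ads": 2}


def impacted_by(blocked: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that fails when `blocked` is unreachable.

    Assumes hard dependencies: a service fails if any dependency fails.
    """
    impacted = {blocked}
    changed = True
    while changed:  # repeat until no new failures propagate upstream
        changed = False
        for service, deps in depends_on.items():
            if service not in impacted and any(d in impacted for d in deps):
                impacted.add(service)
                changed = True
    return impacted


def experiment_order(tier: Dict[str, int], max_tier: int = 1) -> List[str]:
    """List services at or below `max_tier`, most critical first."""
    candidates = [s for s, t in tier.items() if t <= max_tier]
    return sorted(candidates, key=lambda s: (tier[s], s))


print(sorted(impacted_by("redis", DEPENDS_ON)))  # ['cart', 'frontend', 'redis']
print(experiment_order(TIER))  # ['cart', 'frontend', 'catalog', 'redis']
```

In this made-up map, blocking traffic to `redis` alone takes down `cart` and then `frontend`, which is the kind of cascade a diagram may not make obvious.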
Importance of Training and Ethics
- There is a lack of focus on training and ethics in software engineering despite the potential for technology to cause harm. 22m41s
- Organizations should ideally begin planning 3-6 months in advance for important dates like Cyber Monday to ensure system resilience. 24m51s
- Code freezes are a warning sign that things need to change and that teams may not be equipped to handle changes during incident-heavy periods. 26m23s
Gremlin's Resources
- Gremlin offers free monthly training courses on chaos engineering, including fundamentals and automation. 27m31s
- Gremlin's Chaos Conf will be held virtually on October 6-8, featuring tracks on reliability practices, completing the DevOps loop, and data-driven reliability culture. 28m10s
- The best way to contact Ana Medina is through her Twitter handle, @Ana_M_Medina. 29m23s







