YouTube video summary

Evolving Trainline Architecture for Scale, Reliability and Productivity

Technology · 26 Nov 2024 · 16 min summary · From InfoQ

Introduction and Overview of Trainline

  • The presentation will cover lessons learned from scaling Trainline's architecture, including handling more traffic, enabling multiple engineers to work on the architecture simultaneously, and scaling the efficiency of the platform for cost-effectiveness and growth 38s.
  • Trainline is Europe's number one rail digital platform, retailing rail tickets for users worldwide and providing services throughout the rail journey, including platform information, disruption assistance, and compensation for delays 1m42s.
  • The company provides its services through its B2C brand, Trainline.com, as well as a white-label solution to partners in the carrier space and the wider travel ecosystem 2m24s.
  • Trainline is a public company, well-established, and profitable, with a significant size of business, although specific numbers will be discussed later in the presentation 3m35s.
  • The presentation will also cover the business lens of productivity, team impact, cost efficiency, and financial business impact, in addition to actual traffic growth and handling more business 1m16s.
  • The speaker will discuss how Trainline has made its architecture possible for more engineers to work on it simultaneously, allowing for a faster pace of growth and innovation 56s.
  • The presentation will last around 25-30 minutes, with time for questions at the end, and attendees are encouraged to note down any questions that arise during the presentation 3m19s.
  • Trainline has a significant scale in terms of technical impact, with around five billion net ticket sales and 350 searches per second for journeys and origin-destination pairs across 3.8 million monthly unique routes 3m45s.
  • The company has around 500 people in its tech and product organization, with the majority being tech professionals 4m28s.
  • Trainline has real-time information on the location of each live train in Europe, which presents a large problem space in terms of data and actions required when trains are delayed, canceled, or changed 4m46s.
  • The company has over 270 API integrations with individual rail and bus carriers, with zero standardization in this space, resulting in high maintenance costs and non-trivial integration challenges 6m27s.
  • The lack of standardization in rail APIs contrasts with the airline industry, which has standardized APIs through systems like Amadeus 6m48s.
  • The complexity of the problem space was not immediately apparent, but it became clear over time that certain problems, such as the aggregation of supply, were harder than initially thought 6m13s.
  • The rail industry faces a problem with inconsistent and disintegrated APIs, which have different access patterns and limitations, such as look-to-book ratios and rate limits, making it challenging to aggregate supply from various rail companies 7m17s.
  • Europe has 100 times more train stations than airports, adding to the complexity and scale of the problem, particularly when it comes to handling journey searches across multiple APIs 8m11s.
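
One way to picture the look-to-book constraint mentioned above is a small per-carrier quota that decides whether a journey search may hit the carrier API or should fall back to a cached answer. This is only a minimal sketch, not Trainline's implementation: `CarrierQuota`, `max_look_to_book`, and `search` are hypothetical names, and the rule "searches allowed = ratio × bookings" is an assumed simplification of how such limits are usually expressed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CarrierQuota:
    """Tracks searches ("looks") and bookings against one carrier API.

    max_look_to_book is a hypothetical contractual limit: the carrier
    tolerates at most this many searches per completed booking.
    """
    max_look_to_book: int
    looks: int = 0
    books: int = 0

    def can_search_live(self) -> bool:
        # Count zero bookings as one so a fresh quota still permits some looks.
        allowed_looks = self.max_look_to_book * max(self.books, 1)
        return self.looks < allowed_looks

    def record_book(self) -> None:
        self.books += 1


def search(quota: CarrierQuota, cached_result: Optional[str]) -> str:
    """Return where the answer would come from: 'live', 'cache' or 'unavailable'."""
    if quota.can_search_live():
        quota.looks += 1
        return "live"
    return "cache" if cached_result is not None else "unavailable"
```

With 270+ integrations, each carrier would presumably need its own quota and its own rules, which is part of why aggregation is costly to maintain.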

Challenges in the Rail Industry

  • The aggregation of supply in the rail industry is unique, but the problem of handling transactions over a finite inventory is not, and is similar to the classic "Ticket Master problem" 8m34s.
  • Selling seats on unique trains with limited inventory is more complex than selling digital products, as it requires checking inventory and handling transactions reliably and quickly 9m3s.
  • The company currently handles around 1300 transactions per minute at peak times, which has grown from 800 in the past three years, partly due to the recovery of rail travel and the company's growth in Europe 9m33s.
  • The speed at which people expect to receive their tickets, literally within a second, adds to the complexity of the problem, requiring instant fulfillment 10m33s.
  • Rail travel is often assumed to be bought weeks or months in advance, but about 60% of tickets bought on Trainline are purchased on the day of travel, often just before boarding. This requires instant processing, including interactions with industry-level processes so that the ticket validates at barrier gates 10m48s.
  • The expectations for processing and validating tickets are high, involving complex interactions with industry-level processes to ensure the ticket is recognized as valid at barrier gates 11m32s.
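
The finite-inventory problem described above comes down to making "check availability, then decrement" a single atomic step, so that two buyers can never be sold the same seat. The sketch below illustrates that idea in one process with a lock. It is an assumption-laden toy (`SeatInventory` and `sell_concurrently` are hypothetical names); a real system would enforce the same invariant in the database or in the carrier's inventory API.

```python
import threading


class SeatInventory:
    """In-memory seat counter for one train; illustrative only."""

    def __init__(self, seats: int) -> None:
        self._available = seats
        self._lock = threading.Lock()

    def reserve(self, count: int = 1) -> bool:
        # Check and decrement under one lock so the two steps cannot interleave.
        with self._lock:
            if count <= 0 or count > self._available:
                return False
            self._available -= count
            return True

    @property
    def available(self) -> int:
        with self._lock:
            return self._available


def sell_concurrently(inventory: SeatInventory, buyers: int) -> int:
    """Simulate many buyers racing for seats; return how many succeeded."""
    results = []

    def attempt() -> None:
        results.append(inventory.reserve())

    threads = [threading.Thread(target=attempt) for _ in range(buyers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(results)
```

For example, 50 concurrent buyers racing for 10 seats should produce exactly 10 successful reservations and 40 refusals.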

Lessons Learned from Scaling

  • The talk will cover three lessons learned from scaling, focusing on team and productivity, cost efficiency, and scaling with growth in traffic and achieving higher reliability 11m45s.
  • The first lesson will focus on the impact of architecture on team productivity, highlighting how it can both enable and hinder progress as team sizes change 11m51s.
  • The second lesson will discuss cost efficiency and scaling the efficiency of the platform 12m5s.
  • The third lesson will cover scaling with growth in traffic, achieving higher reliability, and dealing with availability issues 12m10s.

Team Productivity and Organizational Structure

  • When the speaker joined Trainline in July 2021, the company had around 350 engineers organized in a cluster model, with teams focused on specific parts of the technical stack, such as Android, iOS, web, and backend 12m49s.
  • However, this organizational structure led to low team productivity, as most projects required collaboration between at least five and often up to 10 different teams, resulting in complex project management and delays 13m43s.
  • The previous team structure was slow and not suitable for the scale, leading to a massive reorganization in January 2022, where the team was restructured into a platform and verticals model, with platform owning the technical stack and verticals having people with different skill sets, shaped around clear ownership of product and business goals 14m26s.
  • The platform and verticals model improved alignment of the team to goals, but came with challenges like tension between platform and vertical teams, where vertical teams wanted to deliver quickly and platform teams wanted to ensure proper implementation and refactoring 15m14s.
  • This tension between platform and vertical teams is built into the model and can be frustrating for people on both sides. It is healthy to have, but sometimes needs to be managed 15m46s.
  • Recently, the team reorganized again, reducing the ownership of the platform to very core services and moving 50% of the tech surface to the verticals, which were renamed diagonals, to get the best parts of both models and streamline the process 15m58s.
  • The new model aims to remove some of the tension between platform and verticals, and the team is trying to find the right balance between the two 16m33s.
  • The key question in all three models is who owns each part of the technical surface, who is in charge of sustain work, mandatory technical upgrades, and driving the technical strategy and vision for that part of the codebase 16m57s.
  • The team needs to determine who effectively does the core work to sustain and drive the technical roadmap, in addition to who builds all the features 17m26s.
  • The architectural implications of these changes are relevant to the audience, and the team is trying to find the right balance between different models to achieve their goals 16m46s.
  • The team's goals and alignment can be defined by answering two questions: which product or business or tech goal is the team on the hook for, and what are the key performance indicators (KPIs) for that goal, with different teams having different answers to these questions 17m37s.
  • The three key performance indicators (KPIs) are A (alignment of engineering investment to business goals), P (productivity), and Q (quality of technical work, meaning the risk of adding technical debt and whether changes make the platform better or worse) 18m16s.
  • In the earlier cluster model, alignment was poor: engineers focused only on their part of the code rather than on product or business goals. End-to-end productivity was also poor, but quality was good because people worked within a small, constrained part of the technology surface 18m35s.
  • When the team moved to the platform and verticals model, alignment became crisp and clear, with engineers on the hook for specific business goals. The result was perfect alignment, better productivity, and good quality, but with some tension between the verticals and the platform team 19m7s.
  • The current model has removed the platform team's policing of contributions, slightly diluting the clarity of alignment, but enabling productivity, and it's essential to note that different models work better at different times for a company, and shifting the model can bring a different balance and kind of productivity 20m5s.

Architectural Implications and Ownership

  • Customer and business needs do not respect architectural boundaries, and it is essential to acknowledge this fact when designing a platform 20m54s.
  • Even with a well-designed platform, business priorities and needs can change, requiring the architecture to evolve accordingly 21m29s.
  • Business strategy, technology ownership, and organizational structure will also change over time, and it is crucial to adapt to these changes 21m39s.
  • Conway's Law states that technology ends up getting the shape of the organization that makes it, and how teams are organized affects the technology built 22m18s.
  • There is no perfect reverse Conway maneuver, making it challenging to change the technical architecture designed by a certain organizational structure 22m39s.
  • It is essential to build technology and architectures with the fact of ownership transfers and external contributions in mind 23m7s.
  • Enforcing consistency is crucial, and leaders should establish a company-wide approach to technology, rather than allowing individual preferences to dictate the way things are done 23m34s.
  • This approach is necessary because team members and structures are likely to change over time, and a consistent approach ensures that technology can be maintained and updated efficiently 23m46s.
  • To achieve scale, reliability, and productivity, it is essential to enforce consistency within an organization by using as few languages and technologies as possible, even if it means sacrificing individual autonomy, to make it easier to transfer ownership and allow for external contributions 24m8s.
  • Consistency is key, as it allows for the transfer of "Lego blocks" rather than custom-built items, making it easier to reassemble them in different ways 24m19s.
  • Trainline has been around for 20 years and still has technology in production that was written 15 years ago, highlighting the importance of keeping things consistent with few languages and technologies 25m3s.

Cost Efficiency and Optimization

  • Production costs are a significant concern, with Trainline's AWS bill accounting for about 25% of their overall software engineer compensation bill 26m11s.
  • The cost of the platform was growing faster than traffic, prompting a goal to run the platform efficiently and to make sure the organization is disciplined and running a tight ship 27m14s.
  • The goal is not to make drastic cuts but to ensure the organization is running efficiently and making the most of its resources 27m20s.
  • The goal was to drive down the annual run rate of production costs by 10% to reduce the entire annual bill 27m26s.
  • The team has a massive surface area with over 700 microservices and more than 100 databases, making it challenging to identify areas for cost reduction 27m43s.
  • To achieve the goal, the team considered various levers, including cleaning up unneeded data, consolidating non-production environments, reviewing old low-value services, and right-sizing the platform 28m25s.
  • They also looked at data retention policies and reviewed architectural choices, such as the use of cloud functions and lambdas, to determine if they were efficient 28m55s.
  • The team decided to delegate the problem to individual teams that own parts of the technology stack, tasking each team with driving 10% of the bill down for their area 29m58s.
  • The plan was to use attribution to track what's driving the costs and hold each team accountable for reducing their costs 30m6s.
  • A goal was set for each team to reduce their part of the bill by 10%, but this led to some teams taking risks and doing things that ultimately caused problems, such as outages, due to underprovisioning of services 30m39s.
  • Instead of reviewing architecture choices and making changes, most teams simply scaled down their services, which sometimes worked but also led to outages 31m27s.
  • In a three-month period, there were four outages caused by underprovisioning, which occurred when the platform hit peak traffic, usually on Tuesdays 31m42s.
  • The goal of reducing costs by 10% was achieved, but it was not the best value for money, as engineers spent more time than necessary, and there were unintended consequences, such as outages 32m28s.
  • Cost management is important for the long-term efficiency of the platform, but understanding where to make cuts in a large, fragmented microservice-based system requires centralized thinking and cannot be delegated to individual teams 32m55s.
  • Predicting which cost-reducing efforts are worth it can be tricky, especially for those without a full understanding of the system, and blindly pushing down cost-saving goals to individual teams can lead to more problems than solutions 33m21s.
  • A centralized task force that works with individual teams to evaluate where investments to save costs are worth it would be a better approach than delegating cost-saving goals to teams 33m42s.
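
The attribution step and the underprovisioning risk described above can be sketched together: sum the bill per owning team, then check any proposed scale-down against peak load rather than average load. This is a hedged illustration only. `attribute_costs`, `safe_instance_count`, the line-item shape, and the 60% peak-utilization ceiling are all assumptions, not figures or tooling from the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attribute_costs(line_items: Iterable[Tuple[str, str, float]]) -> Dict[str, float]:
    """Sum monthly cost per owning team from (team, service, cost) line items."""
    totals: Dict[str, float] = defaultdict(float)
    for team, _service, cost in line_items:
        totals[team] += cost
    return dict(totals)


def safe_instance_count(current_instances: int,
                        peak_utilization_pct: int,
                        max_utilization_pct: int = 60) -> int:
    """Smallest instance count that keeps utilization at weekly peak under the ceiling.

    peak_utilization_pct is the utilization observed at peak traffic with
    current_instances running. The 60% default ceiling is an assumed headroom
    policy, not a recommendation from the source.
    """
    peak_load = current_instances * peak_utilization_pct  # in instance-percent
    needed = -(-peak_load // max_utilization_pct)  # integer ceiling division
    return max(1, needed)
```

Under these assumptions, a service running 10 instances at 30% peak utilization could drop to 5, while one already at 55% at peak has no safe saving. Sizing against the peak is what the scale-downs that caused outages skipped.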

Scaling for Growth and Reliability

  • The second lesson, in short, is to manage system-wide cost-saving efforts centrally and avoid fully delegating them, as that can backfire. The third lesson is about scaling for growth in traffic and reliability 33m55s.
  • The speaker briefly covers three big bouts of outages, which could each be a talk on their own, and asks the audience to keep the information confidential 34m26s.
  • The first outage occurred in October 2021, when Trainline went down for four hours one day and two hours the next day, due to the platform struggling to handle the sudden increase in traffic after the COVID-19 pandemic 34m53s.
  • The cause of the outage was the contention in terms of database connections, as many new microservices had been added, each maintaining connections to the database, leading to a bottleneck in the relational databases 36m16s.
  • The relational databases were hosted on a single machine, which couldn't keep up with the many connections. The setup was eventually tweaked and tuned, and the platform survived the period 36m43s.
  • A year later, in October, another outage occurred, also related to the database, but this time involving old Oracle databases 37m14s.
  • The company experienced a significant outage due to a gradual increase in load on the orders database, which was caused by the addition of features related to the journey experience, such as following a journey and receiving notifications about platform changes, over the course of a year 37m46s.
  • The company's observability was primarily focused on transactional flows, and as a result, the increase in load on the orders database went unnoticed until it caused an outage 38m24s.
  • The company had to implement database-related fixes to resolve the issue and prevent similar outages in the future 38m55s.
  • The company recently experienced a series of DDoS attacks, which may have been attributable to nation-state actors, and had to tighten up its DDoS protections and make other changes to mitigate the attacks 39m4s.
  • Despite initial concerns that the platform was being DDoS attacked again, it was discovered that the issue was actually caused by sloppy retry strategies throughout the stack, which allowed small issues to snowball into larger problems and eventually bring the platform down 39m56s.
  • The company did not have a coordinated retry strategy, which contributed to the problem, and everything from client-side retries to backend service retries ended up creating a 10x load that brought the platform down 40m25s.
  • The architectural lesson learned from past experiences is that none of the issues were caused by a single team, change, or regression, but rather by a buildup of small problems over time, making it difficult to predict and detect bottlenecks in a large microservice-based system 40m51s.
  • Predicting bottlenecks in such systems is challenging due to the complexity and spread of microservices, with each team chasing their own goals and contributing to the overall problem, often resulting in a tragedy of the commons 41m40s.
  • The best approach to handling this issue is to regularly review longer-term traffic mix or load changes, such as reviewing changes in critical databases or services over a period of six months, to identify potential bottlenecks and guide teams accordingly 42m15s.
  • Service fleet coordination is critical in guiding teams and ensuring a strong architecture function or principal engineering function to oversee the big picture and prevent issues from arising due to individual team ownership and decisions 42m53s.
  • Observing over longer terms and coordinating microservice fleets are essential lessons learned from past experiences, with consistency being key to productivity in the long run, especially for startups or seed-stage companies 43m30s.
  • When building a business where technology should survive for five to 15 years, it is essential to insist on consistency and manage system cost-saving efforts centrally, even if engineers may not love it, to avoid losing the wider context and to coordinate microservice fleets to avoid outages 43m52s.
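
The retry problem described above compounds because retries multiply across layers: if every layer retries independently, the worst-case load on the deepest dependency is the product of the attempts at each layer. The sketch below shows that arithmetic and one common mitigation, a capped number of attempts with exponential backoff and jitter. It is illustrative only. The function names and default values are assumptions, and the talk does not say which retry policy Trainline adopted.

```python
import random
import time
from math import prod
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def worst_case_amplification(attempts_per_layer: Sequence[int]) -> int:
    """Worst-case calls reaching the bottom layer for one user request."""
    return prod(attempts_per_layer)


def call_with_backoff(operation: Callable[[], T],
                      max_attempts: int = 3,
                      base_delay: float = 0.1,
                      max_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Call operation, retrying failures with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the error instead of retrying forever
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # random delay in [0, cap) spreads out retry bursts
    raise ValueError("max_attempts must be at least 1")
```

For instance, a client that makes 3 attempts, calling a gateway that makes 3 attempts, calling a service that makes 2 attempts, can put 18 calls on the database for a single user action. A coordinated strategy would typically allow retries at only one layer, or budget them across the fleet.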

Knowledge Transfer and Service Ownership

  • Building an architecture that anticipates changes in team structure and people can be challenging, as it runs counter to optimizing for knowledge and change management in the near term, but it is crucial to balance short-, medium-, and long-term goals 45m3s.
  • The "build it, you own it" strategy is effective for the first six to 12 months of a service, but after that, the person who built it is likely to move on, and the service needs to be handed over to others, requiring a transition period to ensure adoption and knowledge transfer 45m39s.
  • The platform verticals model can be used to facilitate this transition, where a vertical builds a new service and is on the hook for it for six to nine months, with the relevant parts of the platform advising through that period 46m17s.
  • The goal is to create "Lego blocks" that anyone can pick up and take care of, even if they are not an expert, to enable an agile organization that can focus on the most important things 47m10s.
  • The transition period can be challenging, and it is essential to pull teams back from the mindset of having a single person who knows the service, as this can cause contention and make it difficult to touch the service 46m49s.
  • The "bus factor of one" is a common problem, where only one person knows the service, and it is essential to avoid this by creating a culture of knowledge sharing and transfer 46m56s.

Microservices, Consistency, and Technology Adoption

  • Microservices were initially adopted for team autonomy, allowing different languages to be used, but now consistency is considered key, raising questions about its future and potential downsides 47m41s.
  • Having multiple languages for front-end development, such as Android and iOS, and different back-end languages, like .NET and Ruby, can lead to fragmentation of skill sets within an organization 48m14s.
  • This fragmentation can make it challenging to assemble cross-functional teams to deliver simple features, requiring a large number of people with different skill sets 48m44s.
  • The complexity of native platforms like Android and iOS, each with their own languages, contributes to this challenge 49m11s.
  • To address this, it's essential to strike a balance between allowing innovation and trying new things, while also having a path for making successful new technologies official and widely adopted 49m30s.
  • Allowing everyone to choose their own technologies without a clear path for adoption can lead to unmanageability, especially as the business grows 49m44s.
  • A successful business needs a clear strategy for managing technology adoption to avoid complications in the long run 49m55s.