YouTube video summary

Mastering Long-Running Processes in Modern Architecture: Real-Life Examples & Tools for Engineers

14 Oct 2024 · 14 min summary · From InfoQ

Introduction to Long-Running Processes

Long-running processes can be compared to ordering food, such as pizza, where there are different ways to place an order, including phone calls and emails 17s.
Phone calls represent synchronous blocking communication, where the caller is blocked until the other person answers, and a direct feedback loop is established once the call is answered 42s.
However, this method has limitations: the caller is temporally coupled to the availability of the other side, and if the person is not available, the caller must try again or wait 1m13s.
An alternative to phone calls is sending an email, which represents asynchronous non-blocking communication, allowing the sender to send the message even if the recipient is not available 1m40s.
Emails lack a direct feedback loop, but the recipient can still respond to the email, providing a feedback loop, albeit asynchronously 2m6s.
The key difference between synchronous and asynchronous communication is not the technology used, but the interaction pattern, which can be decoupled from the technology 2m36s.
In the case of ordering pizza, the feedback loop is not the same as the result, as the customer is still hungry after receiving confirmation of their order, and the actual result is the pizza being delivered 2m51s.
The task of making pizza is a long-running process that takes time, involving multiple steps, such as baking and delivery, and this pattern is seen in many other interactions beyond just ordering food 3m10s.
Synchronous blocking behavior is not suitable for long-running processes, as it would require the customer to wait for an extended period, and asynchronous results are more appropriate 3m39s.
The process of making coffee at a machine is described as synchronous and blocking, meaning that while waiting for the coffee, no other tasks can be performed, leading to inefficiencies and poor user experience, especially when there is a queue. 4m12s
An article by Gregor Hohpe discusses how Starbucks scales its coffee-making process by separating ordering and payment from the actual coffee preparation, allowing baristas to work independently and improving scalability and user experience. 4m48s
Fast food chains are increasingly using apps for ordering to streamline the initial steps of the process, although the actual preparation, such as coffee making, often still involves human workers like baristas. 5m41s
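The difference between the two interaction patterns can be sketched in a few lines of Python. This is only an illustration of the pizza analogy: the names `order_by_phone`, `EmailPizzeria` and `bake_pizza` are invented here, and the short sleep stands in for baking and delivery.

```python
import queue
import threading
import time


def bake_pizza(order_id: str) -> str:
    """Stand-in for the slow work; the sleep simulates baking and delivery."""
    time.sleep(0.05)
    return f"pizza for {order_id}"


def order_by_phone(order_id: str) -> str:
    """Synchronous and blocking: the caller waits until the pizza exists."""
    return bake_pizza(order_id)


class EmailPizzeria:
    """Asynchronous and non-blocking: confirm now, deliver the result later."""

    def __init__(self) -> None:
        self.deliveries: "queue.Queue[str]" = queue.Queue()

    def place_order(self, order_id: str) -> str:
        # Start the long-running work in the background and return at once.
        threading.Thread(target=self._work, args=(order_id,)).start()
        return f"order {order_id} received"  # feedback, not the result

    def _work(self, order_id: str) -> None:
        self.deliveries.put(bake_pizza(order_id))
```

The confirmation returned by `place_order` is the feedback loop; the pizza itself only arrives later on `deliveries`, which is the distinction the notes draw between feedback and result.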

Challenges of Long-Running Processes

Long-running processes are defined as those that involve waiting, which can be due to human tasks such as approvals or decisions, or simply waiting for a response from a customer. These processes can take from hours to weeks. 6m14s
One startup is given as an example: it automated a service but intentionally added a delay to simulate human processing time, showing that waiting times in a process sometimes have to be managed deliberately. 7m15s
Waiting is challenging because it requires remembering the state of the process over potentially long periods, necessitating persistent state management to ensure continuity when the process resumes. 7m44s
Persisting state remains a problem even though databases exist, because of the follow-on requirements: understanding what each process is waiting for, escalating when the wait takes too long, handling versioning, and running at scale 8m9s.
These technical challenges can be difficult to solve without adding accidental complexity, and homegrown workflow engines are often not a good solution 9m1s.
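To make the follow-on requirements concrete, here is a minimal sketch of persisted waiting state with an escalation deadline, using Python's built-in `sqlite3`. The table layout and the function names are assumptions made for illustration; a real workflow engine covers far more (versioning, scale, visibility), which is the talk's argument against homegrown solutions.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create the store; a file path instead of :memory: would survive restarts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE waiting ("
        " instance_id TEXT PRIMARY KEY,"
        " waiting_for TEXT NOT NULL,"
        " escalate_at REAL NOT NULL)"
    )
    return conn


def start_waiting(conn, instance_id, waiting_for, now, timeout_seconds):
    """Remember what the instance waits for and when to escalate."""
    conn.execute(
        "INSERT INTO waiting VALUES (?, ?, ?)",
        (instance_id, waiting_for, now + timeout_seconds),
    )
    conn.commit()


def resume(conn, instance_id) -> bool:
    """Called when the awaited event arrives; False if nothing was waiting."""
    cur = conn.execute("DELETE FROM waiting WHERE instance_id = ?", (instance_id,))
    conn.commit()
    return cur.rowcount == 1


def overdue(conn, now) -> list:
    """Return (instance_id, waiting_for) pairs whose deadline has passed."""
    rows = conn.execute(
        "SELECT instance_id, waiting_for FROM waiting"
        " WHERE escalate_at <= ? ORDER BY instance_id",
        (now,),
    )
    return rows.fetchall()
```

Even this small sketch already needs a scheduler that calls `overdue` regularly, which hints at how quickly accidental complexity accumulates.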

Workflow Engines as a Solution

The speaker has experience working on workflow engines, CR engines, and orchestration engines, having co-founded Camunda, a workflow orchestration company, and worked on open-source workflow engines 9m42s.
A workflow engine, also known as an orchestration engine or process engine, addresses these long-running problems: workflows are defined once, the engine runs instances of them, and requirements such as versioning and escalation are covered by the engine 10m23s.
A demo of a workflow engine was given to illustrate its capabilities and provide a common understanding of what a workflow engine is 10m45s.
The demo used an example of an onboarding process, which is a common process in many companies, such as opening a new bank account or mobile phone contract 11m32s.
The workflow engine used in the demo is available on GitHub, allowing others to run it themselves and experiment with workflow engines 11m2s.
BPMN (Business Process Model and Notation) is an ISO standard used to define processes graphically, and it's not a proprietary thing, allowing for standardized process modeling 11m51s.
BPMN models can be used to define manual tasks, such as scoring a customer and approving an order, and can also include automated tasks and escalations 12m9s.
A timer in the BPMN model can define a duration, such as 10 seconds, after which a task counts as taking too long and is escalated 12m48s.
A Java application, in this case a Spring Boot application, can be used to connect to a workflow engine, deploy the process, and provide a small web UI 13m2s.
The application can trigger a REST call to start a process instance within the workflow engine, and tools like Operate can be used to look into what's going on and see the versioning and instances running 13m45s.
The workflow engine can automate tasks, such as sending an email, and can be integrated with other systems, such as a CRM system, using custom Java code or pre-built connectors 15m1s.
In the demo, the email is configured through a pre-built connector, such as the one for SendGrid 15m12s.
The workflow engine is running in the background, and the workflow model has instances running through it, with code or UI attached to connect to systems or humans 15m36s.
The workflow engine used in this example is Camunda as a Service, and it's integrated with a Spring Boot application 15m49s.
Workflow engines are not limited to small-scale processes: they can run thousands of process instances per second and can be distributed across multiple data centers in different geographic locations, such as the US and UK. Geographic distribution adds latency but does not bring throughput down 16m10s.
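The idea of defining a workflow once and running many versioned instances of it can be illustrated with a toy in-memory engine. This is not Camunda's API; the `Engine` class and its methods are hypothetical, and the rule that a running instance stays on the version it was started with is an assumption of this sketch rather than something stated in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Engine:
    """Toy workflow engine: definitions are ordered lists of step names."""

    definitions: dict = field(default_factory=dict)  # (key, version) -> steps
    latest: dict = field(default_factory=dict)       # key -> latest version
    instances: dict = field(default_factory=dict)    # id -> (key, version, index)

    def deploy(self, key, steps) -> int:
        """Store a new version of a workflow definition and return its number."""
        version = self.latest.get(key, 0) + 1
        self.definitions[(key, version)] = list(steps)
        self.latest[key] = version
        return version

    def start(self, instance_id, key) -> None:
        """Start an instance pinned to the latest version at start time."""
        self.instances[instance_id] = (key, self.latest[key], 0)

    def current_step(self, instance_id) -> str:
        key, version, index = self.instances[instance_id]
        return self.definitions[(key, version)][index]

    def complete_step(self, instance_id) -> bool:
        """Advance one step; return True when the instance has finished."""
        key, version, index = self.instances[instance_id]
        if index + 1 >= len(self.definitions[(key, version)]):
            del self.instances[instance_id]
            return True
        self.instances[instance_id] = (key, version, index + 1)
        return False
```

An instance started before a redeployment keeps following the old step list, while instances started afterwards follow the new one, which is the kind of versioning question a persistent, long-running process raises.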

Technical Reasons for Waiting in Long-Running Processes

There are technical reasons why processes may need to wait, including asynchronous communication, where a message may not be received immediately, and failure scenarios where a message is not received at all 17m1s.
In distributed systems, peer services may not always be available, requiring processes to wait for them to become available before proceeding 17m33s.
A common example is checking in for a flight: the user receives an email notification to check in, but the check-in may fail and have to be retried. The retry can be stateful, meaning it is scheduled for a later time 18m23s.
In the case of the flight check-in example, the user may need to wait for a few hours before retrying, and can use a calendar entry to remind them to try again, illustrating a stateful retry in a long-running process 19m24s.
The situation can be envisioned as having a web interface, a check-in microservice, and a background process that handles the check-in, which may need to wait for certain conditions to be met before proceeding 19m51s.
A personal experience of a failed check-in caused by a barcode generation issue is used to illustrate the importance of resiliency in distributed systems, where some component is always broken or some network connection is always down 19m59s.
The failure of a single component, such as barcode generation, should not bring down the entire system, and a well-designed system should be able to handle such failures locally without affecting the overall user experience 20m51s.
A chain reaction of failures can occur when a problem is passed on to the user, making them responsible for resolving the issue, which is a bad design 21m12s.
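A stateful retry, as opposed to retrying in a tight loop, records when the next attempt is due and returns control in the meantime. The sketch below is one way to express that; `RetryTask`, `run_if_due` and the exponential backoff values are illustrative assumptions, and in a real service the task record would be persisted rather than kept in memory.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RetryTask:
    """State that must outlive a single attempt."""

    name: str
    attempts: int = 0
    next_attempt_at: float = 0.0
    done: bool = False


def backoff_seconds(attempts: int, base: float = 60.0, cap: float = 3600.0) -> float:
    """Exponential backoff: base, 2*base, 4*base, ... capped at cap."""
    return min(cap, base * (2 ** (attempts - 1)))


def run_if_due(task: RetryTask, action: Callable[[], None], now: float) -> RetryTask:
    """Try the action only if the task is due; on failure, schedule a retry."""
    if task.done or now < task.next_attempt_at:
        return task  # nothing to do yet, no thread is blocked waiting
    task.attempts += 1
    try:
        action()
        task.done = True
    except Exception:
        task.next_attempt_at = now + backoff_seconds(task.attempts)
    return task
```

A scheduler would call `run_if_due` periodically, so a temporary outage such as the barcode generation failure is absorbed by the service instead of being passed on to the user.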

Resilience in Long-Running Processes

A better design would be for the check-in service to handle the issue locally, for example, by checking in the user and sending the boarding pass later, which requires long-running capabilities within the service 22m44s.
Many teams prefer to be stateless and avoid keeping state, which can lead to rethrowing errors instead of handling them locally 23m23s.
Customers often expect a synchronous response, such as seeing a confirmation message and receiving a boarding pass immediately, which can make it challenging to implement long-running processes 23m37s.
A more resilient design would prioritize handling errors locally and providing a better user experience, even if it means not providing an immediate synchronous response 23m2s.

Handling Long-Running Processes in Payments

The discussion extends the example of flight bookings to include payment collection, specifically handling credit card payments, which typically involves using an external API service like Stripe. 24m15s
There is a challenge with service availability when charging credit cards, as the service might not be available at the time of the transaction, necessitating alternative solutions to avoid disappointing customers. 25m10s
In distributed systems, remote call exceptions can arise from various issues, such as network problems or service provider failures, making it difficult to determine the exact cause of the failure. 26m0s
Handling exceptions in distributed systems is complex because it is unclear whether a transaction was completed, which can lead to issues like double charging if not managed properly. 26m29s
Solutions to these issues include using workflows or running periodic reconciliation jobs to ensure transactions are correctly processed and any discrepancies are addressed. 26m49s
Embracing asynchronous thinking is recommended, where APIs are designed to acknowledge requests and provide results later, using HTTP codes to communicate the status of the request. 27m16s
Long-running processes can extend the options of what an API can do, and making APIs asynchronous allows for better handling of long-running tasks within services, giving more freedom to implement requirements as desired 27m47s.
Payment options can be extended to include customer credits held on an account, similar to the credits some companies offer for returned goods or to how PayPal holds funds before deducting from a bank account, which provides more options for handling payments 28m11s.
Implementing long-running processes can pose new problems around consistency, such as handling transactions across different services, like credit handling or credit card charging, and ensuring that all steps are technically transactional 29m0s.
In distributed systems, failing a payment process can require compensating actions, such as rebooking customer credits, to maintain consistency, and this complexity can arise quickly when considering all implications 29m33s.
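The compensation idea can be sketched as a list of steps, each paired with an undo action that runs if a later step fails. The helper `run_with_compensation` and the step names are hypothetical; the sketch also assumes that compensations themselves succeed, which a production system could not take for granted.

```python
from typing import Callable, List, Tuple

# A step is (name, action, compensation).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed: {name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated: {done_name}")
            return False, log
        completed.append(step)
        log.append(f"done: {name}")
    return True, log
```

For example, if customer credit is deducted first and the credit card charge then fails, the credit deduction is rebooked so the customer's balance ends up where it started.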

Service Boundaries and Long-Running Processes

Having long-running capabilities is necessary for designing good services and service boundaries, and this technical capability should be present in the architecture 30m17s.
A booking service can tell a payment service to retrieve payment via a message or REST call, and if the credit card is rejected, the next step would be to ask the customer to provide new details, allowing them to still book their flight 30m41s.
Long-running processes can be used to handle scenarios where a customer needs to provide new credit card details after the initial rejection, and this can be achieved through a workflow that includes compensating actions 31m40s.
GitHub subscriptions have a fully automated process for renewal, but if the credit card is invalid, an email is sent to update the card, introducing a long-running process that requires handling 31m59s.
A common reaction to this requirement is to pass it to a component that already handles long-running processes, such as booking, but this can lead to domain concept leakage and added complexity 32m43s.
Booking should not know about credit card details, as it only cares about receiving payment, and handling payment methods should be separate 33m21s.
Domain-driven design (DDD) also emphasizes the importance of separating domain language and concepts, and in this case, the booking service should not care about credit card rejection 33m43s.
To handle long-running requirements within payment, it must be easy for teams to implement them, and workflows or orchestration can help 34m12s.
Payment might be fast and synchronous in most cases, but handling edge cases where it's not is crucial, and designing an API that can handle both cases is necessary 34m33s.
Using workflows or orchestration can help implement long-running processes, and having these capabilities available in different services can make it easier to distribute the process correctly among microservices 35m10s.
Not having long-running capabilities in a service like payment can lead to monolithic design if the logic is moved to another service, such as booking, just because it has the capability 35m47s.
Putting long-running capabilities at the disposal of every service avoids the creation of monolithic "God Services", makes it easier to distribute responsibilities correctly, and lets teams embrace long-running, asynchronous, non-blocking processes 36m1s.
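An API that is fast in the common case but can fall back to a long-running mode might look like the following sketch. It models responses as plain objects instead of using a web framework; `PaymentApi`, its method names and the response bodies are assumptions for illustration. The status codes follow standard HTTP semantics: 200 for a completed request and 202 Accepted for one that will finish later.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Response:
    status: int
    body: Dict[str, str]


class PaymentApi:
    """Answers synchronously when it can, otherwise accepts and finishes later."""

    def __init__(self, charge: Callable[[str], bool]) -> None:
        # charge returns True if the payment went through immediately.
        self._charge = charge
        self._states: Dict[str, str] = {}

    def retrieve_payment(self, payment_id: str) -> Response:
        if self._charge(payment_id):
            self._states[payment_id] = "completed"
            return Response(200, {"paymentId": payment_id, "state": "completed"})
        # Edge case: keep the state and let the caller check back later.
        self._states[payment_id] = "pending"
        return Response(202, {"paymentId": payment_id, "state": "pending"})

    def complete_later(self, payment_id: str) -> None:
        """Invoked by the long-running part, e.g. after new card details arrive."""
        self._states[payment_id] = "completed"

    def status(self, payment_id: str) -> Response:
        state = self._states.get(payment_id)
        if state is None:
            return Response(404, {"paymentId": payment_id, "state": "unknown"})
        return Response(200, {"paymentId": payment_id, "state": state})
```

With this shape, the booking service only learns whether payment is completed or still pending; credit card rejection and the follow-up with the customer stay inside the payment service.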

Organizational Strategies for Long-Running Processes

A good architecture requires a process orchestration capability, which can be obtained as a service, either internally or externally, and can be easily implemented by a team 37m17s.
Organizations that successfully use process orchestration often have a Center of Excellence, a dedicated team that cares about process orchestration, process automation, and related topics 37m58s.
A Center of Excellence should focus on enablement and providing a platform, rather than building solutions, and should enable others to build things by consulting, helping, and providing technology 38m50s.
The traditional model of central teams being involved in solution creation has been replaced by a model where central teams focus on enabling others, and this shift is driven by the need for autonomy and freedom in decision-making 39m32s.
The creation of a Center of Excellence is not a step backward towards centralization, but rather a way to enable teams to make their own decisions while still providing guidance and support 40m10s.
The concept of team topologies is discussed, emphasizing different types of teams to enhance development efficiency. These include stream-aligned teams focused on business logic, enabling teams with a consulting function, and platform teams providing necessary technology. 40m17s
Stream-aligned teams are designed to maximize productivity and reduce friction, allowing them to deliver business value effectively. 40m50s
Enabling teams assist by consulting across projects, while platform teams supply the technology needed, preventing teams from having to figure out everything independently. 41m11s
The complicated subsystem team is mentioned but not emphasized, as it deals with specialized tasks like fraud checks or AI services. 41m24s
Organizations can map these team structures effectively, often using a center of excellence for process orchestration and automation, with tools like Camunda or RPA tools. 41m36s
This approach prevents teams from spending excessive time in evaluation mode without delivering business value. 42m20s
Spotify's "Golden Path" concept from 2020 is highlighted, where defined solution templates are provided for building specific types of applications, making it easy and desirable for teams to use them without being forced. 42m39s
The "Golden Path" approach helps avoid "rumor-driven development," which is not scalable and can lead to inefficient technology use. 43m40s
Spotify also developed an open-source tool called Backstage.io to support this approach. 44m9s
Spotify's approach to development emphasizes autonomy for teams, but as the company grows, the software ecosystem becomes more complex and fragmented, leading to slower development speeds 44m28s.
Standardization of services and tooling can help free engineers from infrastructure complexity, rather than restricting autonomy, as seen in the concept of the "standards paradox" 44m47s.
Companies like Twilio offer a "paved path" of pre-built services that lets teams get up and running quickly, and creating an incentive structure can encourage teams to take this path 45m10s.

Graphical Models and Long-Running Processes

Graphical models, such as BPMN, can be used to express complex processes in a simple and powerful way, and can be used for living documentation, test cases, and operations 45m55s.
Graphical models can also be used to discuss complex processes with different stakeholders, including non-developers, and can help elevate decisions about long-running behavior to the business level 46m50s.
Visualizing complex processes is important for making decisions about long-running behavior 47m41s.
Fully leveraging the new architecture requires redesigning the customer journey, and graphical models can be a powerful tool in that redesign 47m59s.

Real-World Examples and Conclusion

The airline industry has seen significant changes in customer experience over the last five years, with automation playing a key role in improving services, such as automatic check-in for flights 48m28s.
A personal experience with a delayed and canceled flight to London demonstrated the use of automation in rebooking and providing updates through email and a mobile app, although some issues still required human intervention 48m38s.
The use of long-running capabilities, process orchestration platforms, and workflow engines can help design better service boundaries, reduce complexity, and provide a better customer experience 50m22s.
Embracing asynchronicity and using these technologies can also increase operational efficiency, automation, and compliance, while reducing risk and documenting processes 50m46s.
To successfully adopt these technologies across an organization, central enablement is necessary, and resources such as books, websites, and conferences can provide more information on the topic 51m8s.
