Closing the Visual-API Gap: How to Automate Legacy Portals with AI Vision Agents

You watch an ops manager process a standard customer refund. It takes five minutes. They check Zendesk for the complaint, open Shopify to verify the order, log into a clunky warehouse portal to confirm the return receipt, and finally switch to Stripe to issue the actual refund.
Five minutes is nothing for one ticket. But when you scale to two hundred refunds a day, that manual context-switching becomes a full-time job. The immediate instinct is to automate it. You look at the API documentation for your warehouse system and realise it hasn't been updated since 2018.
This is where the automation dream stalls. You are stuck choosing between a custom integration build that will cost £20,000 and a manual process that drains your team's energy every single afternoon.
The visual-API gap
The visual-API gap is the operational bottleneck that occurs when your modern SaaS tools can talk to each other perfectly, but a critical legacy system requires human eyes and mouse clicks to function. You have Zendesk, Shopify, and Stripe singing in perfect harmony through webhooks. Then you hit the warehouse management portal. It only exists as a web interface with no API access whatsoever.
This structural flaw affects nearly every UK SME that handles physical goods or complex service delivery. It persists because replacing a core legacy system is too risky and expensive. So, you patch the gap with human labour.
Your accounts assistant becomes a human API. They read data from a modern dashboard and manually type it into a web form that was built a decade ago. It feels like productive work. It is actually just data entry masking a system failure.
The cost is not just the hourly wage of the person doing the clicking. The real cost is the error rate. When humans copy and paste order numbers across three different screens for four hours straight, they make mistakes. A transposed digit means a refund goes to the wrong account. A returned item is never logged back into inventory.
The visual-API gap forces your smartest people to do your dumbest work. They spend their days moving text from a white screen to a grey screen. You cannot scale a business when your core operations rely on someone manually bridging the divide between modern software and legacy portals. Every new customer just adds another manual click to the pile.
Why rigid screen scraping fails silently
Rigid screen scraping fails because it relies on absolute coordinates rather than visual understanding. The standard advice is to buy a Robotic Process Automation tool to bridge the visual-API gap. You map out the exact sequence of clicks, record a macro, and tell a bot to repeat the process. It seems like a logical fix. It is actually a disaster waiting to happen.
Traditional RPA relies on brittle rules. It looks for a specific HTML div or a fixed set of screen coordinates. If your legacy supplier portal adds a new notification banner at the top of the screen, the entire page shifts down by fifty pixels. Your bot does not know this. It clicks where the "Refund" button used to be. It hits "Cancel Order" instead, and moves on to the next ticket.
If an RPA bot tries to read a customer address from a fixed text box, and the portal changes the font size, the bot scrapes empty space. It silently writes a blank field into your database. You only notice at month-end when your logistics partner asks why fifty return labels have no destination address attached.
This is the technical reality of rigid UI automation. It fails silently. An API workflow will throw a 400 Bad Request error if a payload is formatted incorrectly. You get an alert, and you fix it. A traditional screen scraper just keeps clicking blindly. It creates a mess of corrupted data that you only discover during an audit.
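The difference is easy to see in code. A sketch, with a hypothetical payments endpoint: the API path raises the moment something is wrong, while a fixed-coordinate screen read just returns whatever happens to be at that position, blank or not:

```python
import json
import urllib.request


def refund_via_api(order_id: str, amount_pence: int) -> dict:
    """API path: a malformed request fails loudly.

    urlopen raises HTTPError on a 400 Bad Request, so a monitoring
    alert fires and someone fixes the payload.
    """
    req = urllib.request.Request(
        "https://api.example-payments.test/refunds",  # hypothetical endpoint
        data=json.dumps({"order_id": order_id, "amount": amount_pence}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def read_via_fixed_coordinates(page_text: str, x: int, y: int) -> str:
    """Scraper path: a shifted layout fails silently.

    A fixed-coordinate read has no way to know the page moved;
    it returns whatever sits at that position, or an empty string.
    """
    lines = page_text.splitlines()
    if y >= len(lines) or x >= len(lines[y]):
        return ""  # a blank field goes downstream, no alert raised
    return lines[y][x : x + 12]
```

With the original layout, `read_via_fixed_coordinates("ORDER-123456", 0, 0)` reads the order number correctly. Add a notification banner above it and the same call happily returns the banner text instead, with no error anywhere.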
I see this pattern repeatedly when SMEs try to automate customer support without proper API access. They pay for expensive RPA licences. They spend weeks building rigid click-paths. Then they abandon the system the first time a web interface updates its layout.
You cannot rely on coordinates or CSS selectors to handle financial transactions. You need a system that can actually see and interpret the screen the way a human operator does. If a button moves, the system needs to look for the new location. It must read the text and understand the context before taking action.
Hybrid automation with vision agents

A hybrid architecture uses n8n to orchestrate API calls while delegating visual tasks on legacy portals to an AI vision agent.
Hybrid automation combines standard API workflows for modern tools with vision-based UI agents for legacy interfaces. You use APIs where they exist, and you deploy an AI agent to handle the messy screens. This is now possible using Anthropic's computer use capability for Claude 3.5 Sonnet [source](https://docs.anthropic.com/en/docs/computer-use).
Here is exactly how you structure a complex refund automation. A customer emails support asking for a refund. Zendesk AI tags the ticket and triggers an n8n webhook. n8n immediately makes a standard API call to Shopify to pull the order details and verify the purchase date.
This is where standard automation stops. But now, n8n triggers a Python script running in a secure Docker container. This script calls the Claude Computer Use API. You pass Claude the order number and a strict set of instructions. You tell it to log into the warehouse portal, search for the order, and confirm if the item has been returned to the shelf.
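Under the hood, that Python script is mostly payload construction. Here is a minimal sketch of the request body it might send, assuming the computer-use tool type and model name from Anthropic's beta documentation at the time of writing (check the current docs before relying on either); the portal instructions are illustrative:

```python
def build_lookup_task(order_number: str) -> dict:
    """Build a request body for a read-only warehouse lookup.

    The tool type and model name follow Anthropic's computer-use
    beta docs at the time of writing -- verify against the current
    documentation before using them in anger.
    """
    instructions = (
        f"Log into the warehouse portal, search for order {order_number}, "
        "and report its return status. Respond only with JSON in the form "
        '{"order_number": "...", "status": "Returned" | "Not returned" | "Unknown"}. '
        "Use only the search box and navigation. "
        "Never click Refund, Cancel, or Delete."
    )
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "tools": [
            {
                "type": "computer_20241022",  # virtual screen/mouse/keyboard tool
                "name": "computer",
                "display_width_px": 1280,
                "display_height_px": 800,
            }
        ],
        "messages": [{"role": "user", "content": instructions}],
    }
```

Note that the prohibition on destructive clicks lives in the prompt itself, not just in the surrounding infrastructure.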
Claude actually controls a virtual mouse and keyboard within that container. It looks at screenshots of the portal. It finds the search bar regardless of where it is on the page, types the order number, and reads the status. It is not relying on fixed coordinates. It is visually interpreting the interface.
If Claude sees the status is "Returned", it sends a structured JSON response back to n8n. n8n then makes a final, secure API call to Stripe to issue the refund. It updates the Zendesk ticket and closes the loop. The entire process takes forty seconds, requires zero human intervention, and leaves a perfect audit trail in your database.
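The decision logic back in n8n should be defensive: nothing reaches Stripe unless the agent's reply parses cleanly and matches the ticket. A sketch, where the JSON shape is an assumption for illustration rather than a fixed schema:

```python
import json


def decide_refund(agent_reply: str, expected_order: str) -> dict:
    """Validate the agent's structured reply before any money moves.

    Anything malformed or unexpected routes the ticket to a human;
    only a clean 'Returned' status triggers the Stripe call.
    """
    try:
        data = json.loads(agent_reply)
    except json.JSONDecodeError:
        return {"action": "route_to_human", "reason": "unparseable agent reply"}

    if data.get("order_number") != expected_order:
        return {"action": "route_to_human", "reason": "order number mismatch"}

    if data.get("status") == "Returned":
        return {"action": "issue_stripe_refund", "order_number": expected_order}

    return {"action": "route_to_human", "reason": f"status was {data.get('status')!r}"}
```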
You are blending the reliability of API automation with the flexibility of a human operator. The API handles the money, while the UI agent handles the legacy lookup. Building this hybrid system takes two to three weeks. You should expect to spend £6,000 to £12,000 depending on the complexity of your legacy portal and your existing n8n infrastructure.
To catch failure modes, you enforce strict boundary conditions. The UI agent is only given read-only access to the warehouse system. It is never allowed to click a "Refund" button. It only extracts data. If the agent gets confused by a major UI update, it simply returns an error to n8n. n8n then routes the ticket back to a human.
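One way to enforce those boundaries in code is a policy layer that vets every action the agent proposes before it reaches the virtual mouse. The action shape below is a simplified assumption for illustration, not the exact tool schema:

```python
# Labels the agent must never click, and the only actions it may take.
FORBIDDEN_LABELS = {"refund", "cancel", "delete", "approve"}
ALLOWED_ACTIONS = {"screenshot", "type", "key", "mouse_move", "left_click", "scroll"}


def vet_action(action: dict) -> dict:
    """Allow only read-and-navigate actions.

    Anything suspicious raises, which aborts the run and lets n8n
    route the ticket back to a human instead of guessing.
    """
    if action.get("action") not in ALLOWED_ACTIONS:
        raise PermissionError(f"blocked action: {action.get('action')!r}")

    target = (action.get("target_label") or "").lower()
    if action.get("action") == "left_click" and any(
        word in target for word in FORBIDDEN_LABELS
    ):
        raise PermissionError(f"blocked click on {target!r}")

    return action
```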
Where hybrid UI automation falls apart
Hybrid UI automation breaks down when you face physical security barriers, high-frequency transaction volumes, or unreadable legacy screens. This architecture is powerful, but it is not a universal fix. You need to verify a few constraints before committing to a build.
First, look at your authentication methods. If your legacy portal requires a physical 2FA hardware token or a biometric login, a virtualised UI agent cannot access it. You will spend weeks trying to bypass security protocols only to realise the system is locked down by design.
Second, consider the latency and cost of LLM inference. Claude taking screenshots, analysing them, and moving a virtual mouse takes time. A single lookup might take thirty seconds and cost a few pence in API tokens.
If you are processing fifty refunds a day, that is perfectly fine. If you are processing ten thousand micro-transactions an hour, the compute costs will destroy your margins. The latency will create massive backlogs that your ops team cannot clear.
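A quick back-of-envelope check makes that threshold concrete. Using the illustrative figures above (roughly thirty seconds and a few pence per lookup, not benchmarks), a sketch:

```python
def agent_capacity_check(
    tasks_per_day: int,
    seconds_per_task: float = 30.0,
    pence_per_task: float = 3.0,
    agents: int = 1,
) -> dict:
    """Back-of-envelope cost and backlog check for a vision agent.

    The defaults are illustrative figures, not measured benchmarks.
    """
    hours_of_compute = tasks_per_day * seconds_per_task / 3600
    daily_cost_gbp = tasks_per_day * pence_per_task / 100
    # A single sequential agent can only clear so many 30-second lookups a day.
    max_per_agent_per_day = int(24 * 3600 / seconds_per_task)
    return {
        "hours_of_compute": round(hours_of_compute, 1),
        "daily_cost_gbp": round(daily_cost_gbp, 2),
        "backlog_risk": tasks_per_day > agents * max_per_agent_per_day,
    }
```

Fifty refunds a day works out at well under an hour of compute and a couple of pounds. Ten thousand transactions an hour works out at thousands of pounds a day and a queue one agent could never clear.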
Finally, be wary of systems with highly dynamic, flash-based, or heavily obfuscated interfaces. If a human struggles to read the screen because the text is embedded in low-resolution images, the vision model will struggle too. Always test the agent manually on your messiest screens before writing a single line of orchestration code. You need to know the limits of the vision model before you trust it with live customer data.
Three questions to sit with
- Which of your daily operational tasks are currently acting as human bridges between two disconnected software systems? Identify the specific workflows where your team spends more time copying data from a modern dashboard into a legacy portal than they do actually making decisions.
- If your current screen scraping tool encounters a completely redesigned web interface tomorrow, will it fail loudly and alert you, or will it fail silently and corrupt your database? Look closely at how your existing automations handle unexpected UI changes, because silent failures in financial processes are far more expensive than manual labour.
- Are you forcing API-level reliability onto a system that only offers a visual interface, rather than using a vision-based agent to bridge the visual-API gap? Evaluate whether you are wasting thousands of pounds trying to build brittle reverse-engineered APIs when a flexible UI agent could simply read the screen.
Get our UK AI insights.
Practical reads on AI for UK businesses — teardowns, how-to guides, regulatory news. Unsubscribe anytime.