In this tutorial, we build an advanced computer-using agent from scratch that can reason, plan, and perform virtual actions using local open-weight models. We create a miniature simulated desktop, equip it with a tool interface, and design an intelligent agent that can analyze its environment, decide on actions such as clicking or typing, and execute them step by step. Finally, we look at how the agent interprets goals such as opening email or taking notes, demonstrating how a local language model can mimic interactive reasoning and task execution.
!pip install -q transformers accelerate sentencepiece nest_asyncio
import torch, asyncio, uuid
from transformers import pipeline
import nest_asyncio
nest_asyncio.apply()
We set up our environment by installing the necessary libraries, including Transformers, Accelerate, and nest_asyncio, which let us run local models and asynchronous tasks seamlessly in Colab. We prepare the runtime so that the upcoming components of our agent can work efficiently without external dependencies.
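Before building the agent, an optional sanity check (not part of the original walkthrough) confirms that the runtime sees a GPU, if one is attached, and that the Flan-T5 pipeline loads and generates:
# Optional sanity check (illustrative): confirm the device and that Flan-T5 loads.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1   # 0 = first GPU, -1 = CPU
probe = pipeline("text2text-generation", model="google/flan-t5-small", device=device)
print("Running on", "GPU" if device == 0 else "CPU")
print(probe("Translate to French: hello", max_new_tokens=8)[0]["generated_text"])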
class LocalLLM:
    def __init__(self, model_name="google/flan-t5-small", max_new_tokens=128):
        # Load a small local text2text model; use the GPU if available, otherwise CPU.
        self.pipe = pipeline("text2text-generation", model=model_name, device=0 if torch.cuda.is_available() else -1)
        self.max_new_tokens = max_new_tokens
    def generate(self, prompt: str) -> str:
        # Greedy decoding keeps the agent's decisions deterministic.
        out = self.pipe(prompt, max_new_tokens=self.max_new_tokens, do_sample=False)[0]["generated_text"]
        return out.strip()
class VirtualComputer:
    def __init__(self):
        # Simulated apps, current focus, visible "screen" text, and an action history.
        self.apps = {"browser": "https://example.com", "notes": "", "mail": ["Welcome to CUA", "Invoice #221", "Weekly Report"]}
        self.focus = "browser"
        self.screen = "Browser open at https://example.com\nSearch bar focused."
        self.action_log = []
    def screenshot(self):
        # Return a text rendering of the current desktop state.
        return f"FOCUS:{self.focus}\nSCREEN:\n{self.screen}\nAPPS:{list(self.apps.keys())}"
    def click(self, target: str):
        # Clicking a known app switches focus and redraws the screen.
        if target in self.apps:
            self.focus = target
            if target == "browser":
                self.screen = f"Browser tab: {self.apps['browser']}\nAddress bar focused."
            elif target == "notes":
                self.screen = f"Notes App\nCurrent notes:\n{self.apps['notes']}"
            elif target == "mail":
                inbox = "\n".join(f"- {s}" for s in self.apps['mail'])
                self.screen = f"Mail App Inbox:\n{inbox}\n(Read-only preview)"
        else:
            self.screen += f"\nClicked '{target}'."
        self.action_log.append({"type": "click", "target": target})
    def type(self, text: str):
        # Typing updates whichever app currently has focus.
        if self.focus == "browser":
            self.apps["browser"] = text
            self.screen = f"Browser tab now at {text}\nPage headline: Example Domain"
        elif self.focus == "notes":
            self.apps["notes"] += ("\n" + text)
            self.screen = f"Notes App\nCurrent notes:\n{self.apps['notes']}"
        else:
            self.screen += f"\nTyped '{text}' but no editable field."
        self.action_log.append({"type": "type", "text": text})
We define the main components: a lightweight local model and a virtual computer. We use Flan-T5 as our logic engine and create a simulated desktop that can open apps, display screens, and respond to clicking and typing actions.
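To get a feel for the simulated desktop on its own, a minimal interaction might look like the following (an illustrative snippet; the output strings follow the class definition above):
# Illustrative, standalone use of the virtual desktop defined above.
vc = VirtualComputer()
print(vc.screenshot())   # browser is focused by default

vc.click("notes")        # switch focus to the notes app
vc.type("buy milk")      # append a line to the notes buffer
print(vc.screenshot())   # now shows "Notes App" with "buy milk"
print(vc.action_log)     # [{'type': 'click', ...}, {'type': 'type', ...}]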
class ComputerTool:
    def __init__(self, computer: VirtualComputer):
        self.computer = computer
    def run(self, command: str, argument: str = ""):
        # Map high-level commands onto the virtual computer and report a status.
        if command == "click":
            self.computer.click(argument)
            return {"status": "completed", "result": f"clicked {argument}"}
        if command == "type":
            self.computer.type(argument)
            return {"status": "completed", "result": f"typed {argument}"}
        if command == "screenshot":
            snap = self.computer.screenshot()
            return {"status": "completed", "result": snap}
        return {"status": "error", "result": f"unknown command {command}"}
We present the ComputerTool interface, which acts as a communication bridge between the agent's logic and the virtual desktop. We define high-level operations, such as click, type, and screenshot, that let the agent interact with the environment in a structured way.
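A few illustrative calls show the structured results the agent will later consume; the "scroll" command here is a made-up example used only to demonstrate the error path:
# Illustrative calls against the tool wrapper defined above.
tool = ComputerTool(VirtualComputer())
print(tool.run("click", "mail"))               # {'status': 'completed', 'result': 'clicked mail'}
print(tool.run("screenshot")["result"][:80])   # first characters of the mail inbox view
print(tool.run("scroll", "down"))              # {'status': 'error', 'result': 'unknown command scroll'}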
class ComputerAgent:
    def __init__(self, llm: LocalLLM, tool: ComputerTool, max_trajectory_budget: float = 5.0):
        self.llm = llm
        self.tool = tool
        self.max_trajectory_budget = max_trajectory_budget
    async def run(self, messages):
        user_goal = messages[-1]["content"]
        steps_remaining = int(self.max_trajectory_budget)
        output_events = []
        total_prompt_tokens = 0
        total_completion_tokens = 0
        # One reasoning/action step per loop iteration, up to the trajectory budget.
        while steps_remaining > 0:
            screen = self.tool.computer.screenshot()
            prompt = (
                "You are a computer-use agent.\n"
                f"User goal: {user_goal}\n"
                f"Current screen:\n{screen}\n\n"
                "Think step-by-step.\n"
                "Reply with: ACTION <click/type/screenshot> ARG <target or text> THEN <assistant message>.\n"
            )
            thought = self.llm.generate(prompt)
            total_prompt_tokens += len(prompt.split())
            total_completion_tokens += len(thought.split())
            # Defaults used if the model's reply cannot be parsed.
            action = "screenshot"; arg = ""; assistant_msg = "Working..."
            # Parse the ACTION / ARG / THEN fields from the model's reply.
            for line in thought.splitlines():
                if line.strip().startswith("ACTION "):
                    after = line.split("ACTION ", 1)[1]
                    action = after.split()[0].strip()
                if "ARG " in line:
                    part = line.split("ARG ", 1)[1]
                    if " THEN " in part:
                        arg = part.split(" THEN ")[0].strip()
                    else:
                        arg = part.strip()
                if "THEN " in line:
                    assistant_msg = line.split("THEN ", 1)[1].strip()
            output_events.append({"summary": [{"text": assistant_msg, "type": "summary_text"}], "type": "reasoning"})
            call_id = "call_" + uuid.uuid4().hex[:16]
            tool_res = self.tool.run(action, arg)
            output_events.append({"action": {"type": action, "text": arg}, "call_id": call_id, "status": tool_res["status"], "type": "computer_call"})
            snap = self.tool.computer.screenshot()
            output_events.append({"type": "computer_call_output", "call_id": call_id, "output": {"type": "input_image", "image_url": snap}})
            output_events.append({"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": assistant_msg}]})
            # Stop early once the model signals that the goal is reached.
            if "done" in assistant_msg.lower() or "here is" in assistant_msg.lower():
                break
            steps_remaining -= 1
        usage = {"prompt_tokens": total_prompt_tokens, "completion_tokens": total_completion_tokens, "total_tokens": total_prompt_tokens + total_completion_tokens, "response_cost": 0.0}
        yield {"output": output_events, "usage": usage}
We create the ComputerAgent, which acts as the intelligent controller of the system. We program it to reason about the goal, decide which action to take, execute it through the tool interface, and record each interaction as a step in its decision-making trajectory.
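To make the ACTION/ARG/THEN protocol concrete, here is a hand-written reply (a hypothetical example, not an actual Flan-T5 output) run through the same parsing logic the agent uses:
# Hypothetical model reply in the format the prompt requests.
sample = "ACTION click ARG mail THEN Opening the mail app to read the inbox."

action, arg, assistant_msg = "screenshot", "", "Working..."
for line in sample.splitlines():
    if line.strip().startswith("ACTION "):
        action = line.split("ACTION ", 1)[1].split()[0].strip()
    if "ARG " in line:
        part = line.split("ARG ", 1)[1]
        arg = part.split(" THEN ")[0].strip() if " THEN " in part else part.strip()
    if "THEN " in line:
        assistant_msg = line.split("THEN ", 1)[1].strip()

print(action, "|", arg, "|", assistant_msg)   # click | mail | Opening the mail app to read the inbox.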
async def main_demo():
    computer = VirtualComputer()
    tool = ComputerTool(computer)
    llm = LocalLLM()
    agent = ComputerAgent(llm, tool, max_trajectory_budget=4)
    messages = [{"role": "user", "content": "Open mail, read inbox subjects, and summarize."}]
    async for result in agent.run(messages):
        print("==== STREAM RESULT ====")
        for event in result["output"]:
            if event["type"] == "computer_call":
                a = event.get("action", {})
                print(f"[TOOL CALL] {a.get('type')} -> {a.get('text')} [{event.get('status')}]")
            if event["type"] == "computer_call_output":
                snap = event["output"]["image_url"]
                print("SCREEN AFTER ACTION:\n", snap[:400], "...\n")
            if event["type"] == "message":
                print("ASSISTANT:", event["content"][0]["text"], "\n")
        print("USAGE:", result["usage"])

# nest_asyncio lets us drive the event loop inside Colab's already-running loop.
loop = asyncio.get_event_loop()
loop.run_until_complete(main_demo())
We bring everything together by running a demo in which the agent interprets the user's request and carries out the task on the virtual computer. We see it generate reasoning, execute commands, update the virtual screen, and work toward its goal in a clear, step-by-step manner.
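The same scaffolding can be pointed at a different goal; the variation below (not part of the original code) asks the agent to write a note instead of reading mail:
# Variation on the demo: ask the agent to jot down a note instead of reading mail.
async def note_demo():
    agent = ComputerAgent(LocalLLM(), ComputerTool(VirtualComputer()), max_trajectory_budget=3)
    messages = [{"role": "user", "content": "Open notes and write 'team sync at 3pm'."}]
    async for result in agent.run(messages):
        for event in result["output"]:
            if event["type"] == "message":
                print("ASSISTANT:", event["content"][0]["text"])

asyncio.get_event_loop().run_until_complete(note_demo())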
In conclusion, we have implemented the essence of a computer-using agent capable of autonomous reasoning and step-by-step action. We see how a local language model such as Flan-T5 can simulate desktop-level automation within a secure, text-based sandbox. This project helps us understand the architecture behind computer-use agents, which combine natural-language reasoning with virtual tool control, and it lays a strong foundation for extending these capabilities to real-world, multimodal, and safe automation systems.