
Multimodal AI should fit like broken-in denim

Project Astra sees through a phone camera, hears ambient sound, and remembers what you showed it last Tuesday. It can locate an object in a room, pull up Google Maps before you finish the sentence, and, through the glasses, recall a saved door code and show it to you privately. No announcement. Just the right information, at the right moment, visible only to you. The technical achievement is impressive. My open question: what should it feel like to have Project Astra as a companion?

POV through Project Astra smart glasses showing a hand pressing a numeric keypad. An AI overlay displays "The door code you saved previously is 1170." Labels in the upper left read Cross-Device Memory and Glasses. Project Astra recalls a door code through the glasses, at the moment it is needed.

That question breaks into three parts. How we might reduce friction without asking people to switch modes or announce what they need. How a companion might know when to reassure and when to simply stay quiet. And how that contrast between presence and silence could make this feel like a cooperative relationship worth trusting.


Multimodal interaction only matters if it creates less friction

The engineering ambition behind Project Astra is real. Google DeepMind fused live video, audio, and text into a single transformer trained on all three simultaneously. The result is something that feels less like typing into a search bar and more like talking to a person who can see what you see.

But capability is not the same as value.

My experience designing assistive technology taught me this. The tools people with disabilities use are not limited to their phones: a walking stick, the objects they are already manipulating, hands that are already full. Having a lot of capability does not mean you can reach it while moving through the real world. The quality bar is higher because people have to multitask, and people with disabilities most of all. Being able to find the right thing by voice is what is truly powerful; the technology carries the weight so you never have to stop, switch apps, or interact with the phone itself.

Multimodal AI has the opportunity to work the same way: fitting around the tools people already hold, until switching modes is something they never have to think about.

Rear view of a person wearing a bicycle helmet and a patterned poncho, pausing at a park entrance. A subtitle reads "Can you check if I can bike in there?" Labels indicate Tool Use, Maps, and Glasses. Hands on the handlebars, the companion checks Maps by voice. No stopping, no switching.


Choosing what not to do

Astra’s roadmap includes agents that handle professional workflows, manage files, and navigate Chrome on your behalf. A 95% success rate is impressive for a chatbot. For an agent authorized to move money or delete files, a 5% failure rate is the kind of risk that keeps people from ever fully handing over control. One wrong transfer means calling your bank, explaining what happened, and starting over. That is enough to make anyone take back control.

The design question is not what the assistant can do. It is what the assistant should choose not to do. Restraint is the hardest interaction to ship because there is no demo moment. You cannot show a crowd the conversation that did not happen, the action held back, the question asked before the decision was made. But that is the work. That is what trust is made of.

That question has a version anyone has felt. Imagine being on a bus and asking your AI companion for directions. It answers out loud: the route, the stop, the neighborhood. Every person on that bus now knows where you are going. For a tourist it is an awkward moment. For someone navigating a difficult situation, it is an exposure they did not choose. The AI answered correctly. What it did not ask was whether speaking out loud was the right choice in that moment.

A young man sits alone on a bus, looking out the window. A subtitle reads "The 24 bus route goes through Leicester Square, which is very close to Chinatown." Labels indicate Tool Use, Maps, and Multimodal Reasoning. No headphones. Astra reads a destination out loud. The question the AI did not ask: should it?
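
The fix is a question asked before the answer is delivered. Here is a minimal TypeScript sketch of that check, a thought experiment rather than anything Astra actually implements; every type, name, and rule in it is an illustrative assumption.

```typescript
// Hypothetical sketch: the question Astra did not ask, made explicit.
// All types, names, and rules here are illustrative assumptions.

type Channel = "voice" | "private_audio" | "glasses_display" | "screen";

interface Context {
  setting: "home" | "transit" | "street" | "store";
  headphonesConnected: boolean;
  glassesWorn: boolean;
  userAllowsVoiceHere: boolean; // a learned, per-place preference
}

// Prefer the least exposing channel that still answers the question.
function chooseChannel(ctx: Context): Channel {
  if (ctx.headphonesConnected) return "private_audio"; // only the user hears it
  if (ctx.glassesWorn) return "glasses_display";       // only the user sees it
  if (ctx.setting === "home" && ctx.userAllowsVoiceHere) return "voice";
  return "screen"; // public space, no private channel: stay quiet, show it
}

// On the bus, no headphones: the route appears on screen, not out loud.
chooseChannel({
  setting: "transit",
  headphonesConnected: false,
  glassesWorn: false,
  userAllowsVoiceHere: false,
}); // => "screen"
```

The design choice worth noticing is the default: when no private channel exists, the answer still arrives, just on the screen instead of out loud for everyone around you.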

When I worked on Voice Access for Android alongside teams passionate about accessibility, the conversation we kept coming back to was the same design problem in a different form: not what the AI says out loud, but what the screen reveals to the people around you. The people we were designing for were low-vision users who needed the camera running and the screen on just to find the exit on a transit bus, locate a medication on a pharmacy shelf, or identify the right aisle in a grocery store without asking a stranger for help. They could not dim the screen when someone leaned over to look. The constraint we were working through was simpler to state than to solve: how do you protect what is on that screen from the person sitting two seats away?

That question pointed to something we were not calling by its right name at the time. It was not just a privacy feature. It was a contextual decision: the recognition that the environment a person is in is also part of the experience. The people who can see the screen without permission are part of the design problem. The technology had to hold that context, not just the hand holding the device.

That was nine years ago. Today, the Samsung Galaxy S26 Ultra ships a feature called Privacy Display: it controls how light leaves the screen, restricting what people at side angles can see, applied automatically when entering passwords, configurable by app. The problem my team was working through on a whiteboard with disability advocates is now something the industry ships as a flagship feature.

The multimodal AI assistant of the future will be designed around hundreds of contextual decisions exactly like that one, not only for the person using the device, but aware of who else can see it. A form factor like Astra’s glasses already moves toward that answer by removing the screen from the equation entirely. In the meantime, screen technology like Samsung’s can do the same work for the phone in someone’s hand on a bus.


Peripheral vision, not a camera

The best multimodal AI will not feel like a camera pointed at the world. It will feel like peripheral vision: the sense that is working before you turn your head, that catches movement before you consciously register it, that is always on without demanding your attention.

Pre-washed denim works the same way. Already shaped to you before you put it on, softened by someone else’s work so you do not have to earn it. The difference matters. Stiff leather needs to be broken in; you adapt to it before it adapts to you. Technology has worked this way for decades: you learn the interface, you train the model, you adjust your behavior to meet the machine. What multimodal AI can offer is the opposite: an assistant that arrives already fitted to your context, holds what you have shown it, and gets easier with use instead of harder.

Project Astra’s persistent memory is the beginning of that fit: keeping track of objects just out of sight, the things set down or left behind, without asking you to repeat yourself. The integration with Maps matters for the same reason, not just navigation but awareness of where you are going and what you want to do when you get there. At home, on the bus, on the way to the pharmacy, inside the pharmacy: the context changes and the companion changes with it. It knows when to stay quiet at a bus stop. It remembers when you feel comfortable speaking out loud. That is when this cooperative relationship earns trust: not because it does more, but because it knows when to do less.
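
To make "holds what you have shown it" concrete, here is a small hypothetical sketch of contextual recall, the door-code moment from the opening written as code. The memory store, field names, and gating rule are all illustrative assumptions, not Astra's actual memory system.

```typescript
// Hypothetical sketch of contextual recall: memories indexed by the
// situation they belong to, surfaced when that situation recurs.
// Names, fields, and the store itself are illustrative assumptions.

interface Memory {
  content: string;
  trigger: { place: string };           // the situation that makes it relevant
  sensitivity: "private" | "shareable"; // gates which surface may show it
}

const memories: Memory[] = [
  {
    content: "The door code you saved previously is 1170",
    trigger: { place: "front door" },
    sensitivity: "private",
  },
];

// Recall is passive: the current situation is the query, not a typed request.
function recall(place: string, privateSurface: boolean): string | null {
  const hit = memories.find((m) => m.trigger.place === place);
  if (!hit) return null;
  // Sensitive memories wait for a private surface, like the glasses display.
  if (hit.sensitivity === "private" && !privateSurface) return null;
  return hit.content;
}

recall("front door", true);  // => the code, shown only through the glasses
recall("front door", false); // => null: silence, not exposure
```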

A hand holds a smartphone running Project Astra, camera pointed at a red London double-decker bus on a city street. A subtitle reads "Will that bus take me anywhere near Chinatown?" In an unfamiliar city, context becomes serendipity. Point, ask, know: just the question and the world in front of you.

The assistant that earns trust is the one that fits around what people already carry and sees what they almost missed. Peripheral vision is not passive; it is always building context. At home on a familiar route, you might prefer the quiet. Traveling somewhere unfamiliar, you want the wider lens: the street name you did not look up, the building you walked past, the bus connection you would not have thought to ask about. Context becomes serendipity. That is when this stops being a tool you point at things and starts being a companion that sees the world with you.


First sketches

Four moments from a single trip to the pharmacy. Each scenario maps what the AI perceives, what it decides, and what it holds back. Together they are a concrete place to explore: how the AI reduces friction without asking, when silence builds more trust than any response, and what it feels like when restraint is the right design decision.

Design scenario · Step 1 of 4

Entering the pharmacy: context before action

What the AI perceives

Person stepping inside a pharmacy entrance
  • location: pharmacy, strongest public context signal
  • store sign at entrance: exact location confirmed
  • reduced ambient volume, indoor acoustic signature

What the AI decides

  • switches to screen-only alerts, minimal display mode
  • audio output suppressed

Location is the anchor. Vision and audio confirm it. Together they resolve: public context.

Four moments in a pharmacy — and the AI's contextual decisions at each.
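
For readers who think in code, Step 1's logic has roughly this shape. The sketch below is TypeScript, with signal names, the volume threshold, and the decision table all illustrative assumptions drawn from the scenario above, not an implementation.

```typescript
// Hypothetical sketch of Step 1: weak signals resolved into one context
// classification, which then gates the output decisions. The threshold
// and field names are illustrative assumptions, not measured values.

interface Signals {
  locationCategory: "pharmacy" | "home" | "transit" | "unknown";
  signRecognized: boolean; // vision: the store sign confirms the location
  ambientVolumeDb: number; // audio: a quiet, indoor acoustic signature
}

type ContextClass = "public" | "private" | "uncertain";

function resolveContext(s: Signals): ContextClass {
  const soundsIndoor = s.ambientVolumeDb < 55; // illustrative threshold
  // Location is the anchor; vision and audio confirm or contradict it.
  if (s.locationCategory === "pharmacy" && s.signRecognized && soundsIndoor) {
    return "public";
  }
  if (s.locationCategory === "home") return "private";
  return "uncertain"; // when signals disagree, the safe default is restraint
}

function decide(ctx: ContextClass) {
  return ctx === "private"
    ? { alerts: "default", display: "full", audio: "enabled" }
    : { alerts: "screen-only", display: "minimal", audio: "suppressed" };
}

decide(resolveContext({
  locationCategory: "pharmacy",
  signRecognized: true,
  ambientVolumeDb: 48,
})); // => screen-only alerts, minimal display, audio suppressed
```

Note the tie-break: an uncertain context is treated like a public one. Restraint is the default, not the exception.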

These sketches became a case study with a working prototype. The rest are coming together in my Field Notes section.

What are you building that earns that trust? Show me something cool.


Marco Lobato · marcolobato.ux@gmail.com · March 2026