MCP Has an Image Problem

I added trademark image retrieval to Patent Connector. The MCP specification defines exactly how a server should return images to a client. I implemented it by the book. Then I tested it across clients, and no two of them handled the result the same way.

What the Spec Says

The MCP specification defines an ImageContent type for returning images from tools. A server sends base64-encoded image data with a MIME type, plus optional annotations that hint at how the content should be used:

{
  "content": [
    {
      "type": "image",
      "data": "<base64-encoded JPEG>",
      "mimeType": "image/jpeg",
      "annotations": {
        "audience": ["user"],
        "priority": 0.8
      }
    },
    {
      "type": "text",
      "text": "View trademark image (temporary link): https://..."
    }
  ]
}

The audience annotation tells the client who the content is for - "user" means display it to the human, "assistant" means feed it to the language model. This is our implementation in Go:

content := []mcp.Content{
    &mcp.ImageContent{
        MIMEType: output.MIMEType,    // "image/jpeg"
        Data:     output.ImageData,    // raw JPEG bytes
        Annotations: &mcp.Annotations{
            Audience: []mcp.Role{"user"},
        },
    },
}

The Go MCP SDK handles base64 encoding automatically via standard json.Marshal behavior for []byte fields. The image is a 10 KB JPEG. Nothing exotic. By the spec, any MCP client should be able to render this.
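
To make the encoding step concrete, here is a minimal sketch using only the Go standard library. The imagePayload struct and the encodeImage helper are hypothetical stand-ins for illustration, not the SDK's actual types. The sketch only shows that json.Marshal turns a []byte field into a base64 string without any extra code:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// imagePayload is a hypothetical stand-in for an image content item.
// It is not the SDK type; it only mirrors the JSON field names.
type imagePayload struct {
	Type     string `json:"type"`
	Data     []byte `json:"data"` // encoding/json marshals []byte as base64
	MIMEType string `json:"mimeType"`
}

// encodeImage marshals raw image bytes into the JSON shape shown earlier.
func encodeImage(raw []byte, mimeType string) (string, error) {
	out, err := json.Marshal(imagePayload{Type: "image", Data: raw, MIMEType: mimeType})
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	raw := []byte{0xFF, 0xD8, 0xFF, 0xE0} // first four bytes of a JFIF JPEG
	out, err := encodeImage(raw, "image/jpeg")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
	// The "data" field matches what the base64 package produces directly.
	fmt.Println(base64.StdEncoding.EncodeToString(raw))
}
```

The encoded value starts with /9j/, the same prefix that shows up in the data URI ChatGPT produces below.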

What Actually Happened

ChatGPT Desktop does not understand ImageContent from MCP tools. Instead of rendering the image, it passes the raw base64 string to the language model as text. The model, trying to be helpful, wraps it in a markdown image tag: ![Trademark Logo](data:image/jpeg;base64,/9j/4AAQ...). ChatGPT doesn't render data URIs in markdown either. The user sees a wall of base64 characters scrolling off the screen.

[Image: ChatGPT dumping base64 as text]

I added a fallback - a temporary URL that serves the image over HTTP. ChatGPT picks up the URL and tries to render it. The image placeholder appears, a spinner loads, but the image never shows. Whether that's a CORS issue, a security restriction, or something else - the result is the same. The user doesn't see the trademark.

[Image: ChatGPT broken image]
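
For readers who want to try the same workaround, a temporary-URL fallback can be sketched roughly as below. This is not the Patent Connector implementation. The tempImageStore type, the /img/ path, the 16-byte token, and the TTL in seconds are all assumptions made for illustration, and the sketch does not try to address whatever stops ChatGPT from loading the image:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"strings"
	"sync"
	"time"
)

// tempImage is one stored image with its expiry time (hypothetical type).
type tempImage struct {
	data     []byte
	mimeType string
	expires  time.Time
}

// tempImageStore keeps images in memory under random tokens (hypothetical type).
type tempImageStore struct {
	mu     sync.Mutex
	images map[string]tempImage
}

func newTempImageStore() *tempImageStore {
	return &tempImageStore{images: make(map[string]tempImage)}
}

// Put stores an image and returns a random token valid for ttlSeconds.
func (s *tempImageStore) Put(data []byte, mimeType string, ttlSeconds int) (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	token := hex.EncodeToString(buf)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.images[token] = tempImage{
		data:     data,
		mimeType: mimeType,
		expires:  time.Now().Add(time.Duration(ttlSeconds) * time.Second),
	}
	return token, nil
}

// Get returns the image for a token and drops it once it has expired.
func (s *tempImageStore) Get(token string) (tempImage, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	img, ok := s.images[token]
	if !ok {
		return tempImage{}, false
	}
	if time.Now().After(img.expires) {
		delete(s.images, token)
		return tempImage{}, false
	}
	return img, true
}

// ServeHTTP serves /img/<token> with the stored MIME type.
func (s *tempImageStore) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	img, ok := s.Get(strings.TrimPrefix(r.URL.Path, "/img/"))
	if !ok {
		http.NotFound(w, r)
		return
	}
	w.Header().Set("Content-Type", img.mimeType)
	w.Header().Set("Cache-Control", "no-store")
	w.Write(img.data)
}

func main() {
	store := newTempImageStore()
	token, err := store.Put([]byte{0xFF, 0xD8, 0xFF, 0xE0}, "image/jpeg", 300)
	if err != nil {
		panic(err)
	}
	mux := http.NewServeMux()
	mux.Handle("/img/", store)
	fmt.Println("temporary path: /img/" + token)
	// A real server would now call http.ListenAndServe(":8080", mux).
}
```

The token doubles as the access control here: anyone with the link can fetch the image until it expires, which is why the lifetime is kept short.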

Claude Desktop is the closest to working correctly. The image renders. But it's tucked inside a collapsed tool result panel that the user has to click to expand. The model's actual response just mentions the temporary URL - it doesn't describe the image because the audience: ["user"] annotation correctly prevented the image data from being sent to the language model's context.

[Image: Claude Desktop with image in collapsed tool result]

Claude Code and other CLI clients show only the JSON metadata. The image content is silently dropped. A terminal can't render images, which is fair enough, but there's no indication that an image was returned at all.

I've Seen This Before

This reminds me of building websites in the early 2000s. There were standards - HTML and CSS. You wrote to them. Then you opened the page in Internet Explorer and half the layout was broken. You opened it in Firefox and it looked different again. Netscape had its own ideas entirely.

We wrote browser-specific hacks. Conditional comments for IE. CSS resets to normalize behavior. Vendor prefixes for features that every browser implemented differently. The spec was right. The implementations were all over the place.

That's where MCP is right now. The specification defines ImageContent clearly. The implementations don't agree on what to do with it. I'm shipping a temporary URL as a workaround - the MCP equivalent of a CSS hack for IE6. It works sometimes, in some clients, under some conditions.

The chat providers are the new browsers.

The Missing Piece

There's been discussion in the MCP community about adding a disposition concept to content annotations - something like "inline" versus "attachment" to tell clients where content should appear in the conversation flow. This would let a server say "render this image directly in the chat, not hidden behind a panel."

It never made it into the spec. The audience annotation controls whether content gets sent to the language model, not where it shows up visually. Without a disposition mechanism, the client decides entirely on its own how to present images - and as we've seen, every client decides differently.
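
To illustrate what audience does and does not decide, here is a sketch of how a client might route tool results. It is a guess at client behavior, not taken from any real client. The contentItem type and the routeByAudience function are hypothetical, and treating an empty audience as "everyone" is an assumption:

```go
package main

import "fmt"

// contentItem is a hypothetical, simplified tool result item.
type contentItem struct {
	Type     string
	Audience []string // "user" and/or "assistant"; empty means no hint
}

// hasRole reports whether role is in audience.
// Assumption: an empty audience means the item is meant for everyone.
func hasRole(audience []string, role string) bool {
	if len(audience) == 0 {
		return true
	}
	for _, a := range audience {
		if a == role {
			return true
		}
	}
	return false
}

// routeByAudience splits items into what the human sees and what the model gets.
// Nothing here says where on screen the user-facing items should appear.
func routeByAudience(items []contentItem) (forUser, forModel []contentItem) {
	for _, it := range items {
		if hasRole(it.Audience, "user") {
			forUser = append(forUser, it)
		}
		if hasRole(it.Audience, "assistant") {
			forModel = append(forModel, it)
		}
	}
	return forUser, forModel
}

func main() {
	items := []contentItem{
		{Type: "image", Audience: []string{"user"}},
		{Type: "text"}, // no audience hint
	}
	forUser, forModel := routeByAudience(items)
	fmt.Println(len(forUser), len(forModel)) // prints "2 1"
}
```

The function answers "who gets this?" and nothing else. Whether the image lands inline or behind a collapsed panel is still up to the client.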

The CLI Question

Many people in the MCP community gravitate toward CLI tools. There's a certain appeal - the command line is back, and it has its strengths. But there were reasons why graphical interfaces took over. Windows didn't win on technical merit alone. It won because humans are visual creatures. We process images faster than text. We want to see the trademark logo, not read its base64 encoding.

Try rendering a trademark image in your terminal.

MCP is on a good trajectory. The spec has the building blocks for rich content - images, audio, annotations, audience targeting. But a specification only matters when implementations follow it. Right now, the experience of returning an image from an MCP tool depends almost entirely on which client your user happens to be running. The server does everything right and the user's experience is a coin flip.

The chat providers are in charge here. The same way browser vendors eventually converged on web standards, MCP clients will need to converge on content rendering. I hope it happens faster this time. We already know how that story ends.


The trademark image tool and all other Patent Connector features are available at patent.dev/patent-connector. For questions about MCP server development or patent data integration, reach out through patent.dev.
