In Part 5 of this series, we added MCP to our AI agent, enabling dynamic tool integration without code changes. However, we discovered another limitation: when users want to upload expense receipts, invoices, or travel documents, the agent can only process text: it cannot analyze images or extract information from visual documents.

Text-only AI agents miss critical information embedded in images, scanned documents, charts, and diagrams. In business scenarios like expense management, travel booking confirmations, or invoice processing, most information arrives as images or PDFs rather than structured text.

In this post, we’ll add multi-modal capabilities (vision and document analysis) and multi-model support to our AI agent, allowing it to analyze images, extract structured data from receipts, and use different AI models optimized for specific tasks.

Overview of the Solution

The Multi-Modal Challenge

Text-only AI agents cannot:

  • Extract information from receipt images or scanned invoices
  • Analyze charts, diagrams, or infographics
  • Process travel booking confirmations (often PDFs with images)
  • Understand visual context in documents
  • Verify expense compliance from uploaded receipts

Users expect to upload a photo of their restaurant receipt and have the AI extract the amount, date, merchant, and check policy compliance automatically.

We’ll solve this with multi-modal AI and multi-model architecture:

  1. Add vision-capable models that can analyze images and documents
  2. Create a specialized service for document analysis with expense extraction
  3. Configure different models for different tasks (chat vs. document analysis)
  4. Route requests based on content type (text vs. image/document)

What is Multi-Modal AI?

Multi-modal AI models can process and understand multiple types of input:

  • Text: Natural language questions and responses
  • Images: Photos, screenshots, diagrams
  • Documents: PDFs, scanned receipts, invoices
  • Combined: Text prompts with attached images
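
Spring AI exposes this directly on the ChatClient API: a single user message can carry both text and attached media. Here is a minimal sketch, assuming chatClient is backed by a vision-capable model and receipt.png is a placeholder file on the classpath:

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.core.io.ClassPathResource;
import org.springframework.util.MimeTypeUtils;

// chatClient must be built from a vision-capable model (e.g. Claude Sonnet)
String describeReceipt(ChatClient chatClient) {
    return chatClient.prompt()
            .user(u -> {
                u.text("What does this receipt show?");
                u.media(MimeTypeUtils.IMAGE_PNG, new ClassPathResource("receipt.png"));
            })
            .call()
            .content();
}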

Vision-capable models such as Claude Sonnet and Amazon Nova can “see” images and extract structured information from them, enabling use cases like:

  • Expense receipt analysis and extraction
  • Invoice processing and validation
  • Travel document verification
  • Chart and diagram interpretation
  • Visual quality inspection

What is Multi-Model Architecture?

Different AI models excel at different tasks. A multi-model architecture uses:

  • Chat Model: Optimized for conversational interactions, reasoning, and tool calling
  • Document Model: Optimized for vision, document analysis, and structured data extraction
  • Embedding Model: Optimized for semantic search and RAG (already using Titan Embeddings)

This allows you to choose the best model for each task while maintaining a unified user experience.
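
In code, this can be as simple as building separate ChatClient instances on top of the same auto-configured Bedrock ChatModel, each pinned to its own model ID. A minimal sketch (the model IDs are illustrative; use models you have Bedrock access to):

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.model.tool.ToolCallingChatOptions;

// One auto-configured ChatModel, two clients with different Bedrock model IDs
ChatClient chatClient = ChatClient.builder(chatModel)
        .defaultOptions(ToolCallingChatOptions.builder()
                .model("eu.amazon.nova-pro-v1:0")   // conversational tasks
                .build())
        .build();

ChatClient documentClient = ChatClient.builder(chatModel)
        .defaultOptions(ToolCallingChatOptions.builder()
                .model("global.anthropic.claude-sonnet-4-5-20250929-v1:0")   // vision tasks
                .build())
        .build();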

Architecture Overview

User Request
     ↓
[ChatController]
     ↓
Has Image/Document?
├─ Yes → [DocumentChatService] → Vision Model (Claude Sonnet)
└─ No  → [ChatService] → Chat Model (Claude Sonnet/Amazon Nova)
     ↓
[Memory + RAG + MCP Tools]
     ↓
Response

Key Spring AI Components

This post relies on a handful of Spring AI building blocks, all of which appear in the code below:

  • ChatClient: fluent API for sending prompts, including text plus attached media, to a model
  • ChatModel: the auto-configured Bedrock Converse model behind the clients
  • ToolCallingChatOptions: per-request options, used here to select the vision model by ID
  • Media attachments: userSpec.media(mimeType, resource) adds an image or document to a user message

Prerequisites

Before you start, ensure you have:

  • Completed Part 5 of this series with the working ai-agent application
  • Java 21 JDK installed (Amazon Corretto 21)
  • Maven 3.6+ installed
  • Docker Desktop running (for Testcontainers with PostgreSQL/PGVector)
  • AWS CLI configured with access to Amazon Bedrock
  • Access to vision-capable models in Amazon Bedrock (Claude Sonnet, Amazon Nova)

Navigate to your project directory from Part 5:

cd ai-agent

Multi-Model Configuration

We’ll configure two models: one for chat interactions and one for document analysis.

Configure Models

Add model configuration to src/main/resources/application.properties:

cat >> src/main/resources/application.properties << 'EOF'

# Document processing model (vision-capable)
ai.agent.document.model=global.anthropic.claude-sonnet-4-5-20250929-v1:0
EOF

This configuration:

  • Uses Claude Sonnet 4.5 for document analysis (vision-capable)
  • Keeps the existing chat model configuration for conversations
  • Allows independent model selection for different tasks

You can use different models for chat and documents. For example, use Nova Pro for chat (cost-effective) and Claude Sonnet for documents (superior vision capabilities).
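
For example, assuming the chat model property used in earlier parts of this series, that split could look like this in application.properties (both model IDs are illustrative; use models you have Bedrock access to):

# Chat: conversational, cost-effective
spring.ai.bedrock.converse.chat.options.model=eu.amazon.nova-pro-v1:0

# Documents: vision-capable
ai.agent.document.model=global.anthropic.claude-sonnet-4-5-20250929-v1:0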

Document Analysis Service

We’ll create a specialized service for analyzing images and documents with vision models.

Create DocumentChatService

Create the service that handles multi-modal document analysis:

mkdir -p src/main/java/com/example/ai/agent/service
cat <<'EOF' > src/main/java/com/example/ai/agent/service/DocumentChatService.java
package com.example.ai.agent.service;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.model.tool.ToolCallingChatOptions;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.http.MediaType;
import org.springframework.http.MediaTypeFactory;
import org.springframework.stereotype.Service;
import org.springframework.util.MimeType;
import org.springframework.util.MimeTypeUtils;
import reactor.core.publisher.Flux;
import java.util.Base64;

@Service
public class DocumentChatService {

    private static final Logger logger = LoggerFactory.getLogger(DocumentChatService.class);

    private final ChatClient documentChatClient;
    private final ChatService chatService;

    @Value("${ai.agent.document.model}")
    private String documentModel;

    public static final String DOCUMENT_ANALYSIS_PROMPT = """
            Extract expense information from this document.

            Required fields:
            - Document Type: [RECEIPT, INVOICE, TICKET, BILL, OTHER]
            - Expense Type: [MEALS, ACCOMMODATION, TRANSPORTATION, OFFICE_SUPPLIES, OTHER]
            - Amount and Currency
            - Date: [YYYY-MM-DD]

            Category-specific details:
            - ACCOMMODATION: check-in/out dates, nights, rate per night, location
            - MEALS: contains alcohol (yes/no)
            - TRANSPORTATION: type, route or location

            Check against the Expense Policy and provide approval status with reasoning.
            If not an expense document, provide a brief summary.
            For missing information, state "I don't know".
            """;

    public DocumentChatService(ChatModel chatModel, ChatService chatService) {
        this.documentChatClient = ChatClient.builder(chatModel)
                .defaultSystem(DOCUMENT_ANALYSIS_PROMPT)
                .build();
        this.chatService = chatService;
    }

    public Flux<String> processDocument(String prompt, String fileBase64, String fileName) {
        logger.info("Processing document: {}", fileName);

        return Flux.create(sink -> {
            // 1. Emit immediate feedback
            sink.next("Analyzing document...\n\n");

            // 2. Analyze document with multimodal AI
            String documentAnalysis = analyzeDocument(prompt, fileBase64, fileName);

            // 3. Stream structured summary with currency conversion
            String summaryPrompt = documentAnalysis + "\n\n" +
                    "Based on the extracted information, provide a structured summary including:\n" +
                    "- Amount in EUR: If original currency is EUR, use original amount. " +
                    "Otherwise, convert to EUR using the document date (or current date if unavailable).\n\n" +
                    "After presenting the information, ask the user to confirm and offer to register the expense.";

            chatService.processChat(summaryPrompt)
                    .subscribe(
                            chunk -> sink.next(chunk),
                            error -> sink.error(error),
                            () -> sink.complete()
                    );
        });
    }

    private String analyzeDocument(String prompt, String fileBase64, String fileName) {
        MimeType mimeType = determineMimeType(fileName);
        byte[] fileData = Base64.getDecoder().decode(fileBase64);
        ByteArrayResource resource = new ByteArrayResource(fileData);

        String userPrompt = (prompt != null && !prompt.trim().isEmpty())
                ? prompt
                : "Analyze this document";

        try {
            var chatResponse = documentChatClient
                    .prompt()
                    .options(ToolCallingChatOptions.builder()
                            .model(documentModel)
                            .build())
                    .user(userSpec -> {
                        userSpec.text(userPrompt);
                        userSpec.media(mimeType, resource);
                    })
                    .call().chatResponse();

            return (chatResponse != null)
                    ? chatResponse.getResult().getOutput().getText()
                    : "I don't know - no response received.";
        } catch (Exception e) {
            logger.error("Error analyzing document", e);
            return "I don't know - there was an error analyzing the document.";
        }
    }

    private MimeType determineMimeType(String fileName) {
        if (fileName != null && !fileName.trim().isEmpty()) {
            MediaType mediaType = MediaTypeFactory.getMediaType(fileName)
                    .orElse(MediaType.APPLICATION_OCTET_STREAM);
            return new MimeType(mediaType.getType(), mediaType.getSubtype());
        }
        return MimeTypeUtils.APPLICATION_OCTET_STREAM;
    }
}
EOF

Key features:

  • Multi-modal prompts: Combines text and images in a single request
  • Expense extraction: Structured prompt for extracting receipt information
  • Policy compliance: Checks expenses against company policies (via RAG)
  • Currency conversion: Uses tools to convert amounts to EUR
  • Streaming response: Provides immediate feedback and streams results

Update ChatController

Update ChatController to route document requests:

src/main/java/com/example/ai/agent/controller/ChatController.java

import com.example.ai.agent.service.DocumentChatService;
...
    private final DocumentChatService documentChatService;

    public ChatController(ChatService chatService,
                          ChatMemoryService chatMemoryService,
                          ConversationSummaryService summaryService,
                          DocumentChatService documentChatService) {
        this.chatService = chatService;
        this.chatMemoryService = chatMemoryService;
        this.summaryService = summaryService;
        this.documentChatService = documentChatService;
    }

    @PostMapping(value = "message", produces = MediaType.APPLICATION_OCTET_STREAM_VALUE)
    public Flux<String> chat(@RequestBody ChatRequest request, Principal principal) {
        String userId = getUserId(request.userId(), principal);
        chatMemoryService.setCurrentUserId(userId);

        // Route to document analysis or regular chat
        return hasFile(request)
                ? documentChatService.processDocument(request.prompt(), request.fileBase64(), request.fileName())
                : chatService.processChat(request.prompt());
    }

    ...

    private boolean hasFile(ChatRequest request) {
        return request.fileBase64() != null && !request.fileBase64().trim().isEmpty();
    }

    public record ChatRequest(String prompt, String userId, String fileBase64, String fileName) {}
...

The controller now:

  • Accepts file uploads as base64-encoded strings
  • Routes requests with files to DocumentChatService
  • Routes text-only requests to ChatService
  • Maintains the same streaming response interface
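
To exercise the endpoint without the UI, you can post the base64 payload directly. A sketch with curl (the /chat base path is an assumption carried over from the controller mapping in earlier parts; adjust it to your actual @RequestMapping):

# Encode the image inline and stream the response (path assumes a /chat base mapping)
curl -N -X POST http://localhost:8080/chat/message \
  -H "Content-Type: application/json" \
  -d "{\"prompt\":\"Analyze this expense receipt\",\"fileBase64\":\"$(base64 < ticket-tram-cz.png | tr -d '\n')\",\"fileName\":\"ticket-tram-cz.png\"}"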

Update WebViewController

Enable multi-modal features in the UI:

src/main/java/com/example/ai/agent/controller/WebViewController.java

...
    @Value("${ui.features.multi-user:true}")
    private boolean multiUserEnabled;

    @Value("${ui.features.multi-modal:true}")
    private boolean multiModalEnabled;

    @GetMapping("/")
    public String index(Model model) {
        model.addAttribute("multiUserEnabled", multiUserEnabled);
        model.addAttribute("multiModalEnabled", multiModalEnabled);
        return "chat";
    }
}

This enables the file upload button in the web interface, allowing users to attach images and documents to their messages.
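
Inside the chat template, the flag can gate the upload control. A hypothetical Thymeleaf fragment (the actual markup in the sample repository may differ):

<!-- chat.html: render the attach control only when multi-modal is enabled -->
<button type="button" th:if="${multiModalEnabled}" id="attach-file" title="Attach file">📎</button>
<input type="file" th:if="${multiModalEnabled}" id="file-input" accept="image/*,.pdf" hidden/>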

Testing Multi-Modal Capabilities

If you completed Part 5, either start the travel MCP server from that post, or comment out the MCP client configuration in application.properties to test multi-modal features independently:

# spring.ai.mcp.client.sse.connections.travel.url=http://localhost:8082

Let’s test document analysis with a tram ticket image:

./mvnw spring-boot:test-run

Download the sample ticket image:

# Download sample tram ticket
curl -o ticket-tram-cz.png https://raw.githubusercontent.com/aws-samples/java-on-aws/main/samples/spring-ai-te-agent/ai-agent/samples/ticket-tram-cz.png

Test in the UI at http://localhost:8080:

  1. Click the file upload button (📎)
  2. Select the ticket-tram-cz.png image
  3. Type “Analyze this expense receipt” in the message box
  4. Click Send

[Image: Upload ticket]

The AI will analyze the tram ticket image and extract:

Analyzing document...

| Field | Details |
|-------|---------|
| **Document Type** | Public Transport Ticket |
| **Date** | 2025-06-12 |
| **Time** | 17:13 |
| **Original Amount** | 30.00 CZK |
| **Amount in EUR** | €1.22* |
| **Expense Category** | TRANSPORTATION |
| **Description** | Prague public transport 30-minute transfer ticket (zones P, O, B) |
| **Vendor** | Dopravní Podnik hl. m. Prahy, a.s. |
| **Location** | Prague, Czech Republic |

**Policy Compliance:**
✅ Approved: Transportation expenses are within policy limits

Would you like me to register this expense?

[Image: Analysis result]

Success! The AI agent can now analyze images and extract structured information.

Try the same upload flow with your own receipts, invoices, or travel documents to see how the agent handles different document types.

To test expense registration, you can download the backend MCP server from the sample repository and connect your AI Agent to it using the techniques learned in Part 5. The backend server provides expense management tools that work seamlessly with document analysis.

How Multi-Modal Works

When you upload a receipt image, here’s what happens:

  1. Image upload: Browser converts image to base64 and sends to server
  2. Routing: ChatController detects file and routes to DocumentChatService
  3. Vision analysis: Claude Sonnet analyzes the image and extracts expense data
  4. Policy check: RAG retrieves relevant expense policies
  5. Currency conversion: Tools convert amounts to EUR if needed (when the backend MCP server is connected)
  6. Structured response: Formats results with approval status

The AI “sees” the receipt image and extracts text, amounts, dates, and merchant information automatically.

Multi-Model Benefits

Using different models for different tasks provides:

  • Cost optimization: Use cheaper models for simple chat, expensive models for complex vision tasks
  • Performance optimization: Vision models for documents, fast models for chat
  • Capability matching: Use models with specific strengths (Claude for vision, Nova for speed)
  • Flexibility: Switch models without changing application code
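
For example, thanks to Spring Boot's relaxed binding, you can point document analysis at a different Bedrock model at startup without touching code (the model ID below is illustrative):

# Overrides ai.agent.document.model for this run only
export AI_AGENT_DOCUMENT_MODEL=eu.anthropic.claude-sonnet-4-20250514-v1:0
./mvnw spring-boot:test-run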

Cleanup

To stop the application, press Ctrl+C in the terminal where it’s running.

The PostgreSQL container will continue running (due to withReuse(true)). If necessary, stop and remove it:

docker stop ai-agent-postgres
docker rm ai-agent-postgres

(Optional) To remove all data and start fresh (note that this removes all unused Docker volumes on your machine, not just this project's):

docker volume prune

Commit Changes

git add .
git commit -m "Add multi-modal document analysis and multi-model support"

Conclusion

In this post, we’ve added multi-modal and multi-model capabilities to our AI agent:

  • Multi-Modal Support: Analyze images, receipts, invoices, and documents
  • Document Analysis Service: Specialized service for vision-based extraction
  • Multi-Model Architecture: Different models for chat vs. document analysis
  • Expense Extraction: Structured data extraction from receipt images
  • Policy Compliance: Automatic checking against company policies

Our AI agent now has a complete, production-ready architecture: memory (Part 2), knowledge (Part 3), real-time information (Part 4), dynamic tool integration (Part 5), and multi-modal capabilities (Part 6). It can handle text conversations, analyze images, integrate with any service via MCP, and use the best AI model for each task—all essential capabilities for enterprise AI applications.

What’s Next

Explore production deployment patterns, monitoring and observability, security best practices, and scaling strategies for AI agents in enterprise environments.

Learn More

Let’s continue building intelligent Java applications with Spring AI!
