In Part 5 of this series, we added MCP to our AI agent, enabling dynamic tool integration without code changes. However, we discovered another limitation: when users want to upload expense receipts, invoices, or travel documents, the agent can only process text—it cannot analyze images or extract information from visual documents.
Text-only AI agents miss critical information embedded in images, scanned documents, charts, and diagrams. In business scenarios like expense management, travel booking confirmations, or invoice processing, most information arrives as images or PDFs rather than structured text.
In this post, we’ll add multi-modal capabilities (vision and document analysis) and multi-model support to our AI agent, allowing it to analyze images, extract structured data from receipts, and use different AI models optimized for specific tasks.
Overview of the Solution
The Multi-Modal Challenge
Text-only AI agents cannot:
- Extract information from receipt images or scanned invoices
- Analyze charts, diagrams, or infographics
- Process travel booking confirmations (often PDFs with images)
- Understand visual context in documents
- Verify expense compliance from uploaded receipts
Users expect to upload a photo of their restaurant receipt and have the AI extract the amount, date, merchant, and check policy compliance automatically.
We’ll solve this with multi-modal AI and multi-model architecture:
- Add vision-capable models that can analyze images and documents
- Create specialized service for document analysis with expense extraction
- Configure different models for different tasks (chat vs. document analysis)
- Route requests based on content type (text vs. image/document)
What is Multi-Modal AI?
Multi-modal AI models can process and understand multiple types of input:
- Text: Natural language questions and responses
- Images: Photos, screenshots, diagrams
- Documents: PDFs, scanned receipts, invoices
- Combined: Text prompts with attached images
Vision-capable models such as Claude Sonnet and Amazon Nova can “see” images and extract structured information, enabling use cases like:
- Expense receipt analysis and extraction
- Invoice processing and validation
- Travel document verification
- Chart and diagram interpretation
- Visual quality inspection
What is Multi-Model Architecture?
Different AI models excel at different tasks. A multi-model architecture uses:
- Chat Model: Optimized for conversational interactions, reasoning, and tool calling
- Document Model: Optimized for vision, document analysis, and structured data extraction
- Embedding Model: Optimized for semantic search and RAG (already using Titan Embeddings)
This allows you to choose the best model for each task while maintaining a unified user experience.
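To make this concrete, here is a minimal sketch of per-request model selection with Spring AI's ChatClient (assuming Spring AI 1.x; the class and the Bedrock model ID are illustrative, not taken from the sample repository):

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.prompt.ChatOptions;

class ModelRoutingExample {

    private final ChatClient chatClient;

    ModelRoutingExample(ChatClient.Builder builder) {
        // One ChatClient; the default chat model comes from application.properties
        this.chatClient = builder.build();
    }

    String askDocumentModel(String question) {
        return chatClient.prompt()
                // Override the model for this request only, so document-analysis
                // calls use a vision-capable model while chat keeps its default
                .options(ChatOptions.builder()
                        .model("eu.anthropic.claude-sonnet-4-5-20250929-v1:0")
                        .build())
                .user(question)
                .call()
                .content();
    }
}
```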
Architecture Overview
```
                 User Request
                      │
                      ▼
               ChatController
               /             \
        text only         file attached
             │                  │
             ▼                  ▼
        ChatService     DocumentChatService
             │                  │
             ▼                  ▼
        Chat Model      Document Model (vision)
```
Key Spring AI Components
- ChatClient: Unified interface for all AI models
- Multi-Modal Support: Handle text, images, and documents in prompts
- Model Options: Configure different models per request
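As a quick illustration of the multi-modal piece, the following sketch attaches an image to a user prompt (the Media import path assumes Spring AI 1.x, and the resource name is just an example):

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.content.Media;
import org.springframework.core.io.ClassPathResource;
import org.springframework.util.MimeTypeUtils;

class VisionPromptExample {

    String totalAmount(ChatClient chatClient) {
        return chatClient.prompt()
                .user(u -> u
                        .text("What is the total amount on this receipt?")
                        // The image travels in the same request as the text prompt
                        .media(new Media(MimeTypeUtils.IMAGE_PNG,
                                new ClassPathResource("receipts/sample-receipt.png"))))
                .call()
                .content();
    }
}
```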
Prerequisites
Before you start, ensure you have:
- Completed Part 5 of this series with the working `ai-agent` application
- Java 21 JDK installed (Amazon Corretto 21)
- Maven 3.6+ installed
- Docker Desktop running (for Testcontainers with PostgreSQL/PGVector)
- AWS CLI configured with access to Amazon Bedrock
- Access to vision-capable models in Amazon Bedrock (Claude Sonnet, Amazon Nova)
Navigate to your project directory from Part 5:
```bash
cd ai-agent
```
Multi-Model Configuration
We’ll configure two models: one for chat interactions and one for document analysis.
Configure Models
Add model configuration to src/main/resources/application.properties:
```bash
cat >> src/main/resources/application.properties << 'EOF'
```
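The heredoc body isn't shown above; judging by the description that follows, the added properties likely resemble this sketch (the `app.ai.document-model` key and both model IDs are assumptions, not verbatim from the post):

```properties
# Existing chat model from earlier parts (ID illustrative)
spring.ai.bedrock.converse.chat.options.model=eu.amazon.nova-pro-v1:0

# Vision-capable model for document analysis; custom key (name assumed)
# read by DocumentChatService
app.ai.document-model=eu.anthropic.claude-sonnet-4-5-20250929-v1:0
```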
This configuration:
- Uses Claude Sonnet 4.5 for document analysis (vision-capable)
- Keeps the existing chat model configuration for conversations
- Allows independent model selection for different tasks
You can use different models for chat and documents. For example, use Nova Pro for chat (cost-effective) and Claude Sonnet for documents (superior vision capabilities).
Document Analysis Service
We’ll create a specialized service for analyzing images and documents with vision models.
Create DocumentChatService
Create the service that handles multi-modal document analysis:
```bash
mkdir -p src/main/java/com/example/ai/agent/service
```
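The complete service lives in the sample repository; what follows is a minimal sketch of its shape, assuming Spring AI 1.x and the `app.ai.document-model` property sketched earlier (class internals and method names are illustrative):

```java
package com.example.ai.agent.service;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.prompt.ChatOptions;
import org.springframework.ai.content.Media;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.stereotype.Service;
import org.springframework.util.MimeType;
import reactor.core.publisher.Flux;

import java.util.Base64;

@Service
public class DocumentChatService {

    private final ChatClient chatClient;
    private final String documentModel;

    public DocumentChatService(ChatClient.Builder builder,
            @Value("${app.ai.document-model}") String documentModel) {
        this.chatClient = builder.build();
        this.documentModel = documentModel;
    }

    public Flux<String> analyzeDocument(String userMessage, String base64Data, String contentType) {
        byte[] bytes = Base64.getDecoder().decode(base64Data);
        String instructions = """
                %s

                If this is an expense document, extract the merchant, date, amount,
                and currency, and check it against the expense policy.
                """.formatted(userMessage);
        return chatClient.prompt()
                // Route this request to the vision-capable model
                .options(ChatOptions.builder().model(documentModel).build())
                .user(u -> u
                        .text(instructions)
                        .media(new Media(MimeType.valueOf(contentType),
                                new ByteArrayResource(bytes))))
                .stream()   // stream tokens back to the UI as they arrive
                .content();
    }
}
```

As the feature list below notes, the real service also wires in policy retrieval (RAG) and currency-conversion tools, which this sketch omits.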
Key features:
- Multi-modal prompts: Combines text and images in a single request
- Expense extraction: Structured prompt for extracting receipt information
- Policy compliance: Checks expenses against company policies (via RAG)
- Currency conversion: Uses tools to convert amounts to EUR
- Streaming response: Provides immediate feedback and streams results
Update ChatController
Update ChatController to route document requests:
src/main/java/com/example/ai/agent/controller/ChatController.java
```java
import com.example.ai.agent.service.DocumentChatService;
```
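The rest of the controller isn't reproduced here; a sketch of the routing might look like this (the endpoint path, request record, and ChatService signature are assumptions):

```java
import com.example.ai.agent.service.ChatService;
import com.example.ai.agent.service.DocumentChatService;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;

@RestController
public class ChatController {

    private final ChatService chatService;                   // text-only requests
    private final DocumentChatService documentChatService;   // requests with files

    public ChatController(ChatService chatService,
            DocumentChatService documentChatService) {
        this.chatService = chatService;
        this.documentChatService = documentChatService;
    }

    // Request shape is an assumption: files arrive base64-encoded
    public record ChatRequest(String message, String fileData, String fileType) {}

    @PostMapping(value = "/chat", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<String> chat(@RequestBody ChatRequest request) {
        // Route by content type: attached file -> vision model, text -> chat model
        if (request.fileData() != null && !request.fileData().isBlank()) {
            return documentChatService.analyzeDocument(
                    request.message(), request.fileData(), request.fileType());
        }
        return chatService.chat(request.message());
    }
}
```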
The controller now:
- Accepts file uploads as base64-encoded strings
- Routes requests with files to DocumentChatService
- Routes text-only requests to ChatService
- Maintains the same streaming response interface
Update WebViewController
Enable multi-modal features in the UI:
src/main/java/com/example/ai/agent/controller/WebViewController.java
```java
...
```
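The exact change isn't shown above; one plausible shape, purely as an assumption, is a flag the template checks before rendering the upload button:

```java
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.GetMapping;

@Controller
public class WebViewController {

    @GetMapping("/")
    public String index(Model model) {
        // Attribute name is hypothetical; the chat template uses it to
        // decide whether to render the file-upload (📎) button
        model.addAttribute("multiModalEnabled", true);
        return "chat";
    }
}
```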
This enables the file upload button in the web interface, allowing users to attach images and documents to their messages.
Testing Multi-Modal Capabilities
If you completed Part 5, you can either start the travel MCP server from that part, or comment out the MCP client configuration in `application.properties` to test the multi-modal features independently:

```properties
# spring.ai.mcp.client.sse.connections.travel.url=http://localhost:8082
```
Let’s test document analysis with a tram ticket image:
```bash
./mvnw spring-boot:test-run
```
Download the sample ticket image:
```bash
# Download sample tram ticket
```
Test in the UI at http://localhost:8080:
- Click the file upload button (📎)
- Select the `ticket-tram-cz.png` image
- Type “Analyze this expense receipt” in the message box
- Click Send
The AI will analyze the tram ticket image and extract:
```
Analyzing document...
```
✅ Success! The AI agent can now analyze images and extract structured information.
Keep using the file upload feature in the UI at http://localhost:8080 to analyze other receipts, invoices, or travel documents.
To test expense registration, you can download the `backend` MCP server from the sample repository and connect your AI agent to it using the techniques from Part 5. The backend server provides expense management tools that work seamlessly with document analysis.
How Multi-Modal Works
When you upload a receipt image, here’s what happens:
- Image upload: The browser converts the image to base64 and sends it to the server
- Routing: ChatController detects the attached file and routes the request to DocumentChatService
- Vision analysis: Claude Sonnet analyzes the image and extracts the expense data
- Policy check: RAG retrieves the relevant expense policies
- Currency conversion: Tools convert amounts to EUR if needed (when the backend server is connected)
- Structured response: The agent formats the results with an approval status
The AI “sees” the receipt image and extracts text, amounts, dates, and merchant information automatically.
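If you prefer the command line to the web UI, a request of this shape should exercise the same path (the endpoint and JSON field names follow the hypothetical controller sketch above, not necessarily the sample repository):

```bash
# Encode the ticket image; -N keeps curl from buffering the streamed reply
BASE64=$(base64 < ticket-tram-cz.png | tr -d '\n')
curl -N -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d "{\"message\":\"Analyze this expense receipt\",\"fileData\":\"$BASE64\",\"fileType\":\"image/png\"}"
```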
Multi-Model Benefits
Using different models for different tasks provides:
- Cost optimization: Use cheaper models for simple chat, expensive models for complex vision tasks
- Performance optimization: Vision models for documents, fast models for chat
- Capability matching: Use models with specific strengths (Claude for vision, Nova for speed)
- Flexibility: Switch models without changing application code
Cleanup
To stop the application, press Ctrl+C in the terminal where it’s running.
The PostgreSQL container will continue running (due to `withReuse(true)`). If necessary, stop and remove it:

```bash
docker stop ai-agent-postgres
docker rm ai-agent-postgres
```
(Optional) To remove all data and start fresh:
```bash
docker volume prune
```
Commit Changes
```bash
git add .
# Example commit message
git commit -m "Add multi-modal document analysis and multi-model support"
```
Conclusion
In this post, we’ve added multi-modal and multi-model capabilities to our AI agent:
- Multi-Modal Support: Analyze images, receipts, invoices, and documents
- Document Analysis Service: Specialized service for vision-based extraction
- Multi-Model Architecture: Different models for chat vs. document analysis
- Expense Extraction: Structured data extraction from receipt images
- Policy Compliance: Automatic checking against company policies
Our AI agent now has a complete, production-ready architecture: memory (Part 2), knowledge (Part 3), real-time information (Part 4), dynamic tool integration (Part 5), and multi-modal capabilities (Part 6). It can handle text conversations, analyze images, integrate with any service via MCP, and use the best AI model for each task—all essential capabilities for enterprise AI applications.
What’s Next
Explore production deployment patterns, monitoring and observability, security best practices, and scaling strategies for AI agents in enterprise environments.
Learn More
- Spring AI Multi-Modal Documentation
- Amazon Bedrock Converse API
- Claude Vision Capabilities
- Amazon Nova Models
- Part 1: Create an AI Agent
- Part 2: Add Memory
- Part 3: Add Knowledge
- Part 4: Add Tools
- Part 5: Add MCP
Let’s continue building intelligent Java applications with Spring AI!