Building Agent Discovery: Technical Patterns from Registry to Agent2Agent Communication
Abstract
The vision of million-agent networks is compelling, but how do you actually build the discovery infrastructure to make it real? This article bridges the gap between theory and implementation, exploring practical patterns emerging from registry experiments, the Model Context Protocol (MCP) revolution, and production deployments.
We’ll examine four concrete approaches: DNS-based discovery, registry APIs, well-known URLs, and dynamic tool discovery through MCP. You’ll see how MCP acts as the “USB-C port for AI applications,” enabling runtime capability enumeration without hardcoded integrations. We’ll also tackle critical production challenges: the multiple context problem that fragments agent memory, security patterns for enterprise deployment, and the architectural decisions that determine whether your agent network scales or stalls at 1,000 agents.
This isn’t about hypothetical futures - it’s about the patterns, protocols, and pitfalls you’ll encounter when building discovery systems today. Whether you’re deploying 10 agents or planning for 10,000, these implementation insights will shape your architecture decisions.

Part 1: From Theory to Working Code
How do you actually build a discovery system?
In the first article, we explored the million-agent vision and identified three critical infrastructure gaps: Agent Registration Systems, Agent Naming Services, and Agent Gateways. Now comes the harder question: how do you actually implement these components?
The good news is that practical patterns are emerging from real-world experiments. The less good news is that most of these experiments hit the same scaling walls, security challenges, and coordination complexity that kills 90% of agent networks between 1,000 and 10,000 agents.
Let’s start where many projects begin: with a central registry.
Central Registry Architecture: The FastAPI Pattern
The simplest approach to agent discovery is a central registry - essentially a phone book for AI agents. Several open-source implementations have explored this pattern, and they reveal both what works and what doesn’t at scale.
The core architecture follows a familiar RESTful pattern:
Agent Registration: Each agent publishes its “digital business card” - a structured metadata document declaring its capabilities, endpoints, and requirements. Think of it as an API specification combined with a service advertisement. The registry validates this information, assigns identifiers, and makes the agent discoverable.
Discovery API: Other agents query the registry using filters: “Show me agents that can process financial PDFs” or “Find agents with Salesforce integration running in US-East.” The API returns matching agents with their connection details.
Health Monitoring: Agents send heartbeat signals to prove they’re alive and responsive. When heartbeats stop, the registry marks agents as unavailable and stops routing traffic to them. This graceful degradation prevents cascade failures when individual agents die.
The FastAPI implementation demonstrates these patterns beautifully using Python’s async/await for high-performance, non-blocking operations. Agents register via POST requests, update via PUT, and maintain liveness through periodic heartbeat POSTs. Discovery happens through GET requests with query parameters for filtering.
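The logic behind those endpoints can be sketched in plain Python. This is a minimal, illustrative in-memory version (class and field names are assumptions, not from any specific project) showing the three operations: registration, heartbeat, and filtered discovery.

```python
import time
from dataclasses import dataclass, field

# Sketch of the registry logic behind the FastAPI endpoints described above.
# Names (AgentRecord, Registry) and the 30-second timeout are illustrative.
HEARTBEAT_TIMEOUT = 30.0  # seconds without a heartbeat before an agent is considered down

@dataclass
class AgentRecord:
    agent_id: str
    capabilities: set
    endpoint: str
    last_heartbeat: float = field(default_factory=time.monotonic)

class Registry:
    def __init__(self):
        self._agents = {}

    def register(self, agent_id, capabilities, endpoint):
        # POST /agents - validate and store the agent's "digital business card"
        self._agents[agent_id] = AgentRecord(agent_id, set(capabilities), endpoint)

    def heartbeat(self, agent_id):
        # POST /agents/{id}/heartbeat - refresh the liveness timestamp
        if agent_id in self._agents:
            self._agents[agent_id].last_heartbeat = time.monotonic()

    def discover(self, capability):
        # GET /agents?capability=... - return only live agents matching the filter
        now = time.monotonic()
        return [
            a for a in self._agents.values()
            if capability in a.capabilities
            and now - a.last_heartbeat < HEARTBEAT_TIMEOUT
        ]

registry = Registry()
registry.register("pdf-1", {"pdf", "finance"}, "https://pdf-1.example.com/rpc")
registry.register("crm-1", {"salesforce"}, "https://crm-1.example.com/rpc")
print([a.agent_id for a in registry.discover("pdf")])  # ['pdf-1']
```

In a real FastAPI service each method becomes an async route handler, but the failure mode is identical: `self._agents` lives in one process, which is exactly the in-memory fragility discussed below.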
The Agent Card: Your Digital Business Card
At the heart of the registry pattern is the Agent Card - a standardized JSON document typically served at /.well-known/agent.json on the agent’s domain. This well-known URL pattern borrows from web standards like .well-known/security.txt and enables automatic agent discovery.
An Agent Card declares:
- Identity: Name, version, description, and operator contact
- Capabilities: What the agent can do, expressed as structured metadata
- Endpoints: How to communicate with the agent (API URLs, protocols)
- Requirements: Authentication methods, rate limits, and service dependencies
- Health: Current status and availability information
This standardization solves a critical problem: without it, every agent integration becomes a custom project. With Agent Cards, discovery becomes automatic.
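A card covering the fields above might look like the following. The exact A2A schema is richer than this; treat the field names and the minimal validation check as an approximation for illustration.

```python
import json

# Illustrative Agent Card matching the fields listed above. The real A2A
# schema has more structure; field names here are an approximation.
agent_card = {
    "name": "invoice-extractor",
    "version": "1.2.0",
    "description": "Extracts line items from PDF invoices",
    "operator": "ops@example.com",
    "capabilities": ["pdf-extraction", "invoice-processing"],
    "endpoints": {"a2a": "https://agents.example.com/rpc"},
    "authentication": ["oauth2"],
    "status": "available",
}

REQUIRED = {"name", "version", "capabilities", "endpoints"}

def validate_card(card):
    """Return the sorted list of required fields missing from a fetched card."""
    return sorted(REQUIRED - card.keys())

print(validate_card(agent_card))  # [] - nothing missing
print(json.dumps(agent_card, indent=2))
```

Serving this JSON at /.well-known/agent.json is all an agent needs to become discoverable by anything that knows its domain.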
The Registry Scaling Wall
Here’s where the pattern hits reality: most registry implementations use in-memory storage. This works beautifully for demos and proofs-of-concept. It fails catastrophically in production.
When your registry crashes, every agent loses discovery capability simultaneously. Your entire network flatlines. When you restart, all agents must re-register, creating a thundering herd problem. And since there’s no persistent storage, you’ve lost all historical data - no analytics, no audit trails, no debugging information.
The transition from in-memory to distributed storage isn’t just a database swap. You’re moving from single-node consistency to distributed consensus. Read queries that took microseconds now involve network round-trips. Write operations need coordination across multiple nodes. Failure modes multiply: network partitions, split-brain scenarios, and consistency-availability trade-offs.
This is where 90% of networks stall. The infrastructure that worked for 100 agents collapses under 1,000. You need distributed consensus systems like etcd or Consul, proper caching strategies, read replicas for query load, and careful thinking about consistency guarantees. Do you need strong consistency for every query? Or can discovery tolerate eventual consistency with millisecond-scale lag?
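One common way to tolerate eventual consistency is a read-through cache in front of the distributed store: discovery reads stay microsecond-fast, at the cost of results being stale by up to the TTL. A minimal sketch (names and the TTL value are illustrative):

```python
import time

# Read-through cache for discovery queries: serve from cache for `ttl`
# seconds, otherwise hit the backend (e.g. etcd/Consul or a read replica).
class CachedDiscovery:
    def __init__(self, backend_query, ttl=0.5):
        self.backend_query = backend_query
        self.ttl = ttl
        self._cache = {}  # query -> (expires_at, result)

    def discover(self, query):
        now = time.monotonic()
        entry = self._cache.get(query)
        if entry and entry[0] > now:
            return entry[1]  # stale by up to `ttl`, but no network round-trip
        result = self.backend_query(query)  # round-trip to the distributed store
        self._cache[query] = (now + self.ttl, result)
        return result

calls = []
def slow_backend(query):
    calls.append(query)
    return [f"agent-for-{query}"]

d = CachedDiscovery(slow_backend, ttl=60)
d.discover("pdf")
d.discover("pdf")
print(calls)  # ['pdf'] - the backend was hit only once
```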
The b4rega2a project represents a more complete approach to these challenges, moving beyond simple FastAPI examples to production-grade architecture. But even sophisticated registries face a deeper question: how do agents actually discover each other?
Part 2: Discovery Mechanisms and the MCP Revolution
Four Patterns for Agent Discovery
Central registries aren’t the only game in town. In practice, agent discovery happens through four distinct patterns, each with different trade-offs:
1. DNS-Based Discovery
The simplest approach leverages existing internet infrastructure. Agents publish SRV and TXT records in DNS, advertising their services and capabilities. This provides baseline trust through domain ownership - if you control the domain, you control the agent registration.
The advantage: DNS is battle-tested, globally distributed, and built into every network stack. The limitation: DNS wasn’t designed for rich metadata or semantic search. You get location and connection information, but capability matching requires additional layers.
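Because TXT records carry only flat strings, capability metadata is typically packed as key=value pairs, the convention DNS-SD uses. There is no agreed standard for agent records yet, so the format below is purely illustrative:

```python
# Parse a hypothetical agent TXT record. The key=value convention follows
# DNS-SD; the specific keys (caps, proto, path) are illustrative only.
def parse_agent_txt(txt):
    """Parse a TXT record like 'caps=pdf,invoice proto=a2a path=/rpc'."""
    fields = {}
    for pair in txt.split():
        key, _, value = pair.partition("=")
        fields[key] = value.split(",") if "," in value else value
    return fields

record = "caps=pdf,invoice proto=a2a path=/.well-known/agent.json"
print(parse_agent_txt(record))
# {'caps': ['pdf', 'invoice'], 'proto': 'a2a', 'path': '/.well-known/agent.json'}
```

This illustrates the limitation: you can advertise a capability list, but semantic matching ("find agents that understand invoices") needs a layer above DNS.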
2. Registry API Discovery
This is the central registry pattern we explored earlier. Agents query a RESTful API, filtering by capabilities, tags, and metadata. It’s flexible, allowing complex queries like “Find agents with high-availability SLAs, US-East deployment, and Salesforce integration.”
The registry becomes the coordination point - and the bottleneck. Everything depends on registry availability, and query latency affects your entire network. But the flexibility and rich metadata support make this the most common pattern for enterprise deployments.
3. Well-Known URLs
Standardizing on /.well-known/agent.json enables automatic discovery without central coordination. If you know an agent’s domain, you can fetch its Agent Card directly. This decentralized approach scales beautifully - there’s no central bottleneck.
The challenge: you need some way to discover agent domains in the first place. This pattern works best combined with other discovery methods, or in environments where agent domains are already known (like within an organization).
4. Dynamic Tool Discovery Through MCP
This is where things get interesting. Rather than static registration, agents enumerate capabilities at runtime through protocol negotiation. An agent announces, “I have these tools available right now,” and other agents can invoke them immediately.
This pattern handles the reality that agent capabilities change constantly. New tools get deployed, services scale up or down, and maintenance windows affect availability. Static registration can’t keep up - you need dynamic, runtime discovery.
The Agent-to-Agent (A2A) Protocol
A2A is the primary protocol for direct agent communication. When Agent A needs Agent B to perform a task, they communicate via A2A - this is the standard, the default, the expected path.
A2A uses JSON-RPC 2.0 over HTTPS as its communication standard. This choice is deliberate: JSON-RPC provides a lightweight, language-agnostic protocol that’s simpler than REST while more structured than raw JSON. The 2.0 specification adds proper error handling, batch requests, and notification patterns (fire-and-forget messages).
Why JSON-RPC over HTTPS?
- Simplicity: Single endpoint, method-based routing, clear request/response structure
- Security: HTTPS provides transport encryption, and OAuth tokens can be passed in headers
- Bidirectionality: Both request-response and notification patterns supported
- Tool compatibility: JSON-RPC integrates cleanly with existing API ecosystems
An A2A interaction looks like this:
POST /rpc HTTP/1.1
Host: agent.example.com
Content-Type: application/json
Authorization: Bearer <token>
{
"jsonrpc": "2.0",
"method": "process_document",
"params": {
"document_url": "https://...",
"workflow_id": "wf-12345"
},
"id": 1
}

The agent processes the request and returns:
{
"jsonrpc": "2.0",
"result": {
"status": "completed",
"output_url": "https://...",
"metadata": {...}
},
"id": 1
}

Agent Cards as A2A Discovery Documents
The Agent Card we discussed earlier is part of the A2A specification. Served at /.well-known/agent.json, it advertises which JSON-RPC methods the agent supports, authentication requirements, and operational metadata.
This creates the standard discovery-to-invocation flow:
- Query registry for agents with required capabilities
- Fetch Agent Card from /.well-known/agent.json
- Verify the agent supports needed JSON-RPC methods
- Establish connection with proper authentication
- Communicate via A2A JSON-RPC
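The steps above can be sketched as a small client. The fetcher is injectable so the flow is testable without a live agent; the `methods` field on the card and the function names are assumptions for illustration.

```python
import json
from urllib import request

# Sketch of the discovery-to-invocation flow above. `fetch` is injectable so
# the logic runs without a network; card fields are illustrative.
def default_fetch(url, data=None, headers=None):
    req = request.Request(url, data=data, headers=headers or {})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

def invoke_agent(domain, method, params, token, fetch=default_fetch):
    # Fetch the Agent Card from the well-known URL
    card = fetch(f"https://{domain}/.well-known/agent.json")
    # Verify the agent supports the needed JSON-RPC method
    if method not in card.get("methods", []):
        raise ValueError(f"{domain} does not support {method}")
    # Authenticate and call via A2A JSON-RPC
    payload = {"jsonrpc": "2.0", "method": method, "params": params, "id": 1}
    return fetch(
        card["endpoints"]["a2a"],
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
    )

# Usage with a stubbed fetch (no network):
stub = {
    "https://agent.example.com/.well-known/agent.json": {
        "methods": ["process_document"],
        "endpoints": {"a2a": "https://agent.example.com/rpc"},
    },
    "https://agent.example.com/rpc": {
        "jsonrpc": "2.0", "result": {"status": "completed"}, "id": 1},
}
reply = invoke_agent("agent.example.com", "process_document",
                     {"document_url": "https://example.com/doc.pdf"}, "token",
                     fetch=lambda url, data=None, headers=None: stub[url])
print(reply["result"]["status"])  # completed
```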
A2A is your primary communication protocol. It’s designed specifically for agent-to-agent interaction, it’s well-specified, and it handles the common cases beautifully.
The Model Context Protocol for Registry Discovery
While A2A handles agent-to-agent communication, we still need a way for agents to discover what’s available in the registry. This is where MCP comes in - not as a replacement for A2A, but as the “USB-C port” for talking to discovery infrastructure.
MCP’s architecture is elegant for this purpose: a client-server model using JSON-RPC 2.0 over various transports. Agents act as MCP clients, connecting to the registry (an MCP server) to discover available agents and capabilities.
The protocol defines four core primitives, but for registry discovery, we primarily care about:
Resources: The registry exposes agent metadata as resources. Think of resources as read-only data sources that agents can query. Each registered agent becomes a resource addressable by URI.
Tools: Discovery operations exposed as invocable functions. Tools like search_agents(), get_agent_details(), and filter_by_capability() allow agents to programmatically discover what’s available.
Prompts: Templates for common discovery patterns. For example, a prompt might encode “Find all agents in US-East with high availability SLAs and financial data processing capabilities.”
MCP’s Role: Registry Discovery, Not Agent Communication
Here’s the critical architectural point: MCP is for talking to the registry, A2A is for talking to agents.
The flow works like this:
Agent A → MCP → Registry → Returns Agent B's details
Agent A → A2A → Agent B (direct communication)

An agent uses MCP to query the registry: “Show me agents that can process invoices.” The registry returns metadata about matching agents, including their A2A endpoints. The agent then switches to A2A to actually communicate with the discovered agents.
Why this separation matters:
Protocol efficiency: A2A is optimized for agent-to-agent interaction with specific patterns for task delegation, status updates, and result handling. MCP is optimized for resource discovery and tool enumeration.
Security boundaries: The registry has different authentication and authorization requirements than individual agents. MCP to the registry might use service-level credentials, while A2A between agents uses agent-specific authentication.
Scaling characteristics: Registry queries benefit from MCP’s subscription model (get notified when new agents register). Direct agent communication needs A2A’s request-response patterns.
Fallback flexibility: If an agent doesn’t support A2A or if you need more complex tool interactions, you can fall back to MCP for direct agent communication. But A2A should be your default.
MCP’s Killer Feature: Runtime Discovery
The revolutionary aspect of MCP for registry discovery is list_tools() and resource subscriptions. An agent connects to the registry’s MCP server and asks, “What discovery operations can I perform?” The registry responds with available tools for searching, filtering, and monitoring agents.
Even more powerful: agents can subscribe to registry resources. When new agents register or existing agents update their capabilities, subscribed agents receive notifications in real-time. This enables dynamic adaptation without polling.
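On the wire, both operations are ordinary JSON-RPC 2.0 messages. Method names below follow the MCP spec (`tools/list`, `resources/subscribe`), but check the current spec revision before relying on exact shapes; the `registry://` URI scheme is an assumption for illustration.

```python
import itertools
import json

# Sketch of the MCP messages involved in runtime discovery. Method names
# follow the MCP spec; the resource URI scheme is illustrative.
_ids = itertools.count(1)

def mcp_request(method, params=None):
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg

# Ask the registry's MCP server which discovery operations it exposes
list_tools = mcp_request("tools/list")

# Subscribe to updates for a registry resource (URI scheme is hypothetical)
subscribe = mcp_request("resources/subscribe",
                        {"uri": "registry://agents?capability=invoice-processing"})

print(json.dumps(list_tools))
print(json.dumps(subscribe))
```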
This unlocks dynamic discovery networks:
AI Agent → MCP Client → Registry (MCP Server) → Agent Metadata
↓
A2A Connection → Discovered Agent B

An agent queries the registry via MCP for “agents that can process invoices.” The registry returns MCP resources describing matching agents, including their A2A endpoints. The agent connects to those agents using A2A for actual work delegation.
The AgentDiscoveryAgent Pattern
The AgentDiscoveryAgent isn’t a specific technology - it’s an architectural pattern for how agents discover and utilize other agents dynamically. The key insight: use the right protocol for each layer.
Consider a research agent that needs to analyze financial documents. Rather than being hardcoded to specific agents, it follows this pattern:
- Query the registry via MCP with semantic search: “Find agents for extracting tables from PDFs with compliance audit trails”
- Receive agent metadata as MCP resources - descriptions, capabilities, A2A endpoints, authentication requirements
- Select appropriate agents based on current availability, performance characteristics, and task requirements
- Connect via A2A using the endpoints from the registry metadata
- Delegate work through A2A JSON-RPC calls: “process_document”, “extract_tables”, etc.
- Monitor for registry changes through MCP subscriptions - if new capable agents register, adapt in real-time
This pattern enables true late binding of agent capabilities. The agent decides at runtime which agents to work with based on current availability and task requirements. No static dependencies. No brittle integrations.
When to fall back to MCP for agent communication:
While A2A is the default, there are scenarios where MCP makes sense for agent-to-agent interaction:
- Legacy agents that only support MCP
- Complex resource sharing where one agent needs streaming access to another agent’s data
- Tool-like interactions where you’re really invoking specific functions rather than delegating complete tasks
- Simplified development in early prototyping phases before implementing full A2A support
But these should be exceptions. For production agent networks, standardize on A2A for agent-to-agent communication and reserve MCP for registry discovery and specialized tool interactions.
Real-Time Registry Updates
Static registries go stale. By the time you query them, reality has changed. MCP solves this through event-driven notifications at the registry level.
When an agent registers, updates its capabilities, or goes offline, the registry (acting as an MCP server) notifies subscribed clients. Agents can subscribe to specific queries: “Notify me when new agents register with invoice processing capabilities.” When such an agent appears, subscribers receive real-time updates.
This matters for dynamic team reconfiguration. Imagine a coordinator agent managing a complex task. It subscribes to relevant agent capabilities in the registry via MCP. As specialized agents become available (perhaps another team just deployed a new service), the coordinator receives a notification and can immediately incorporate that capability. The team adapts without manual intervention.
The complete flow in action:
1. Agent A subscribes to MCP registry: "invoice processing agents"
2. Registry monitors for matching registrations
3. Agent B registers with invoice processing capability
4. Registry sends MCP notification to Agent A
5. Agent A fetches Agent B's details via MCP
6. Agent A establishes A2A connection to Agent B
7. Work delegation proceeds via A2A JSON-RPC

The subscription happens once via MCP. The registry push notifications come via MCP. But the actual work - the task delegation, status updates, result handling - all flows through A2A.
This architectural split gives you the best of both worlds: dynamic, event-driven discovery through MCP, and efficient, standardized agent communication through A2A.
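The notification side of that flow can be sketched in pure Python: the registry matches new registrations against standing subscriptions and pushes a callback with the newcomer's A2A endpoint. Class names and the capability-matching rule are illustrative.

```python
# Sketch of the registry-side notification flow: a registration that matches
# a standing subscription triggers an MCP-style push. Names are illustrative.
class NotifyingRegistry:
    def __init__(self):
        self.agents = {}
        self.subscriptions = []  # (capability, callback) pairs

    def subscribe(self, capability, callback):
        # Step 1: Agent A registers interest via MCP
        self.subscriptions.append((capability, callback))

    def register(self, agent_id, capabilities, a2a_endpoint):
        # Steps 2-4: a matching registration triggers push notifications
        self.agents[agent_id] = {"capabilities": capabilities,
                                 "a2a_endpoint": a2a_endpoint}
        for capability, callback in self.subscriptions:
            if capability in capabilities:
                callback(agent_id, self.agents[agent_id])

events = []
registry = NotifyingRegistry()
registry.subscribe("invoice-processing",
                   lambda aid, meta: events.append((aid, meta["a2a_endpoint"])))
registry.register("agent-b", ["invoice-processing"], "https://b.example.com/rpc")
print(events)  # [('agent-b', 'https://b.example.com/rpc')]
```

From here, the subscriber (Agent A) would open an A2A connection to the endpoint it received; steps 5-7 happen outside the registry.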
Part 3: Production Patterns and Implementation Realities
Security Isn’t Optional
Everything we’ve discussed so far assumes a friendly environment. Production agents operate in hostile territory. You need security at every layer.
OAuth Integration handles authentication between agents and registries. The typical flow: the agent requests an access token from an identity provider, presents the token to the registry, and the registry validates it and authorizes specific operations. OAuth 2.0 remains the standard, though the OAuth 2.1 draft consolidates and simplifies some of its flows. Critical consideration: token refresh strategies. Agents run for days or weeks - your tokens need automatic renewal without manual intervention.
Mutual TLS provides service-to-service authentication and encryption. Unlike web browsers where only the server presents a certificate, both agents prove their identities through certificates. This establishes cryptographic trust: both parties know whom they’re talking to, and all traffic is encrypted.
Certificate management becomes critical. You need infrastructure for issuing, rotating, and revoking certificates. Certificate expiration can’t bring down your agent network. Automate everything.
Fine-Grained Permissions determine what agents can actually do. Not every agent should access every tool. You need permissions at multiple levels: which registries an agent can query, which MCP servers it can connect to, which specific tools it can invoke, and which methods within those tools it can call.
In enterprise environments, integrate with existing identity providers (LDAP, Active Directory, Okta). Use SAML or OIDC flows. This provides single sign-on and centralized access control. Agents inherit organizational permissions rather than requiring separate management.
The Multiple Context Problem
Here’s a challenge that kills sophisticated agent architectures: agents lose their memory.
The problem manifests in four ways:
Context Discontinuity: Each interaction starts from zero. The agent from yesterday can’t remember what you discussed. This forces users to repeat information constantly.
Window Limitations: LLMs have finite context windows. Even within a single interaction, older information falls off the edge. Long conversations lose coherence.
Fragmentation: In multi-agent systems, context scatters across different agents. Agent A knows part of the story, Agent B knows another part, but nobody has the complete picture. Coordination breaks down.
Temporal Gaps: Agents can’t build on previous work. Today’s agent doesn’t benefit from yesterday’s learning. You can’t say, “Remember that analysis you did last week? Let’s extend it.”
MCP provides a framework for solving these problems through standardized context storage. Rather than context living only in the LLM’s memory, it persists in external systems that agents access through MCP resources.
Hierarchical Context Management
The solution isn’t dumping everything into one giant context. That’s computationally expensive and semantically noisy. Instead, implement hierarchical context layers:
System Context: Global information available to all agents - organizational policies, security requirements, standard operating procedures.
Organization Context: Department or team-level information - project backgrounds, stakeholder lists, relevant history.
Team Context: Information shared within a specific agent team working together - current task, delegation structure, intermediate results.
Agent Context: Individual agent state - what this specific agent is working on, its reasoning history, tools it has access to.
Session Context: The immediate conversation or task - recent exchanges, user intent, transient state.
Each layer has different persistence characteristics. System context rarely changes and can be cached aggressively. Session context is ephemeral and might only live in memory. The trick is routing queries to appropriate layers and merging results efficiently.
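The "nearest layer wins" lookup maps naturally onto Python's `ChainMap`: queries check the most specific layer first and fall back outward. The keys below are illustrative; in practice each layer would be backed by different storage with different persistence characteristics.

```python
from collections import ChainMap

# Sketch of layered context lookup mirroring the hierarchy above.
# Keys and values are illustrative.
system_ctx = {"policy": "pii-redaction-required", "region": "us-east"}
org_ctx = {"project": "q3-audit", "stakeholders": ["finance", "legal"]}
team_ctx = {"task": "extract invoice tables", "coordinator": "agent-lead"}
agent_ctx = {"tools": ["pdf_extract", "table_parse"]}
session_ctx = {"workflow_id": "wf-12345", "task": "extract page 4 only"}

# Most specific layer first: session overrides team, team overrides org, etc.
context = ChainMap(session_ctx, agent_ctx, team_ctx, org_ctx, system_ctx)

print(context["task"])    # session-level value shadows the team-level one
print(context["policy"])  # falls through to system context
```

Writes through a `ChainMap` go to the first mapping by default, which matches the intuition that new state lands in session context unless explicitly promoted to a longer-lived layer.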
Workflow IDs: The Correlation Key
How do you tie context together across time and agents? Workflow IDs provide the answer.
When a user initiates a complex task, generate a unique workflow identifier. This ID flows through every interaction, every agent invocation, every tool call. Agents log using the workflow ID. Context storage keys on it. Monitoring systems track it.
This enables powerful capabilities:
- Resume workflows after interruption: “Continue from where we left off”
- Debug across agent boundaries: Follow the workflow ID through distributed logs
- Build on previous work: “Take that report from workflow-12345 and extend it”
- Correlate related activities: See how different agents contributed to the same goal
The implementation is straightforward: generate UUIDs, pass them as parameters, and log them everywhere. The discipline pays enormous dividends for debugging and coordination.
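A minimal sketch of that discipline, using a context variable so the workflow ID propagates implicitly and a logging filter so every log line carries it (names are illustrative):

```python
import logging
import uuid
from contextvars import ContextVar

# Sketch: generate a workflow ID once, carry it in a ContextVar, and stamp
# it on every log record via a logging filter.
workflow_id = ContextVar("workflow_id", default="-")

class WorkflowFilter(logging.Filter):
    def filter(self, record):
        record.workflow_id = workflow_id.get()
        return True

logging.basicConfig(format="%(workflow_id)s %(message)s", level=logging.INFO)
logging.getLogger().handlers[0].addFilter(WorkflowFilter())
log = logging.getLogger("agents")

def start_workflow():
    wf = f"wf-{uuid.uuid4()}"
    workflow_id.set(wf)
    return wf

wf = start_workflow()
log.info("delegating to invoice agent")  # line is prefixed with the workflow ID
```

The same ID would then be passed as a JSON-RPC parameter on every A2A call (as in the `workflow_id` field of the earlier request example) and used as a key in context storage.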
Enterprise Integration Patterns
Production agent networks don’t exist in isolation. They integrate with existing enterprise infrastructure.
Identity Provider Integration: Agents need to authenticate as organizational entities, not isolated services. Integrate with LDAP or Active Directory for user authentication. Use service principals for agent-to-agent communication. Implement SAML or OIDC flows for web-based agent interfaces.
Compliance and Governance: Regulation requires audit trails. Every agent action needs logging: who invoked what tool, with what data, when, and what was the result. Implement immutable audit logs. Support data residency requirements (some data can’t leave specific regions). Enable access reviews (periodic verification of who has access to what).
Multi-Tenancy: Enterprise agent platforms serve multiple teams or customers. Implement isolation so one tenant’s agents can’t access another tenant’s data or capabilities. Enforce resource quotas to prevent one tenant from consuming all infrastructure. Guarantee performance isolation - one tenant’s load doesn’t slow down others.
These aren’t afterthoughts. They’re architectural requirements that fundamentally shape your design.
Implementation Best Practices and Pitfalls
Start small, architect for scale: Deploy your first version with 10 agents. But design the architecture for 10,000. This means choosing data stores that scale horizontally, using caching strategies that work at high request rates, and implementing monitoring that surfaces bottlenecks before they crash your system.
Monitoring and observability: You can’t operate what you can’t see. Instrument everything:
- Registry query latency: p50, p95, p99 response times
- Discovery success rate: What percentage of queries find appropriate agents?
- Registration health: How many agents are online vs. total registered?
- Federation status: For cross-registry setups, are federated queries working?
Implement distributed tracing using OpenTelemetry and Jaeger. When a user query touches five different agents and ten tools, you need to see the entire flow. Where did it slow down? Which component failed?
Common pitfalls that kill agent networks:
Over-engineering initially: Don’t build for a million agents when you have ten. Start with simple, proven patterns. Add complexity only when you hit real scaling limits.
Ignoring latency budgets: Discovery adds overhead to every agent interaction. If registry queries take 500ms, and your agent needs to discover five services, that’s 2.5 seconds before real work begins. Design for millisecond-scale discovery latency.
Context hoarding: Don’t try to fit everything into context windows. Use hierarchical storage with smart retrieval. Most information can stay in external storage, loaded only when relevant.
Manual processes: Agent registration, approval workflows, certificate management - if it’s manual, it won’t scale. Automate from day one.
Code References and Implementation Resources
The practical implementation patterns discussed here draw from multiple sources:
The FastAPI registry implementation demonstrates basic patterns at https://dev.to/sreeni5018/building-an-ai-agent-registry-server-with-fastapi-enabling-seamless-agent-discovery-via-a2a-15dj - excellent for understanding core concepts, though remember the in-memory storage limitations.
The b4rega2a project at https://codeberg.org/b4mad/b4rega2a provides a more complete implementation with production considerations. Study its architecture decisions around persistence, health checking, and API design.
The Model Context Protocol documentation at https://modelcontextprotocol.io/ is essential reading. The specifications for resources, tools, prompts, and sampling will guide your MCP integration.
For context management patterns, Christian Posta’s analysis at https://blog.christianposta.com/understanding-sessions-in-agent-to-agent-communication/ provides deep insights into the multiple context problem and workflow ID implementations.
References and Further Reading
This article draws from multiple sources across the emerging agent infrastructure ecosystem. Here are the key references used, organized by topic:
Agent Registry and Discovery Infrastructure
Building an AI Agent Registry Server with FastAPI
https://dev.to/sreeni5018/building-an-ai-agent-registry-server-with-fastapi-enabling-seamless-agent-discovery-via-a2a-15dj
Practical implementation guide demonstrating core registry patterns with FastAPI, including registration, discovery APIs, and heartbeat mechanisms. Excellent starting point for understanding basic registry architecture.
b4rega2a - B4mad Registry for Agent-to-Agent Communication
https://codeberg.org/b4mad/b4rega2a
Production-grade registry implementation with v1.3.0 deployed at https://agents-prod.b4mad.industries/. Demonstrates architecture decisions for persistence, health checking, and scalable API design beyond simple in-memory implementations.
Agent Discovery, Naming and Resolution - The Missing Pieces to A2A
https://www.solo.io/blog/agent-discovery-naming-and-resolution---the-missing-pieces-to-a2a
Comprehensive analysis of infrastructure gaps in agent networks, explaining why discovery, naming services, and gateways are critical for scaling beyond simple deployments. Draws parallels to service mesh evolution.
Protocol Specifications
Model Context Protocol (MCP) Documentation
https://modelcontextprotocol.io/
Official MCP specification covering the four core primitives (Resources, Tools, Prompts, Sampling), JSON-RPC communication patterns, and client-server architecture. Essential reading for implementing MCP integration.
Agent-to-Agent Protocol References
The A2A protocol documentation referenced throughout, covering JSON-RPC 2.0 over HTTPS patterns, Agent Card specifications at /.well-known/agent.json, and authentication flows. Key source for understanding direct agent communication standards.
Context Management and Sessions
Understanding Sessions in Agent-to-Agent Communication
https://blog.christianposta.com/understanding-sessions-in-agent-to-agent-communication/
Deep analysis of the multiple context problem in agent networks - context discontinuity, window limitations, fragmentation across agents, and temporal gaps. Provides architectural patterns for workflow IDs and context correlation.
Reliability Engineering
Engineering Reliable AI Agents
https://slavakurilyak.com/posts/reliable-agents/
Framework for building production-ready agents through three pillars: verifiable reasoning, tool use and schema adherence, and constraint following. Emphasizes that reliable agents are engineered, not found.
Team Coordination Frameworks
Agno Documentation - Teams
https://docs.agno.com/concepts/teams/introduction
Introduction to dynamic team coordination modes (Route, Collaborate, Coordinate) and runtime reconfiguration patterns. Shows how pure Python teams achieve 10,000x faster instantiation than alternatives.
Agno Documentation - Workflows
https://docs.agno.com/concepts/workflows/overview
Workflow patterns for integrating teams into larger systems, session persistence, and context sharing across team members. Demonstrates how to build production agent systems with dynamic adaptation.
Related Infrastructure Projects
Additional context was drawn from the broader ecosystem of agent infrastructure projects, service mesh patterns (Envoy, Istio), distributed consensus systems (etcd, Consul), and observability standards (OpenTelemetry, Jaeger).
Conclusion: Building on Solid Foundations
The technical foundations for agent discovery are emerging, but they’re still being actively shaped. We’ve moved from pure theory to working code, from hypothetical architectures to production deployments with real scaling challenges.
The key insights:
- Multiple discovery patterns coexist: DNS, registries, well-known URLs, and MCP each solve different problems
- MCP changes the game: Runtime discovery without hardcoded integrations is revolutionary
- Context is critical: Solve the multiple context problem or watch coordination collapse
- Security isn’t optional: Authentication, authorization, and audit trails must be built in from day one
- Plan for scale: Architecture decisions at 10 agents determine whether you can reach 10,000
In the next article, we’ll explore how these discovery mechanisms enable dynamic team coordination - how agents form teams, reconfigure on the fly, and achieve reliability in production environments. Because discovering agents is just the beginning. Making them work together reliably is where the real engineering challenge lies.
The infrastructure layer is becoming clearer. Now we need to build reliable teams on top of it.