Technical Writeup · Roster MCP Server · Companion to the PRD and Tool Specs

MCP Server & OAuth — Implementation Writeup

The infrastructure half of the Roster MCP project: the remote MCP server's behavior (transport, response envelope, pagination, errors), the OAuth 2.1 authorization service, the credential bridge into Roster's existing token model, hosting/deployment on Cloudflare Workers, and observability. The tool catalog itself lives in tool-specs.html.

Author: Jeff Poulton · Created: 2026-06-10 · Status: Draft · PRD: prd.html · Tool specs: tool-specs.html

Architecture & Topology

One Cloudflare Worker at a single hostname serves both the OAuth authorization server and the MCP server. It is a normal client of the existing v2 API (api.getroster.com) — no database access, no shard awareness — so it can be disabled at any time without touching the portal.

mcp.getroster.com  (single Cloudflare Worker, custom domain on Roster's existing CF account)
├── GET  /.well-known/oauth-authorization-server   RFC 8414 metadata (advertises S256)
├── GET  /.well-known/oauth-protected-resource     RFC 9728 metadata
├── GET  /authorize                                consent UI (login + brand picker)
├── POST /token                                    form-encoded; code+PKCE, refresh rotation
├── POST /register                                 Dynamic Client Registration (RFC 7591)
└── POST /mcp                                      MCP Streamable HTTP (2025-11-25 spec)

One hostname keeps the resource-server and AS metadata trivially consistent — same-host is the pattern Linear and Sentry shipped. DNS: one CNAME/Worker custom domain; TLS automatic via Cloudflare.
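
To make the same-host point concrete, here is a minimal sketch of the two metadata documents derived from a single origin. It is illustrative only: in the deployed Worker the library generates these documents, the field set shown is a subset, and using the /mcp URL as the resource identifier is an assumption. The function names are hypothetical.

```typescript
// Illustrative shapes only: the real documents come from the OAuth library.
interface AuthServerMetadata {
  issuer: string;
  authorization_endpoint: string;
  token_endpoint: string;
  registration_endpoint: string;
  response_types_supported: string[];
  grant_types_supported: string[];
  code_challenge_methods_supported: string[];
}

interface ProtectedResourceMetadata {
  resource: string;
  authorization_servers: string[];
  bearer_methods_supported: string[];
}

// RFC 8414 document served at /.well-known/oauth-authorization-server.
function authServerMetadata(origin: string): AuthServerMetadata {
  return {
    issuer: origin,
    authorization_endpoint: origin + "/authorize",
    token_endpoint: origin + "/token",
    registration_endpoint: origin + "/register",
    response_types_supported: ["code"],
    grant_types_supported: ["authorization_code", "refresh_token"],
    code_challenge_methods_supported: ["S256"], // S256 only, no "plain"
  };
}

// RFC 9728 document served at /.well-known/oauth-protected-resource.
// Assumption: the resource identifier is the /mcp endpoint URL.
function protectedResourceMetadata(origin: string): ProtectedResourceMetadata {
  return {
    resource: origin + "/mcp",
    authorization_servers: [origin],
    bearer_methods_supported: ["header"],
  };
}

// With one hostname, consistency reduces to both documents sharing an origin.
function metadataIsConsistent(
  asMeta: AuthServerMetadata,
  prMeta: ProtectedResourceMetadata
): boolean {
  return (
    prMeta.authorization_servers.indexOf(asMeta.issuer) !== -1 &&
    prMeta.resource.indexOf(asMeta.issuer + "/") === 0
  );
}
```

Because both documents are built from the same origin string, there is no second hostname to drift out of sync.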

Verified against current vendor docs 2026-06-10

Sources: workers-oauth-provider README (latest release v0.7.2, 2026-06-04 — actively maintained), Cloudflare MCP authorization guide, and the McpAgent API docs. All library facts in this document reflect those docs as of 2026-06-10.

MCP Server Behavior

2.1 Transport & protocol requirements (from the PRD)

Streamable HTTP at POST /mcp per the 2025-11-25 MCP spec revision; no SSE-only legacy transport. Publicly reachable over HTTPS, no IP allowlist.
Unauthenticated requests → 401 + WWW-Authenticate pointing at the RFC 9728 protected-resource metadata.
Token audience validation (RFC 8707) — reject tokens issued for any other resource.
Every tool declares title, an LLM-audience description, readOnlyHint: true, and openWorldHint: false. No tool may perform a write in Phase 1.
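
The requirements above can be sketched as three small checks. This is a minimal sketch under stated assumptions, not the Worker's actual code: the function names and the `audience` input shape are hypothetical, and in practice the OAuth library performs token validation. The `resource_metadata` parameter on the challenge is the mechanism RFC 9728 defines for pointing a client at the metadata document.

```typescript
interface HttpChallenge {
  status: number;
  headers: { [name: string]: string };
}

// Builds the 401 returned for an unauthenticated POST /mcp.
function unauthorizedChallenge(origin: string): HttpChallenge {
  const metadataUrl = origin + "/.well-known/oauth-protected-resource";
  return {
    status: 401,
    headers: {
      "WWW-Authenticate": 'Bearer resource_metadata="' + metadataUrl + '"',
    },
  };
}

// RFC 8707 audience check: accept a token only if it was issued for this
// resource. `audience` stands in for whatever the validation layer reports
// (hypothetical input shape).
function audienceMatches(
  audience: string | string[] | undefined,
  resource: string
): boolean {
  if (audience === undefined) return false;
  const list = typeof audience === "string" ? [audience] : audience;
  return list.indexOf(resource) !== -1;
}

// Subset of an MCP tool declaration relevant to the Phase-1 rule.
interface ToolDeclaration {
  name: string;
  title?: string;
  description?: string;
  annotations?: { readOnlyHint?: boolean; openWorldHint?: boolean };
}

// Returns the list of Phase-1 rule violations for one tool (empty = compliant).
function phase1Violations(tool: ToolDeclaration): string[] {
  const problems: string[] = [];
  if (!tool.title) problems.push("missing title");
  if (!tool.description) problems.push("missing description");
  if (!tool.annotations || tool.annotations.readOnlyHint !== true) {
    problems.push("readOnlyHint must be true");
  }
  if (!tool.annotations || tool.annotations.openWorldHint !== false) {
    problems.push("openWorldHint must be false");
  }
  return problems;
}
```

A check like `phase1Violations` could run in CI over the tool catalog so that a write-capable tool cannot ship by accident.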

2.2 Response envelope (every tool)

{
  "brand": { "name": "Acme Outdoor", "domain": "acme" },   // from the grant — always present
  "portal_source": {                                        // report/dashboard tools only
    "surface": "Sales Attribution report",
    "url": "https://app.getroster.com/reports/sales-attribution",
    "date_range": { "start": "2026-05-10", "end": "2026-06-09" }
  },
  "data": { /* tool-specific payload — see tool-specs.html */ },
  "pagination": { "cursor": "opaque-base64", "has_more": true, "total_records": 1234 },
  "truncated": false      // true + guidance text when result was capped
}
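
A possible TypeScript typing of this envelope, as a sketch. Two points are assumptions rather than facts from the spec: `pagination` is modeled as optional, and `truncation_guidance` is a hypothetical field name for the guidance text that accompanies `truncated: true`.

```typescript
interface BrandRef { name: string; domain: string }
interface DateRange { start: string; end: string }
interface PortalSource { surface: string; url: string; date_range: DateRange }
interface Pagination { cursor: string | null; has_more: boolean; total_records: number }

interface ToolEnvelope<T> {
  brand: BrandRef;               // from the grant, always present
  portal_source?: PortalSource;  // report/dashboard tools only
  data: T;                       // tool-specific payload
  pagination?: Pagination;       // assumption: omitted by non-paginated tools
  truncated: boolean;
  truncation_guidance?: string;  // hypothetical field name
}

interface EnvelopeOptions {
  portalSource?: PortalSource;
  pagination?: Pagination;
  truncationGuidance?: string;
}

// Builds an envelope; passing guidance text marks the result as truncated.
function buildEnvelope<T>(
  brand: BrandRef,
  data: T,
  options: EnvelopeOptions = {}
): ToolEnvelope<T> {
  const envelope: ToolEnvelope<T> = {
    brand,
    data,
    truncated: options.truncationGuidance !== undefined,
  };
  if (options.portalSource) envelope.portal_source = options.portalSource;
  if (options.pagination) envelope.pagination = options.pagination;
  if (options.truncationGuidance !== undefined) {
    envelope.truncation_guidance = options.truncationGuidance;
  }
  return envelope;
}
```

Routing every tool through one builder keeps `brand` and `truncated` from being forgotten on individual tools.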

Cursor mapping: the MCP layer encodes {pageIndex, pageSize, filters-hash} into an opaque cursor. Upstream pagination is index-based (1-based pageIndex/pageSize); never expose raw page indexes to the model.
Page caps: MCP default page_size 50, max 200 — well under the upstream max of 10,000 and Claude's ~150K-char tool-result limit. When capped, set truncated: true with guidance text ("narrow the date range or filters").
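
The cursor mapping and page caps could look roughly like the following. This is a sketch, not the shipped encoding: plain base64 over JSON is one possible opaque format, the function names are hypothetical, and `filtersHash` is assumed to be an ASCII string (for example a hex digest of the filter parameters) computed elsewhere.

```typescript
const DEFAULT_PAGE_SIZE = 50;
const MAX_PAGE_SIZE = 200;

interface CursorState {
  pageIndex: number;   // 1-based, as upstream expects
  pageSize: number;
  filtersHash: string; // assumed ASCII
}

// Applies the MCP-side default and maximum page size.
function clampPageSize(requested?: number): number {
  if (requested === undefined || !isFinite(requested) || requested < 1) {
    return DEFAULT_PAGE_SIZE;
  }
  return Math.min(Math.floor(requested), MAX_PAGE_SIZE);
}

function encodeCursor(state: CursorState): string {
  return btoa(JSON.stringify(state));
}

// Decodes a cursor and rejects it if it is malformed or was issued for a
// different filter set than the current request.
function decodeCursor(cursor: string, expectedFiltersHash: string): CursorState {
  let parsed: unknown;
  try {
    parsed = JSON.parse(atob(cursor));
  } catch (e) {
    throw new Error("Invalid cursor");
  }
  if (typeof parsed !== "object" || parsed === null) throw new Error("Invalid cursor");
  const s = parsed as Partial<CursorState>;
  const isPositiveInt = (n: unknown): boolean =>
    typeof n === "number" && n >= 1 && Math.floor(n) === n;
  if (!isPositiveInt(s.pageIndex) || !isPositiveInt(s.pageSize)) {
    throw new Error("Invalid cursor");
  }
  if ((s.pageSize as number) > MAX_PAGE_SIZE) throw new Error("Invalid cursor");
  if (s.filtersHash !== expectedFiltersHash) {
    throw new Error("Cursor does not match the current filters");
  }
  return {
    pageIndex: s.pageIndex as number,
    pageSize: s.pageSize as number,
    filtersHash: expectedFiltersHash,
  };
}

// Returns the cursor for the following page, or null on the last page.
function nextCursor(state: CursorState, totalRecords: number): string | null {
  if (state.pageIndex * state.pageSize >= totalRecords) return null;
  return encodeCursor({ ...state, pageIndex: state.pageIndex + 1 });
}
```

Binding the cursor to a filters hash means a model that changes filters mid-pagination gets an explicit error instead of silently mismatched pages.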

2.3 Dates & timezone (PRD Decision 5)

There is no brand-timezone field anywhere in Roster (verified — Global and shard schemas). The platform pattern is UTC storage with client-side resolution, and the MCP follows it: Claude (which knows the user's timezone) resolves relative ranges to explicit ISO dates before calling; every tool description instructs it to do so. Omitted date params default to UTC last-30-days, stated in portal_source.date_range. Where the upstream accepts DATETIMEOFFSET (program metrics), the offset passes through. Nothing is stored at consent; no server-side timezone state.
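
A minimal sketch of the default-range rule, assuming "last 30 days" means the UTC date 30 days before the end date through the end date (which matches the `date_range` shown in the envelope example). The function name and the handling of a single omitted parameter are assumptions.

```typescript
interface DateRange { start: string; end: string } // ISO yyyy-mm-dd, UTC

function toIsoDate(d: Date): string {
  return d.toISOString().slice(0, 10);
}

// Resolves the effective range for a tool call. Explicit dates pass through
// untouched; omitted ones are filled from the UTC clock.
function resolveDateRange(
  startDate: string | undefined,
  endDate: string | undefined,
  now: Date
): DateRange {
  const end = endDate !== undefined ? endDate : toIsoDate(now);
  if (startDate !== undefined) return { start: startDate, end };
  const d = new Date(end + "T00:00:00Z");
  d.setUTCDate(d.getUTCDate() - 30); // Date handles month rollover
  return { start: toIsoDate(d), end };
}
```

The resolved range is what would be echoed back in `portal_source.date_range`, so the model can see which default was applied.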

2.4 Error mapping

Upstream	MCP tool error (structured, `isError: true`)
401 / token invalid	"Connection expired or revoked — reconnect the Roster connector." (also triggers OAuth re-auth via 401 on the MCP HTTP layer when the MCP token is bad)
403	"This brand's plan does not include this feature."
429	"Roster rate limit reached — wait a moment and try again." (retryable; the MCP service must not auto-retry-storm)
5xx / timeout	"Roster API error — try again; if it persists, narrow the date range."
Validation (`success:false` w/ message)	Pass through the upstream `message`, prefixed with the failing parameter where known.
Date range > 366 days on report tools	Rejected MCP-side before calling upstream: "Date range exceeds 366 days — split the request."

Never surface raw upstream stack traces. Upstream report stored procedures run with 300s command timeouts, the same as Claude's 300s tool ceiling, so a slow report leaves no headroom; surface timeouts as narrow-the-range guidance.
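
The mapping table can be sketched as a classifier plus a wrapper that produces the structured tool error. This is a sketch under assumptions: the kind names and function names are hypothetical, the user-facing text is expected to come from the table above rather than being duplicated here, unmapped statuses fall through to the generic upstream failure, and the 366-day limit is interpreted as an inclusive day count.

```typescript
type ErrorKind =
  | "reconnect"
  | "plan_excluded"
  | "rate_limited"
  | "upstream_failure"
  | "validation";

interface UpstreamOutcome {
  status?: number;            // HTTP status, absent on timeout
  timedOut?: boolean;
  validationMessage?: string; // upstream `message` when success:false
}

// Picks the table row that applies to a failed upstream call.
function classifyUpstreamFailure(o: UpstreamOutcome): ErrorKind {
  if (o.timedOut) return "upstream_failure";
  if (o.status === 401) return "reconnect";
  if (o.status === 403) return "plan_excluded";
  if (o.status === 429) return "rate_limited"; // reported, never auto-retried
  if (o.validationMessage !== undefined) return "validation";
  return "upstream_failure"; // 5xx, plus anything unmapped (assumption)
}

interface ToolErrorResult {
  isError: true;
  content: { type: "text"; text: string }[];
}

// Wraps message text in the structured MCP tool-error shape.
function toToolError(text: string): ToolErrorResult {
  return { isError: true, content: [{ type: "text", text }] };
}

const MAX_RANGE_DAYS = 366;
const DAY_MS = 24 * 60 * 60 * 1000;

// MCP-side guard evaluated before any upstream call on report tools.
function exceedsMaxRange(start: string, end: string): boolean {
  const ms = Date.parse(end + "T00:00:00Z") - Date.parse(start + "T00:00:00Z");
  const inclusiveDays = Math.round(ms / DAY_MS) + 1;
  return inclusiveDays > MAX_RANGE_DAYS;
}
```

Keeping classification separate from message text lets the wording in the table change without touching control flow.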

Roster Token Model & Credential Bridge

A Roster login is user-level; API tokens are brand-level. The OAuth grant bridges the two: one brand per grant, chosen at consent, with a long-lived private ApiSession token held server-side and never exposed to the client. Everything below is verified against source (api-brand-portal, db-global) and the production Global DB (read-only, 2026-06-09).

3.1 Verified token facts

Private tokens are 30-year JWTs. CLIENT_API_TOKEN_EXPIRE_YEARS = 30 (UserAccessTokenService.cs:32); confirmed live — the newest AccessTypeId=3 rows in Global ApiSession are 288-char JWTs with ExpireDate exactly 30 years out. The bridge mints standard tokens, no special handling.
Rate limiting is already per-token. RateLimitService.VerifyRateLimiting keys its in-memory counter by access token (RateLimitService.cs:27-82); the limit value (default 100/interval, subscription item 149) is per brand. A bridged MCP token gets its own pool automatically and cannot starve a customer's existing integration token. Caveat: the counter is in-process (MemoryCache), so effective limits scale with app instances — pre-existing behavior.
Auth chain per request: bearer token → Global ApiSession lookup (indexed on AccessToken) → JWT claim validation → brand scoping via AccessToUserId (AuthorizationFilter.cs:58-133).
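
To illustrate the two rate-limiting consequences noted above (separate pools per token, and effective limits scaling with instance count), here is a toy model. It is not Roster's RateLimitService: the class is hypothetical, the fixed-window strategy is an assumption, and the interval length is a parameter because the source does not state it.

```typescript
// Toy model of an in-process, per-token counter (fixed window assumed).
class InProcessLimiter {
  private limit: number;
  private intervalMs: number;
  private windows = new Map<string, { startMs: number; count: number }>();

  constructor(limit: number, intervalMs: number) {
    this.limit = limit;
    this.intervalMs = intervalMs;
  }

  // Returns true if the call is allowed for this token at time nowMs.
  allow(token: string, nowMs: number): boolean {
    const w = this.windows.get(token);
    if (w === undefined || nowMs - w.startMs >= this.intervalMs) {
      this.windows.set(token, { startMs: nowMs, count: 1 });
      return true;
    }
    if (w.count >= this.limit) return false;
    w.count += 1;
    return true;
  }
}

// Counts how many of `calls` requests are allowed when they are spread
// round-robin over `instances` app instances, each with its own counter.
function allowedAcrossInstances(
  instances: number,
  limit: number,
  calls: number
): number {
  const limiters: InProcessLimiter[] = [];
  for (let i = 0; i < instances; i++) limiters.push(new InProcessLimiter(limit, 60000));
  let allowed = 0;
  for (let i = 0; i < calls; i++) {
    if (limiters[i % instances].allow("mcp-token", 0)) allowed += 1;
  }
  return allowed;
}
```

In this model a second token is unaffected when the first is exhausted, and two instances admit twice the nominal limit, which mirrors the caveat about in-process counters.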

3.2 Minting the bridged token

Replicate UserAccessTokenService.CreateAccessSessionToken (UserAccessTokenService.cs:70-200): get/create ApiClient for the brand user → ensure subscription rate-limit item ≥100 → get/create shard AccessRight (AccessId=3) → ApiSessionService.CreateToken (JWT; claims: accessToUserId, sessionUserId, rightId, accessId) → insert Global ApiSession with 30-year expiry. Build one new internal-only bridge endpoint in api-brand-portal that runs this flow (and its reverse, ExpireAccessToken) so the OAuth service never reimplements token logic. Authenticate Worker→Roster with a shared service secret + WAF rules.

3.3 Flagging MCP-issued tokens

(a — recommended) new nullable SourceTypeId column on Global ApiSession (+ lookup value API_SESSION_SOURCE_MCP): zero impact on AuthorizationFilter/rate-limit code paths; trivially queryable for the portal's Connected-Apps list and usage metrics.
(b) new AccessTypeId (e.g. ACCESS_TO_MCP_API): cleaner separation, but every Open API endpoint validates against an accessIds list, so it must be added everywhere ACCESS_TO_CLIENT_API is accepted — more invasive, easy to miss a path.

This flag (plus the grant store) backs the portal's new "Connected apps" section: app name, brand, who authorized, created date, last-used date, revoke action.

OAuth Service — Build Plan

4.1 Libraries (verified against docs 2026-06-10)

Concern	Library	Notes
OAuth 2.1 AS (DCR, PKCE, token issuance, refresh rotation, grant storage)	`@cloudflare/workers-oauth-provider` v0.7.2 (2026-06-04)	The library Linear/Sentry/Intercom/Stripe launched on. Wraps the whole AS surface: `new OAuthProvider({ apiRoute: "/mcp", apiHandler, defaultHandler, authorizeEndpoint, tokenEndpoint, clientRegistrationEndpoint })` — this exact pattern is Cloudflare's documented MCP-authorization recipe. `accessTokenTTL` default 3600s (matches the PRD's ≤1h); `refreshTokenTTL` default 30d. Refresh rotation: each use issues a new token and invalidates the older of at most two concurrently-valid refresh tokens (deliberate retry-tolerance design — satisfies Claude's rotation requirement; not strict single-use). Grants/clients/codes in Workers KV (binding `OAUTH_KV`); per-grant `props` (where the bridged Roster token lives) are "end-to-end encrypted… with the secret token as key material — impossible to derive from storage unless a valid token is provided" (README). RFC 8414 + 9728 metadata, RFC 7591 DCR, PKCE (disable `allowPlainPKCE` for S256-only), and CIMD already shipped behind `clientIdMetadataDocumentEnabled` — the PRD's "fast-follow" is a config flag.
MCP server + transport	Cloudflare `agents` `McpAgent` (wraps `@modelcontextprotocol/sdk`, TypeScript)	`McpAgent.serve("/mcp")` "handles Streamable HTTP transport automatically" (docs) — keepalive past the ~5-min edge idle-stream watchdog, `Last-Event-ID` stream recovery. Each client session is a Durable Object with hibernation enabled by default (sleeps when idle — near-zero idle cost). Plugs in as the `apiHandler` of `workers-oauth-provider`; the validated grant's `props` (brand id, bridged token) arrive typed via the agent's third generic param and are read as `this.props` inside every tool handler.
Consent UI	Portal-hosted (PRD D-10): new `/connect/claude` route in `web-app-brand-portal`	The Worker never renders login or handles credentials. The portal route reuses the existing login (password + social SSO + lockout), shows consent + brand picker, and hands back a one-time connect ticket. This is the same external-IdP shape the Cloudflare docs show for Stytch/Auth0 — with the Brand Portal as the IdP. Supersedes the Worker-rendered consent page in earlier drafts; decided 2026-06-11 after verifying SSO-required brands reject password auth (`UserService.AuthenticateUser` → `isSSOError`).
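
The library settings named in the table can be collected into one options object, sketched below. Option names are taken from the table; the handlers (`apiHandler`, `defaultHandler`) are omitted, the TTLs are assumed to be expressed in seconds, and `phase1PolicyViolations` is a hypothetical helper rather than part of the library.

```typescript
const DAY_SECONDS = 24 * 60 * 60;

// Settings that would be merged with apiHandler/defaultHandler when
// constructing the provider. Values restate the table; they are not new policy.
const oauthOptions = {
  apiRoute: "/mcp",
  authorizeEndpoint: "/authorize",
  tokenEndpoint: "/token",
  clientRegistrationEndpoint: "/register",
  accessTokenTTL: 3600,                   // PRD: access tokens live at most 1h
  refreshTokenTTL: 30 * DAY_SECONDS,      // 30 days
  allowPlainPKCE: false,                  // S256 only
  clientIdMetadataDocumentEnabled: false, // CIMD is the PRD's fast-follow
};

// Hypothetical guard that could run in a unit test or at deploy time.
function phase1PolicyViolations(o: typeof oauthOptions): string[] {
  const problems: string[] = [];
  if (o.accessTokenTTL > 3600) problems.push("access token TTL exceeds 1h");
  if (o.allowPlainPKCE) problems.push("plain PKCE must be disabled");
  if (o.apiRoute !== "/mcp") problems.push("MCP route must be /mcp");
  return problems;
}
```

Pinning the defaults explicitly means a library upgrade that changes a default cannot silently relax the PRD's token lifetime.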

4.2 Consent flow — concrete sequence

Claude hits POST /mcp unauthenticated → 401 + WWW-Authenticate → discovers metadata → POST /register (DCR) → browser to GET /authorize?client_id…&code_challenge…&redirect_uri=https://claude.ai/api/mcp/auth_callback. Redirect allowlist: claude.ai callback + loopback http://localhost:* (Claude Code) + whatever DCR registers (https or loopback only).
(D-10, decided 2026-06-11) The Worker stores the pending request and redirects the browser to app.getroster.com/connect/claude?request_id=…, a new route in web-app-brand-portal. The route uses the existing portal login — password, social SSO (USER_SETTING_SSO_REQUIREMENT brands work), and the 10-failures/30-min lockout, all already built; users with a live portal session skip login entirely. Brand enumeration comes from the SPA's own session data (sessionUser.systemUserAccess).
The portal renders consent + brand picker (skipped for single-brand logins). On Approve, a new App.WebApi endpoint — running under the consenting user's standard auth, so brand rights are validated at issue time — mints a one-time connect ticket (single-use, ~60s TTL, bound to {sessionUserId, brandUserId, request_id}) and redirects back to the Worker's callback with it.
The Worker exchanges the ticket at the bridge endpoint (§3.2, service-secret auth): the bridge validates + consumes the ticket and mints the long-lived brand ApiSession token with the consenting user's authority. The Worker re-checks the brand allowlist fail-closed (Phase-1 rollout gate, KV config → friendly "not enabled for your account yet"), then stores the token in the grant's encrypted props along with {brand_user_id, brand_name, brand_domain, authorized_by, granted_at} (this snapshot also feeds the get_connection_info tool), and the api_session_id in the grant's unencrypted metadata — props can't be decrypted without a client token, and the idle sweep (§5.1) needs the id to expire the session.
Library issues the auth code → redirect → Claude exchanges at POST /token (PKCE verified) → access token (1h) + rotating refresh token. Every subsequent POST /mcp call: library validates the token, decrypts props, hands the tool layer the bridged Roster credential. The Roster token never leaves the Worker.
Deny → standard OAuth access_denied redirect; no partial grants. All auth endpoints must respond in under 10s (Claude hard limit). Worker cold starts are on the order of milliseconds, so the slowest step is the upstream portal-login call.
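
Two rules from the sequence above lend themselves to a short sketch: the redirect allowlist (https or loopback only) and the one-time connect ticket (single-use, short TTL, bound to the request). This is a model under assumptions, not the actual implementation: the real ticket logic lives on the Roster side, the names are hypothetical, the loopback host list is an assumption, and ticket ids are expected to be unguessable random values supplied by the caller.

```typescript
const LOOPBACK_HOSTS = ["localhost", "127.0.0.1"]; // assumption

// Accepts https redirect URIs and http URIs on a loopback host only.
function isAllowedRedirectUri(uri: string): boolean {
  let u: URL;
  try {
    u = new URL(uri);
  } catch (e) {
    return false;
  }
  if (u.protocol === "https:") return true;
  if (u.protocol === "http:") return LOOPBACK_HOSTS.indexOf(u.hostname) !== -1;
  return false;
}

interface TicketBinding {
  sessionUserId: number;
  brandUserId: number;
  requestId: string;
}

class ConnectTicketStore {
  private ttlMs: number;
  private tickets = new Map<string, { binding: TicketBinding; expiresAtMs: number }>();

  constructor(ttlMs: number = 60000) {
    this.ttlMs = ttlMs;
  }

  // Records a ticket minted after the user approves consent.
  mint(ticketId: string, binding: TicketBinding, nowMs: number): void {
    this.tickets.set(ticketId, { binding, expiresAtMs: nowMs + this.ttlMs });
  }

  // Returns the binding once; any later or invalid attempt returns null.
  consume(ticketId: string, requestId: string, nowMs: number): TicketBinding | null {
    const entry = this.tickets.get(ticketId);
    if (entry === undefined) return null;
    this.tickets.delete(ticketId); // burn on first presentation, valid or not
    if (nowMs >= entry.expiresAtMs) return null;
    if (entry.binding.requestId !== requestId) return null;
    return entry.binding;
  }
}
```

Deleting the ticket before validating it means a failed exchange also burns the ticket, so a leaked ticket cannot be retried against a different request.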

4.3 Storage & secrets

Item	Where
OAuth clients (DCR), grants, auth codes, refresh-token families	Workers KV, binding `OAUTH_KV` (managed by `workers-oauth-provider`)
Bridged Roster `ApiSession` token + brand context	Encrypted `props` inside the grant record (library-encrypted)
Brand allowlist (Phase-1 gate)	KV key, editable via wrangler/admin script
Secrets: bridge-endpoint service secret, `ARCHBEE_API_KEY`, cookie-signing key	Worker secrets (`wrangler secret put`)
Tool-call + auth event logs	Workers Analytics Engine (per-tool metrics for the success KPIs) + Logpush → existing log sink

No relational DB required on the MCP side. The only Roster-side data-model change is the ApiSession source flag (§3.3).

Revocation, Environments & Deploys

5.1 Revocation — both directions

Portal side (Connected Apps page): new App.WebApi endpoint lists/revokes MCP-flagged ApiSession rows (SourceTypeId filter) and calls a Worker admin route (service-secret auth) to delete the grant + refresh family. Next Claude call → 401 → re-auth prompt.
Claude side (user disconnects): claude.ai sends no signal at all — no RFC 7009 revocation, no grant deletion, no request (verified live against the deployed staging skeleton, 2026-06-12; PRD D-11). Cleanup is an idle-grant sweep: the Worker records last-used per grant in KV on every /mcp call (also feeds the Connected-apps "last used" column), and a daily Cron Trigger deletes grants with no token activity for 35 days (past the 30-day refresh TTL, so provably dead) and expires their bridged ApiSession rows via the bridge endpoint (api_session_id from grant metadata). An RFC 7009 revocation endpoint stays live in case Claude ever starts revoking. Residual risk accepted: a disconnected grant's unused, server-held token survives ≤35 days; portal revoke is the immediate cutoff.
Kill switch (PRD rollback): disable /authorize (env flag) + bulk-expire MCP-flagged ApiSession rows; the Worker is fully decoupled from portal serving.
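
The idle-grant sweep described above might be structured as follows. This is a sketch with hypothetical names: the record shape, the injected `SweepDeps` callbacks (standing in for the bridge endpoint and the grant store), and the choice to expire the ApiSession before deleting the grant are all assumptions.

```typescript
interface GrantActivity {
  grantId: string;
  apiSessionId: string; // from the grant's unencrypted metadata
  lastUsedMs: number;   // recorded on every /mcp call
}

const DAY_MS = 24 * 60 * 60 * 1000;
const IDLE_DAYS = 35; // past the 30-day refresh TTL

// Grants with no activity for more than IDLE_DAYS.
function selectIdleGrants(grants: GrantActivity[], nowMs: number): GrantActivity[] {
  const cutoff = nowMs - IDLE_DAYS * DAY_MS;
  return grants.filter((g) => g.lastUsedMs < cutoff);
}

interface SweepDeps {
  expireApiSession(apiSessionId: string): Promise<void>;
  deleteGrant(grantId: string): Promise<void>;
}

// Body of the daily cron run. The session is expired first so that a failure
// leaves the grant (and its api_session_id) in place for the next run.
async function sweepIdleGrants(
  grants: GrantActivity[],
  nowMs: number,
  deps: SweepDeps
): Promise<string[]> {
  const removed: string[] = [];
  for (const g of selectIdleGrants(grants, nowMs)) {
    try {
      await deps.expireApiSession(g.apiSessionId);
      await deps.deleteGrant(g.grantId);
      removed.push(g.grantId);
    } catch (e) {
      // Skip this grant; it stays eligible for the next daily sweep.
    }
  }
  return removed;
}
```

Ordering the two calls this way avoids orphaning a live ApiSession whose id would otherwise be lost with the deleted grant.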

5.2 Environments, CI, limits

Envs: staging.mcp.getroster.com (wrangler env staging, pointed at TEST API/DB) and prod. Staging is what the integration/E2E suites and MCP Inspector run against.
CI/CD: GitHub Actions → wrangler deploy per env; secrets from repo environments. Rollback = redeploy previous version (Workers keeps versions).
Limits check: KV (grants) and Durable Objects (MCP sessions) are well within tier limits at 20-brand scale; no concern before thousands of connections.

Alternative: .NET House Stack

ASP.NET Core service on Azure App Service + OpenIddict (or Duende IdentityServer, licensed) for the AS + the ModelContextProtocol C# SDK for the MCP server; grant store in Azure SQL/Table Storage; same bridge/consent design. Pros: one runtime, existing Azure pipelines, in-house C# depth. Cons: DCR, refresh rotation, and resource metadata must be assembled from primitives (weeks of auth-surface work that the Cloudflare library provides out of the box), and the Streamable HTTP session plumbing must be built by hand. Recommendation: Cloudflare for Phase 1; this design ports to .NET without changing any external contract if engineering prefers that later.

Observability

Tool-call log row: {brand_user_id, tool_name, params_shape (keys only — no PII values), latency_ms, upstream_status, result_rows, truncated} — emitted from the MCP service; no Roster-side work.
Auth events: grants, denials, refreshes, revocations, refresh-token reuse detections.
Alerting: OAuth endpoint p95 > 5s; tool error rate > 5%; upstream 429 spikes.
Success-metric mapping: adoption = distinct brand_user_id with ≥1 grant; engagement = distinct brands with ≥1 tool call in trailing 7 days.
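
A minimal sketch of the log row and the engagement metric, assuming numeric brand ids and a per-call timestamp supplied by the logging layer (the row listed above has no timestamp field, so `atMs` is hypothetical, as are the function names).

```typescript
interface ToolCallLogRow {
  brand_user_id: number;
  tool_name: string;
  params_shape: string[]; // keys only, never values
  latency_ms: number;
  upstream_status: number;
  result_rows: number;
  truncated: boolean;
}

// Reduces call parameters to their sorted key names so no PII is logged.
function paramsShape(params: { [key: string]: unknown }): string[] {
  return Object.keys(params).sort();
}

function buildLogRow(
  brandUserId: number,
  toolName: string,
  params: { [key: string]: unknown },
  outcome: { latencyMs: number; upstreamStatus: number; resultRows: number; truncated: boolean }
): ToolCallLogRow {
  return {
    brand_user_id: brandUserId,
    tool_name: toolName,
    params_shape: paramsShape(params),
    latency_ms: outcome.latencyMs,
    upstream_status: outcome.upstreamStatus,
    result_rows: outcome.resultRows,
    truncated: outcome.truncated,
  };
}

interface TimedCall { brand_user_id: number; atMs: number }

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Engagement: distinct brands with at least one tool call in the trailing 7 days.
function engagedBrandCount(calls: TimedCall[], nowMs: number): number {
  const cutoff = nowMs - WEEK_MS;
  const seen: { [id: string]: true } = {};
  for (const c of calls) {
    if (c.atMs > cutoff && c.atMs <= nowMs) seen[String(c.brand_user_id)] = true;
  }
  return Object.keys(seen).length;
}
```

Logging only the parameter keys keeps values such as email filters out of the analytics store while still showing which parameters each tool is called with.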