Developer guide
How to build a retry-safe Shopify event pipeline with Rails
A practical architecture guide for Rails teams building retry-safe Shopify event pipelines with webhooks, queues, reconciliation jobs, and durable side-effect tracking.
What retry-safe actually means
Retry-safe does not mean “my job queue retries exceptions.” That is table stakes. A retry-safe Shopify pipeline means the whole path from receipt to side effect stays correct when Shopify sends duplicates, retries late, omits a delivery entirely, or your own workers crash halfway through a job.
In other words, retry safety is a system property. If your controller is idempotent but your ticket creation is not, you do not have a retry-safe pipeline. You have one reliable layer and one unreliable one.
Shopify’s current webhook behavior makes this non-optional. Shopify recommends duplicate
detection with X-Shopify-Event-Id, expects HTTPS handlers to respond in under five
seconds, retries failed deliveries up to eight times over four hours, warns that delayed
delivery can happen, and explicitly recommends reconciliation because webhook delivery is not
guaranteed.
> “Webhook delivery isn't always guaranteed.”
The working rule
Treat Shopify webhooks as at-least-once event triggers that can be delayed, duplicated, or missed. Then design every downstream step so a replay is boring.
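To make “a replay is boring” concrete, here is a minimal, framework-free sketch. The names are illustrative, not from a real codebase: the point is that applying the same event twice must leave state unchanged.

```ruby
# Idempotent apply: replaying the same event is a no-op because the tag set
# is merged with array union. `state` and `event` are plain hashes standing
# in for persisted records.
def apply_order_tag(state, event)
  state.merge(tags: state.fetch(:tags, []) | [event.fetch(:tag)])
end

once  = apply_order_tag({ tags: [] }, { tag: "vip" })
twice = apply_order_tag(once, { tag: "vip" })
# A duplicate delivery changes nothing: once == twice
```

Every layer discussed below is a version of this property, just with databases and external APIs in the loop.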
The four layers of a reliable pipeline
Reliable Shopify event processing in Rails usually needs four layers: receipt, normalization, business processing, and reconciliation. Collapse them into one controller action or one giant queue job and the system will work right up until the first noisy retry storm, partial outage, or accidental double-send. Which, in fairness, is how many weekend migrations become spiritual experiences.
| Layer | What it does | What it must persist | Main anti-pattern |
|---|---|---|---|
| Receipt | Verify origin, capture headers and payload, store the event, return quickly | Deduplication key, raw body, headers, receipt status | Doing business logic in the controller |
| Normalization | Map Shopify topics and payload shapes into app-level event intents | Canonical event row, business identifiers, timestamps, replay status | Letting every topic branch into bespoke worker spaghetti |
| Business processing | Update local state, fetch Shopify state when needed, schedule side effects | Business state transitions, locks, failure status, handoff markers | Assuming queue uniqueness equals correctness |
| Reconciliation | Pull recent changes from Shopify and feed them back through the same handlers | Cursors, lookback windows, drift counts, repair runs | Building a second code path that drifts from the webhook path |
The reason to separate these layers is operational clarity. When an order sync is wrong, you want to know whether the event never arrived, arrived but failed HMAC, normalized badly, processed twice, or was simply missed and later repaired. Systems that keep that answer only in logs are usually one incident away from archaeology.
Acknowledge fast, process later
Shopify’s HTTPS delivery flow gives you less than five seconds to establish the connection and return a response. That budget is not there so you can squeeze in HMAC verification, a few remote API calls, a Postgres migration, and a brief moment of introspection. It is there so you can verify, persist, enqueue, and acknowledge.
This is why good ingress code is aggressively boring. It should not decide business meaning. It should make a durable handoff. If you need more than a few hundred milliseconds to handle a webhook, you are almost certainly doing work in the wrong layer.
> “Shopify expects to establish the connection and receive your response in less than five seconds.”
In Rails, the ingress contract is usually: read the raw body once, verify the HMAC using your
app secret, extract the routing headers, write a receipt row under a unique index, enqueue the
normalization job, then return 200. If the insert hits a duplicate constraint,
that is still a successful receipt and should still return 200.
```ruby
# config/routes.rb
post "/webhooks/shopify", to: "webhooks/shopify#create"
```

```ruby
# app/controllers/webhooks/shopify_controller.rb
class Webhooks::ShopifyController < ActionController::API
  def create
    raw_body = request.raw_post
    verify_shopify_hmac!(raw_body)

    receipt = ShopifyWebhookReceipt.record!(
      shop_domain: request.headers["HTTP_X_SHOPIFY_SHOP_DOMAIN"],
      topic: request.headers["HTTP_X_SHOPIFY_TOPIC"],
      event_id: request.headers["HTTP_X_SHOPIFY_EVENT_ID"],
      subscription_name: request.headers["HTTP_X_SHOPIFY_NAME"],
      triggered_at: request.headers["HTTP_X_SHOPIFY_TRIGGERED_AT"],
      api_version: request.headers["HTTP_X_SHOPIFY_API_VERSION"],
      headers: webhook_headers,
      payload: JSON.parse(raw_body)
    )

    NormalizeShopifyReceiptJob.perform_later(receipt.id)
    head :ok
  rescue ShopifyWebhookReceipt::DuplicateReceipt
    head :ok
  rescue InvalidShopifyWebhook
    head :unauthorized
  end

  private

  def webhook_headers
    request.headers.env.select { |key, _| key.start_with?("HTTP_X_SHOPIFY_") }
  end

  def verify_shopify_hmac!(raw_body)
    digest = OpenSSL::HMAC.digest(
      "sha256",
      ENV.fetch("SHOPIFY_API_SECRET"),
      raw_body
    )
    expected = Base64.strict_encode64(digest)
    actual = request.headers["HTTP_X_SHOPIFY_HMAC_SHA256"].to_s

    unless ActiveSupport::SecurityUtils.secure_compare(expected, actual)
      raise InvalidShopifyWebhook, "HMAC verification failed"
    end
  end
end
```

The key detail is durability before acknowledgement. If you return 200 before the event is durably recorded, you told Shopify everything is fine while your pipeline still lives in RAM and optimism.
Also note the header selection. Shopify uses X-Shopify-Event-Id for duplicate
detection, X-Shopify-Triggered-At to help you reason about stale retries, and
X-Shopify-Name to echo the subscription name you configured. That last one matters
when you intentionally run multiple subscriptions that observe the same topic for different
purposes.
Define a canonical event envelope
The raw webhook is not your domain event. It is a Shopify delivery envelope. Good pipelines convert that envelope into a canonical app-level event record with a stable shape. This is the point where you stop letting every topic improvise its own lifecycle.
A canonical envelope usually contains the shop, topic, event identifier, subscription name, source timestamps, business identifiers extracted from the payload, and the normalization state. You can store the original payload forever or on a retention window, but the normalized record should contain the fields your app actually reasons about.
```ruby
# db/migrate/20260312120000_create_shopify_webhook_receipts.rb
class CreateShopifyWebhookReceipts < ActiveRecord::Migration[8.0]
  def change
    create_table :shopify_webhook_receipts, id: :uuid do |t|
      t.string :shop_domain, null: false
      t.string :topic, null: false
      t.string :subscription_name
      t.string :event_id, null: false
      t.string :api_version
      t.datetime :triggered_at
      t.string :status, null: false, default: "received"
      t.jsonb :headers, null: false, default: {}
      t.jsonb :payload, null: false, default: {}
      t.text :last_error_message
      t.datetime :normalized_at
      t.datetime :processed_at
      t.timestamps
    end

    add_index :shopify_webhook_receipts,
              [:shop_domain, :topic, :subscription_name, :event_id],
              unique: true,
              name: "idx_shopify_receipts_unique_delivery"
  end
end
```

```ruby
# db/migrate/20260312120500_create_domain_events.rb
class CreateDomainEvents < ActiveRecord::Migration[8.0]
  def change
    create_table :domain_events, id: :uuid do |t|
      t.string :kind, null: false
      t.string :aggregate_type, null: false
      t.string :aggregate_key, null: false
      t.uuid :source_receipt_id, null: false
      t.string :dedupe_key, null: false
      t.jsonb :data, null: false, default: {}
      t.string :status, null: false, default: "pending"
      t.datetime :processed_at
      t.text :last_error_message
      t.timestamps
    end

    add_index :domain_events, :source_receipt_id
    add_index :domain_events, :dedupe_key, unique: true
  end
end
```

Why two tables? Because “we received this webhook” and “we normalized this into business work” are not the same fact. Separate them and replays become straightforward. Keep them fused and every replay becomes interpretive dance.
The unique index for receipt dedupe should reflect your actual subscription design. Shopify notes that if you have more than one subscription for the same topic, you can receive several webhooks for the same event, one per subscription. In that setup, dedupe on `event_id` alone would be too aggressive, so include your subscription dimension.
Normalize into business work, not controller chaos
Normalization is where you translate a Shopify topic into the work your app actually cares about. That usually means one of two patterns.
Snapshot sync pattern: the webhook means “this resource changed, sync current state now.”
Action trigger pattern: the webhook means “attempt a specific business action, but only if guards still pass.”
Most apps should prefer the first pattern for important Shopify resources. Why? Because retried webhooks are delivered with the original payload from when they were triggered, and Shopify also warns that webhook delays can happen. If you can afford one extra API read, treating the webhook as a trigger to fetch current state is often safer than trusting the payload blindly.
```ruby
# app/jobs/normalize_shopify_receipt_job.rb
class NormalizeShopifyReceiptJob < ApplicationJob
  queue_as :webhooks
  retry_on StandardError, wait: :polynomially_longer, attempts: 10

  def perform(receipt_id)
    receipt = ShopifyWebhookReceipt.find(receipt_id)
    return if receipt.normalized?

    case receipt.topic
    when "orders/create", "orders/updated", "orders/paid", "orders/fulfilled"
      Orders::NormalizeWebhookReceipt.call(receipt)
    when "products/create", "products/update", "products/delete"
      Products::NormalizeWebhookReceipt.call(receipt)
    else
      GenericShopifyNormalizer.call(receipt)
    end
  end
end
```

```ruby
# app/services/orders/normalize_webhook_receipt.rb
module Orders
  class NormalizeWebhookReceipt
    def self.call(receipt)
      order_gid = receipt.payload.fetch("admin_graphql_api_id")

      DomainEvent.create!(
        kind: "shopify.order.touched",
        aggregate_type: "ShopifyOrder",
        aggregate_key: order_gid,
        source_receipt_id: receipt.id,
        dedupe_key: ["shopify.order.touched", receipt.shop_domain, order_gid, receipt.event_id].join(":"),
        data: {
          topic: receipt.topic,
          shop_domain: receipt.shop_domain,
          triggered_at: receipt.triggered_at,
          payload_updated_at: receipt.payload["updated_at"]
        }
      )

      receipt.update!(status: "normalized", normalized_at: Time.current)
      SyncShopifyOrderJob.perform_later(order_gid, receipt.shop_domain)
    rescue ActiveRecord::RecordNotUnique
      # Replay after a crash between create! and the receipt update: the
      # domain event row already exists, so finish the bookkeeping instead
      # of failing the whole job.
      receipt.update!(status: "normalized", normalized_at: Time.current)
      SyncShopifyOrderJob.perform_later(order_gid, receipt.shop_domain)
    end
  end
end
```

The processing job that follows should be organized around the business object, not around the webhook topic. That is the big mental shift. Your app usually cares that an order now needs to be synced, not that it was specifically orders/paid versus orders/updated in this one delivery. Topic still matters as context, but business code should be attached to business objects.
Also, be careful with transaction boundaries. Modern Rails can defer Active Job enqueueing until after the surrounding transaction commits. That is excellent, because “write row, enqueue job, then roll back the row” is one of those bugs that only appears when you are already having a bad day.
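As a pointer rather than a recipe, Rails 7.2+ exposes a per-job setting for this; the exact knob has shifted across versions, so verify it against the release notes for your Rails version before relying on it.

```ruby
# app/jobs/application_job.rb — hedged configuration sketch.
# On Rails 7.2 and newer, Active Job can defer enqueues until the
# surrounding database transaction commits, which closes the
# "write row, enqueue job, roll back row" window.
class ApplicationJob < ActiveJob::Base
  if respond_to?(:enqueue_after_transaction_commit=)
    self.enqueue_after_transaction_commit = :always
  end
end
```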
Track side effects as first-class state
The hardest part of retry safety is not parsing JSON. It is controlling side effects. If a pipeline can send emails, create helpdesk tickets, push data to an ERP, tag a customer, or notify Slack, each of those actions needs a durable record of whether it has already happened for the relevant business key.
This is where many otherwise decent webhook systems become chaos machines. The controller is idempotent. The normalizer is idempotent. Then the job calls an external API twice and the merchant gets two tickets, two emails, or two “urgent” Slack alerts about the same order. Now everybody is looking at logs like they are tea leaves.
The fix is to store side effects in their own ledger table with a semantic dedupe key. Not a queue-level uniqueness hint. A real row that says whether the effect is pending, applied, failed, or cancelled.
```ruby
# db/migrate/20260312121000_create_side_effect_executions.rb
class CreateSideEffectExecutions < ActiveRecord::Migration[8.0]
  def change
    create_table :side_effect_executions, id: :uuid do |t|
      t.string :effect_type, null: false
      t.string :dedupe_key, null: false
      t.string :state, null: false, default: "pending"
      t.string :provider_reference
      t.integer :attempt_count, null: false, default: 0
      t.datetime :applied_at
      t.datetime :next_attempt_at
      t.jsonb :request_payload, null: false, default: {}
      t.jsonb :response_payload, null: false, default: {}
      t.text :last_error_message
      t.timestamps
    end

    add_index :side_effect_executions, :dedupe_key, unique: true
  end
end
```

```ruby
# app/services/side_effects/create_helpdesk_ticket.rb
module SideEffects
  class CreateHelpdeskTicket
    def self.call(order:)
      dedupe_key = [
        "helpdesk-ticket",
        order.shop_id,
        order.shopify_order_id,
        order.support_ticket_reason
      ].join(":")

      effect = SideEffectExecution.find_or_initialize_by(dedupe_key: dedupe_key)
      return effect if effect.applied?

      effect.effect_type ||= "helpdesk_ticket"
      # The jsonb column defaults to {}, which is truthy, so ||= would never
      # assign here. Check for blank instead.
      if effect.request_payload.blank?
        effect.request_payload = {
          order_id: order.shopify_order_id,
          reason: order.support_ticket_reason
        }
      end
      effect.attempt_count += 1
      effect.save!

      response = HelpdeskClient.create_ticket(effect.request_payload)

      effect.update!(
        state: "applied",
        provider_reference: response.fetch("id"),
        response_payload: response,
        applied_at: Time.current
      )
      effect
    rescue HelpdeskClient::TemporaryError => e
      effect&.update!(
        state: "failed",
        next_attempt_at: 10.minutes.from_now,
        last_error_message: e.message
      )
      raise
    end
  end
end
```

This pattern does two useful things. First, it gives retries somewhere authoritative to look. Second, it makes reconciliation possible for effects too. If the external provider went down, you can later scan failed effects and re-dispatch them without replaying the whole webhook universe.
Exactly once is not the contract
Sidekiq explicitly documents at-least-once execution, not exactly-once. Queue uniqueness can reduce noise, but it does not replace idempotent handlers and durable side-effect state.
In practice, the cleanest shape is often an outbox-like handoff. Your business transaction marks that an effect should happen, and a separate dispatcher performs it. This reduces the blast radius when your database commit succeeds but the downstream API flakes out, or vice versa.
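The selection rule a dispatcher run applies can be sketched framework-free. The `Effect` struct below stands in for a `side_effect_executions` row; field names mirror the migration above, but the helper itself is illustrative.

```ruby
# Stand-in for a side_effect_executions row.
Effect = Struct.new(:dedupe_key, :state, :next_attempt_at, keyword_init: true)

# An outbox dispatcher attempts pending effects, plus failed effects whose
# backoff window has elapsed. Applied and cancelled effects are never
# re-dispatched, which is what makes retrying the dispatcher itself safe.
def dispatchable(effects, now:)
  effects.select do |effect|
    effect.state == "pending" ||
      (effect.state == "failed" && effect.next_attempt_at && effect.next_attempt_at <= now)
  end
end
```

A scheduled job can run this query shape against the ledger table every few minutes; because each effect row carries its own dedupe key and state, overlapping dispatcher runs do not double-fire.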
Use reconciliation to close the gaps
Shopify recommends reconciliation because webhook delivery is not guaranteed. That single fact should permanently end any architecture discussion where webhooks are treated as an infallible event log.
Reconciliation means scheduled jobs that fetch recent changes from Shopify, identify drift, and push the repaired data through the same normalizers or processors your webhook path already uses. The critical phrase there is the same. If your repair job has its own business logic, you are building a second system that will eventually disagree with the first one.
> “Your app shouldn't rely solely on receiving data from Shopify webhooks.”
A common pattern is a per-shop cursor plus a safety lookback window. For example, every ten minutes, ask Shopify for orders updated since the earlier of your last successful cursor and a rolling lookback such as thirty minutes ago. The overlap is intentional. Your dedupe keys and side effect ledgers are what make overlapping windows safe.
```ruby
# app/jobs/reconcile_shopify_orders_job.rb
class ReconcileShopifyOrdersJob < ApplicationJob
  queue_as :reconciliation

  def perform(shop_id)
    shop = Shop.find(shop_id)
    cursor = shop.reconciliation_cursors.find_or_create_by!(stream: "orders")

    # Earliest of the cursor and a rolling lookback: a fresh cursor still
    # re-reads the last 30 minutes, and a stale one catches up from where
    # it left off (or from a 2-hour cold start).
    window_start = [cursor.last_seen_at || 2.hours.ago, 30.minutes.ago].min

    Orders::PullRecentChanges.call(shop:, updated_after: window_start)
    cursor.update!(last_seen_at: Time.current)
  end
end
```

```ruby
# app/services/orders/pull_recent_changes.rb
module Orders
  class PullRecentChanges
    QUERY = <<~GRAPHQL
      query($after: String, $search: String!) {
        orders(first: 100, after: $after, query: $search, sortKey: UPDATED_AT) {
          pageInfo { hasNextPage endCursor }
          edges {
            node {
              id
              updatedAt
              name
              displayFinancialStatus
              displayFulfillmentStatus
            }
          }
        }
      }
    GRAPHQL

    def self.call(shop:, updated_after:)
      search = "updated_at:>='#{updated_after.utc.iso8601}'"
      after = nil

      loop do
        result = ShopifyAdminGraphql.call(shop:, query: QUERY, variables: { after:, search: })
        edges = result.dig("data", "orders", "edges") || []

        edges.each do |edge|
          node = edge.fetch("node")
          SyncShopifyOrderJob.perform_later(node.fetch("id"), shop.domain)
        end

        page_info = result.dig("data", "orders", "pageInfo") || {}
        break unless page_info["hasNextPage"]

        after = page_info["endCursor"]
      end
    end
  end
end
```

For high-volume backfills or broad repairs, Shopify Bulk Operations are often the better tool. They avoid pagination churn, respect Shopify’s async model, and in API versions 2026-01 and newer, Shopify allows up to five bulk query operations at a time per shop. Use offline tokens for long-running bulk work.
Reconciliation also gives you a clean outage recovery story. If your pipeline was degraded for two hours, widen the window, replay through the same business path, and let idempotency absorb the overlap. This is dramatically nicer than telling a merchant, “good news, we found the logs.”
Common production failure modes
Most webhook pipelines do not fail because JSON is hard. They fail because the design quietly assumed a guarantee the platform never made. These are the failure modes that show up again and again.
- Returning `200` before durable persistence. This loses work when your process dies after acknowledgement.
- Using queue uniqueness as the main dedupe strategy. This hides symptoms but does not make downstream effects safe.
- Trusting stale payloads. Shopify retries with the original payload, and delayed deliveries can happen, so blindly applying a payload can move your state backward.
- No subscription dimension in dedupe. If you run multiple subscriptions for the same topic, dedupe on `event_id` alone can drop legitimate deliveries.
- No reconciliation loop. This turns one missed delivery into permanent drift.
- Per-topic business logic duplication. This creates four slightly different order sync paths that all disagree by next quarter.
- No stale-event guard. Compare `X-Shopify-Triggered-At` and resource timestamps against what you already know before applying a regressive mutation.
- Webhook fanout causing API throttling. Shopify’s GraphQL Admin API uses a cost model with per-app, per-store buckets, so a retry burst that triggers eager follow-up reads can rate-limit itself into a second incident.
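The stale-event guard in that list can be a one-line comparison. A hedged, framework-free sketch, assuming ISO 8601 timestamps like Shopify sends (the function name is illustrative):

```ruby
require "time"

# Apply an incoming snapshot only if it is at least as new as what we
# already persisted. Returns true when nothing is stored yet, since any
# data beats no data.
def fresh_event?(incoming_updated_at, stored_updated_at)
  return true if stored_updated_at.nil?
  Time.iso8601(incoming_updated_at) >= Time.iso8601(stored_updated_at)
end
```

In a snapshot-sync design you would run this check inside the sync job, comparing the freshly fetched resource timestamp against your local row before writing.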
There are two especially sneaky production edges worth calling out.
Subscription migrations can leak retries to the old endpoint
Shopify documents that retried webhooks go to the address configured when the webhook was originally triggered. If you rotate endpoints, keep both live briefly. Otherwise your shiny new endpoint will be healthy while the old one quietly receives the retries you still needed.
Volume reduction features help, but they do not fix correctness
Shopify lets you use webhook filters and payload customization to reduce event volume and trim payload contents. That is excellent for throughput and cost. It is not a substitute for idempotent processing. A smaller broken pipeline is still broken, just with better bandwidth.
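Filters are attached at subscription time. The mutation name comes from the sources below; the exact filter expression here is an assumption to verify against current Shopify docs for your topic.

```ruby
# Hypothetical sketch: creating a webhook subscription with a server-side
# filter so only matching events are delivered. The filter string syntax
# is illustrative and must be checked against Shopify's documentation.
CREATE_FILTERED_SUBSCRIPTION = <<~GRAPHQL
  mutation {
    webhookSubscriptionCreate(
      topic: ORDERS_UPDATED
      webhookSubscription: {
        callbackUrl: "https://example.com/webhooks/shopify"
        format: JSON
        filter: "total_price:>=100.0"
      }
    ) {
      webhookSubscription { id }
      userErrors { field message }
    }
  }
GRAPHQL
```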
Testing the ugly cases
If your tests only verify the happy path, your pipeline is not tested. It is being complimented. Reliable event systems need failure-shape tests. The useful question is not “does this job run?” but “what happens when it runs twice, late, and after the external API already succeeded once?”
| Case | What to assert |
|---|---|
| Duplicate webhook delivery | The second receipt returns 200 and does not create new business work |
| Worker crash after external call | Replay does not duplicate the side effect because the ledger key already exists |
| Delayed stale retry | Current resource snapshot or timestamp guard prevents regressive updates |
| Missed webhook | Reconciliation rediscovers the resource and feeds it into the same handler |
| GraphQL throttling during burst | Backoff and retry behavior keep progress monotonic rather than cascading failure |
| Endpoint migration during retries | Both endpoints remain valid long enough that old-address retries are still accepted |
```ruby
# spec/services/side_effects/create_helpdesk_ticket_spec.rb
RSpec.describe SideEffects::CreateHelpdeskTicket do
  let(:order) { create(:order, shopify_order_id: "gid://shopify/Order/123") }

  it "does not create the same external ticket twice" do
    allow(HelpdeskClient).to receive(:create_ticket).and_return({ "id" => "T-1" })

    described_class.call(order: order)
    described_class.call(order: order)

    expect(HelpdeskClient).to have_received(:create_ticket).once
    expect(SideEffectExecution.count).to eq(1)
    expect(SideEffectExecution.first).to be_applied
  end
end
```

Instrumentation matters too. Rails gives you Active Support instrumentation hooks, and Shopify gives you webhook delivery metrics in the Dev Dashboard for recent deliveries. Track receipt latency, queue latency, dedupe hit rates, processing failure counts, stale delivery counts, and reconciliation drift. When an incident happens, good metrics turn a witch hunt into a boring graph review, which is the nicest compliment you can pay an incident.
Sources and further reading
Shopify Dev: About webhooks
Shopify Dev: Best practices for webhooks
Shopify Dev: Deliver webhooks through HTTPS
Shopify Dev: Ignore duplicate webhooks
Shopify Dev Changelog: Updates to webhook retry mechanism
Shopify Dev: Troubleshooting webhooks
Shopify Admin GraphQL: webhookSubscriptionCreate
Shopify Dev: Filter your events
Shopify Dev: Modify your payloads
Shopify Dev: API limits
Shopify Dev: Perform bulk operations with the GraphQL Admin API
Shopify Dev: Subscribe to a webhook topic
Sidekiq Wiki: Best Practices
Rails Guides: Active Job Basics
Rails Guides: Rails 7.2 release notes
Rails Guides: Active Record Callbacks
Rails Guides: Active Support Instrumentation
FAQ
Should I trust the webhook payload as the final source of truth?
Usually no. Treat the webhook as a trigger plus useful context. For critical state, especially after delays or retries, fetch current Shopify state or apply a version and timestamp rule before mutating your own records.
Is X-Shopify-Event-Id enough for dedupe?
It is the right starting point for duplicate delivery detection, but it is not the whole design. You still need idempotent business handlers and durable side-effect dedupe, because the same business object can legitimately arrive in many different webhook events.
Can I return 200 immediately and process later?
Yes, and in practice you usually should. Shopify expects a response within five seconds, so the safer pattern is verify, persist, enqueue, and acknowledge. That only works if your persistence layer is durable and your downstream processing is replay-safe.
When do reconciliation jobs become mandatory?
As soon as missing an event would create bad merchant state, lost automations, or duplicate side effects. Shopify explicitly recommends reconciliation because webhook delivery is not guaranteed.
Related resources
Keep exploring the playbook
Shopify webhook idempotency in Rails
A Rails guide to Shopify webhook idempotency using event IDs, durable deduplication, and processing patterns that survive retries, duplicates, and delayed deliveries.
Calling a Rails API from a Shopify customer account extension
A practical guide to calling Rails endpoints from a Shopify Customer Account UI Extension, including session-token verification, endpoint design, and the requests that should not go through your backend at all.
Checkout UI Extensions with a custom backend architecture
A practical backend architecture guide for Shopify Checkout UI Extensions, covering when to call your own backend, how to keep the extension thin, and where teams overbuild the server side.