How to build a retry-safe Shopify event pipeline with Rails

A practical architecture guide for Rails teams building retry-safe Shopify event pipelines with webhooks, queues, reconciliation jobs, and durable side-effect tracking.

Updated March 12, 2026
18 min read
Editorial note: This guide assumes Shopify webhooks entering a Rails app over HTTPS, then moving through Active Job or Sidekiq. The same downstream contract also works when ingress is Pub/Sub or EventBridge.

What retry-safe actually means

Retry-safe does not mean “my job queue retries exceptions.” That is table stakes. A retry-safe Shopify pipeline means the whole path from receipt to side effect stays correct when Shopify sends duplicates, retries late, omits a delivery entirely, or your own workers crash halfway through a job.

In other words, retry safety is a system property. If your controller is idempotent but your ticket creation is not, you do not have a retry-safe pipeline. You have one reliable layer and one unreliable one.

Shopify’s current webhook behavior makes this non-optional. Shopify recommends duplicate detection with X-Shopify-Event-Id, expects HTTPS handlers to respond in under five seconds, retries failed deliveries up to eight times over four hours, warns that delayed delivery can happen, and explicitly recommends reconciliation because webhook delivery is not guaranteed.

“Webhook delivery isn't always guaranteed.”

The working rule

Treat Shopify webhooks as at-least-once event triggers that can be delayed, duplicated, or missed. Then design every downstream step so a replay is boring.

The four layers of a reliable pipeline

Reliable Shopify event processing in Rails usually needs four layers: receipt, normalization, business processing, and reconciliation. Collapse them into one controller action or one giant queue job and the system will work right up until the first noisy retry storm, partial outage, or accidental double-send. Which, in fairness, is how many weekend migrations become spiritual experiences.

| Layer | What it does | What it must persist | Main anti-pattern |
|---|---|---|---|
| Receipt | Verify origin, capture headers and payload, store the event, return quickly | Deduplication key, raw body, headers, receipt status | Doing business logic in the controller |
| Normalization | Map Shopify topics and payload shapes into app-level event intents | Canonical event row, business identifiers, timestamps, replay status | Letting every topic branch into bespoke worker spaghetti |
| Business processing | Update local state, fetch Shopify state when needed, schedule side effects | Business state transitions, locks, failure status, handoff markers | Assuming queue uniqueness equals correctness |
| Reconciliation | Pull recent changes from Shopify and feed them back through the same handlers | Cursors, lookback windows, drift counts, repair runs | Building a second code path that drifts from the webhook path |

The reason to separate these layers is operational clarity. When an order sync is wrong, you want to know whether the event never arrived, arrived but failed HMAC, normalized badly, processed twice, or was simply missed and later repaired. Systems that keep that answer only in logs are usually one incident away from archaeology.

Acknowledge fast, process later

Shopify’s HTTPS delivery flow gives you less than five seconds to establish the connection and return a response. That budget is not there so you can squeeze in HMAC verification, a few remote API calls, a Postgres migration, and a brief moment of introspection. It is there so you can verify, persist, enqueue, and acknowledge.

This is why good ingress code is aggressively boring. It should not decide business meaning. It should make a durable handoff. If you need more than a few hundred milliseconds to handle a webhook, you are almost certainly doing work in the wrong layer.

“Shopify expects to establish the connection and receive your response in less than five seconds.”

In Rails, the ingress contract is usually: read the raw body once, verify the HMAC using your app secret, extract the routing headers, write a receipt row under a unique index, enqueue the normalization job, then return 200. If the insert hits a duplicate constraint, that is still a successful receipt and should still return 200.

# config/routes.rb
post "/webhooks/shopify", to: "webhooks/shopify#create"
# app/controllers/webhooks/shopify_controller.rb
class Webhooks::ShopifyController < ActionController::API
  InvalidShopifyWebhook = Class.new(StandardError)

  def create
    raw_body = request.raw_post
    verify_shopify_hmac!(raw_body)
 
    receipt = ShopifyWebhookReceipt.record!(
      shop_domain: request.headers["HTTP_X_SHOPIFY_SHOP_DOMAIN"],
      topic: request.headers["HTTP_X_SHOPIFY_TOPIC"],
      event_id: request.headers["HTTP_X_SHOPIFY_EVENT_ID"],
      subscription_name: request.headers["HTTP_X_SHOPIFY_NAME"],
      triggered_at: request.headers["HTTP_X_SHOPIFY_TRIGGERED_AT"],
      api_version: request.headers["HTTP_X_SHOPIFY_API_VERSION"],
      headers: webhook_headers,
      payload: JSON.parse(raw_body)
    )
 
    NormalizeShopifyReceiptJob.perform_later(receipt.id)
    head :ok
  rescue ShopifyWebhookReceipt::DuplicateReceipt
    head :ok
  rescue InvalidShopifyWebhook
    head :unauthorized
  end
 
  private
 
  def webhook_headers
    request.headers.env.select { |key, _| key.start_with?("HTTP_X_SHOPIFY_") }
  end
 
  def verify_shopify_hmac!(raw_body)
    digest = OpenSSL::HMAC.digest(
      "sha256",
      ENV.fetch("SHOPIFY_API_SECRET"),
      raw_body
    )
 
    expected = Base64.strict_encode64(digest)
    actual = request.headers["HTTP_X_SHOPIFY_HMAC_SHA256"].to_s
 
    unless ActiveSupport::SecurityUtils.secure_compare(expected, actual)
      raise InvalidShopifyWebhook, "HMAC verification failed"
    end
  end
end

The key detail is durability before acknowledgement. If you return 200 before the event is durably recorded, you told Shopify everything is fine while your pipeline still lives in RAM and optimism.

Also note the header selection. Shopify uses X-Shopify-Event-Id for duplicate detection, X-Shopify-Triggered-At to help you reason about stale retries, and X-Shopify-Name to echo the subscription name you configured. That last one matters when you intentionally run multiple subscriptions that observe the same topic for different purposes.
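
The `ShopifyWebhookReceipt.record!` call in the controller is where the dedupe actually happens. One assumed shape for it, written as a mixin so the contract is visible on its own: insert the receipt and let the unique delivery index decide whether this delivery is a duplicate, translating the database error into the `DuplicateReceipt` the controller rescues.

```ruby
# Assumed sketch of ShopifyWebhookReceipt.record!: the unique delivery index
# is the source of truth for "have we seen this delivery?".
module RecordsWebhookReceipts
  DuplicateReceipt = Class.new(StandardError)

  def self.included(model)
    model.extend(ClassMethods)
  end

  module ClassMethods
    def record!(**attrs)
      create!(status: "received", **attrs)
    rescue ActiveRecord::RecordNotUnique
      # The same delivery was already persisted. That is still a successful
      # receipt, so the controller can keep returning 200.
      raise DuplicateReceipt
    end
  end
end

# In the real model:
#   class ShopifyWebhookReceipt < ApplicationRecord
#     include RecordsWebhookReceipts
#   end
```

Letting the index raise, rather than doing a check-then-insert, keeps the dedupe race-free under concurrent retries.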

Define a canonical event envelope

The raw webhook is not your domain event. It is a Shopify delivery envelope. Good pipelines convert that envelope into a canonical app-level event record with a stable shape. This is the point where you stop letting every topic improvise its own lifecycle.

A canonical envelope usually contains the shop, topic, event identifier, subscription name, source timestamps, business identifiers extracted from the payload, and the normalization state. You can store the original payload forever or on a retention window, but the normalized record should contain the fields your app actually reasons about.

# db/migrate/20260312120000_create_shopify_webhook_receipts.rb
class CreateShopifyWebhookReceipts < ActiveRecord::Migration[8.0]
  def change
    create_table :shopify_webhook_receipts, id: :uuid do |t|
      t.string :shop_domain, null: false
      t.string :topic, null: false
      t.string :subscription_name
      t.string :event_id, null: false
      t.string :api_version
      t.datetime :triggered_at
      t.string :status, null: false, default: "received"
      t.jsonb :headers, null: false, default: {}
      t.jsonb :payload, null: false, default: {}
      t.text :last_error_message
      t.datetime :normalized_at
      t.datetime :processed_at
      t.timestamps
    end
 
    add_index :shopify_webhook_receipts,
      [:shop_domain, :topic, :subscription_name, :event_id],
      unique: true,
      name: "idx_shopify_receipts_unique_delivery"
  end
end
# db/migrate/20260312120500_create_domain_events.rb
class CreateDomainEvents < ActiveRecord::Migration[8.0]
  def change
    create_table :domain_events, id: :uuid do |t|
      t.string :kind, null: false
      t.string :aggregate_type, null: false
      t.string :aggregate_key, null: false
      t.uuid :source_receipt_id, null: false
      t.string :dedupe_key, null: false
      t.jsonb :data, null: false, default: {}
      t.string :status, null: false, default: "pending"
      t.datetime :processed_at
      t.text :last_error_message
      t.timestamps
    end
 
    add_index :domain_events, :source_receipt_id
    add_index :domain_events, :dedupe_key, unique: true
  end
end

Why two tables? Because “we received this webhook” and “we normalized this into business work” are not the same fact. Separate them and replays become straightforward. Keep them fused and every replay becomes interpretive dance.

The unique index for receipt dedupe should reflect your actual subscription design. Shopify notes that if you have more than one subscription for the same topic, you can receive several webhooks for the same event, one per subscription. In that setup, dedupe on event_id alone would be too aggressive, so include your subscription dimension.

Normalize into business work, not controller chaos

Normalization is where you translate a Shopify topic into the work your app actually cares about. That usually means one of two patterns.

  • Snapshot sync pattern: the webhook means “this resource changed, sync current state now.”

  • Action trigger pattern: the webhook means “attempt a specific business action, but only if guards still pass.”

Most apps should prefer the first pattern for important Shopify resources. Why? Because retried webhooks are delivered with the original payload from when they were triggered, and Shopify also warns that webhook delays can happen. If you can afford one extra API read, treating the webhook as a trigger to fetch current state is often safer than trusting the payload blindly.

# app/jobs/normalize_shopify_receipt_job.rb
class NormalizeShopifyReceiptJob < ApplicationJob
  queue_as :webhooks
 
  retry_on StandardError, wait: :polynomially_longer, attempts: 10
 
  def perform(receipt_id)
    receipt = ShopifyWebhookReceipt.find(receipt_id)
    return if receipt.normalized?
 
    case receipt.topic
    when "orders/create", "orders/updated", "orders/paid", "orders/fulfilled"
      Orders::NormalizeWebhookReceipt.call(receipt)
    when "products/create", "products/update", "products/delete"
      Products::NormalizeWebhookReceipt.call(receipt)
    else
      GenericShopifyNormalizer.call(receipt)
    end
  end
end
# app/services/orders/normalize_webhook_receipt.rb
module Orders
  class NormalizeWebhookReceipt
    def self.call(receipt)
      order_gid = receipt.payload.fetch("admin_graphql_api_id")
 
      # A retried normalization may already have created this event; the
      # unique index on dedupe_key makes the replay a no-op instead of a crash.
      begin
        DomainEvent.create!(
          kind: "shopify.order.touched",
          aggregate_type: "ShopifyOrder",
          aggregate_key: order_gid,
          source_receipt_id: receipt.id,
          dedupe_key: ["shopify.order.touched", receipt.shop_domain, order_gid, receipt.event_id].join(":"),
          data: {
            topic: receipt.topic,
            shop_domain: receipt.shop_domain,
            triggered_at: receipt.triggered_at,
            payload_updated_at: receipt.payload["updated_at"]
          }
        )
      rescue ActiveRecord::RecordNotUnique
        # Duplicate delivery, or a job retry that already got this far.
      end
 
      receipt.update!(status: "normalized", normalized_at: Time.current)
      SyncShopifyOrderJob.perform_later(order_gid, receipt.shop_domain)
    end
  end
end

The processing job that follows should be organized around the business object, not around the webhook topic. That is the big mental shift. Your app usually cares that an order now needs to be synced, not that it was specifically orders/paid versus orders/updated in this one delivery. Topic still matters as context, but business code should be attached to business objects.
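
As a sketch of the `SyncShopifyOrderJob` the normalizer enqueues: `ApplicationJob` inheritance and the real Admin API client are omitted so the example stands alone, and `fetch_order` and `upsert` are assumed seams onto your own client and models. The point is the order of operations: treat the event as a trigger, fetch current state, and refuse regressive writes.

```ruby
require "time"

# Hypothetical processing job, organized around the business object (an
# order needs syncing), not around which webhook topic triggered it.
class SyncShopifyOrderJob
  def initialize(fetch_order:, upsert:)
    @fetch_order = fetch_order # (order_gid, shop_domain) -> {"id"=>..., "updatedAt"=>...}
    @upsert = upsert           # (snapshot) -> persists the snapshot locally
  end

  def perform(order_gid, shop_domain, known_updated_at: nil)
    snapshot = @fetch_order.call(order_gid, shop_domain)
    incoming = Time.iso8601(snapshot.fetch("updatedAt"))

    # Monotonic guard: a late replay of state we already hold is a no-op.
    return :skipped_stale if known_updated_at && incoming <= known_updated_at

    @upsert.call(snapshot)
    :synced
  end
end
```

Because the job keys off the order GID and current state, running it twice for the same order is boring, which is exactly the property the rest of the pipeline depends on.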

Also, be careful with transaction boundaries. Modern Rails can defer Active Job enqueueing until after the surrounding transaction commits. That is excellent, because “write row, enqueue job, then roll back the row” is one of those bugs that only appears when you are already having a bad day.

Track side effects as first-class state

The hardest part of retry safety is not parsing JSON. It is controlling side effects. If a pipeline can send emails, create helpdesk tickets, push data to an ERP, tag a customer, or notify Slack, each of those actions needs a durable record of whether it has already happened for the relevant business key.

This is where many otherwise decent webhook systems become chaos machines. The controller is idempotent. The normalizer is idempotent. Then the job calls an external API twice and the merchant gets two tickets, two emails, or two “urgent” Slack alerts about the same order. Now everybody is looking at logs like they are tea leaves.

The fix is to store side effects in their own ledger table with a semantic dedupe key. Not a queue-level uniqueness hint. A real row that says whether the effect is pending, applied, failed, or cancelled.

# db/migrate/20260312121000_create_side_effect_executions.rb
class CreateSideEffectExecutions < ActiveRecord::Migration[8.0]
  def change
    create_table :side_effect_executions, id: :uuid do |t|
      t.string :effect_type, null: false
      t.string :dedupe_key, null: false
      t.string :state, null: false, default: "pending"
      t.string :provider_reference
      t.integer :attempt_count, null: false, default: 0
      t.datetime :applied_at
      t.datetime :next_attempt_at
      t.jsonb :request_payload, null: false, default: {}
      t.jsonb :response_payload, null: false, default: {}
      t.text :last_error_message
      t.timestamps
    end
 
    add_index :side_effect_executions, :dedupe_key, unique: true
  end
end
# app/services/side_effects/create_helpdesk_ticket.rb
module SideEffects
  class CreateHelpdeskTicket
    def self.call(order:)
      dedupe_key = [
        "helpdesk-ticket",
        order.shop_id,
        order.shopify_order_id,
        order.support_ticket_reason
      ].join(":")
 
      effect = SideEffectExecution.find_or_initialize_by(dedupe_key: dedupe_key)
      return effect if effect.applied?
 
      effect.effect_type ||= "helpdesk_ticket"

      # jsonb columns default to {}, which is truthy, so ||= would never
      # assign here; guard on blank? instead.
      if effect.request_payload.blank?
        effect.request_payload = {
          order_id: order.shopify_order_id,
          reason: order.support_ticket_reason
        }
      end

      effect.attempt_count += 1
      effect.save!
 
      response = HelpdeskClient.create_ticket(effect.request_payload)
 
      effect.update!(
        state: "applied",
        provider_reference: response.fetch("id"),
        response_payload: response,
        applied_at: Time.current
      )
 
      effect
    rescue HelpdeskClient::TemporaryError => e
      effect&.update!(
        state: "failed",
        next_attempt_at: 10.minutes.from_now,
        last_error_message: e.message
      )
      raise
    end
  end
end

This pattern does two useful things. First, it gives retries somewhere authoritative to look. Second, it makes reconciliation possible for effects too. If the external provider went down, you can later scan failed effects and re-dispatch them without replaying the whole webhook universe.
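
The re-dispatch scan reduces to a small, framework-free predicate over the ledger (hash keys here mirror the `side_effect_executions` columns; the surrounding job and dispatcher are assumed): pick failed effects whose backoff has elapsed, oldest first, and hand each one back to the same service that ran it the first time.

```ruby
# Sketch of the sweep a side-effect ledger enables: which failed effects
# are due for another attempt right now?
def effects_due_for_retry(effects, now:)
  effects
    .select { |e| e[:state] == "failed" && e[:next_attempt_at] && e[:next_attempt_at] <= now }
    .sort_by { |e| e[:next_attempt_at] }
end
```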

Exactly once is not the contract

Sidekiq explicitly documents at-least-once execution, not exactly-once. Queue uniqueness can reduce noise, but it does not replace idempotent handlers and durable side-effect state.

In practice, the cleanest shape is often an outbox-like handoff. Your business transaction marks that an effect should happen, and a separate dispatcher performs it. This reduces the blast radius when your database commit succeeds but the downstream API flakes out, or vice versa.
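That handoff can be sketched as an in-memory state machine (names are illustrative, not a library API): the business transaction records intent under a dedupe key, and a separate dispatcher performs the effect where retries can see prior state.

```ruby
# Minimal outbox sketch: durable intent first, side effect second.
class Outbox
  def initialize
    @entries = {}
  end

  # Called inside the business transaction: records intent, no side effect
  # yet. Duplicate intent under the same key collapses to one entry.
  def enqueue(dedupe_key, payload)
    @entries[dedupe_key] ||= { payload: payload, state: "pending" }
  end

  # Called by the dispatcher loop. Entries are marked applied as they run,
  # so running the dispatcher twice cannot double-fire an effect.
  def dispatch_all
    @entries.each do |key, entry|
      next unless entry[:state] == "pending"
      yield(key, entry[:payload])
      entry[:state] = "applied"
    end
  end
end
```

In a real system the entries live in a database table and the dispatcher is a recurring job, but the contract is the same: the write that records intent and the write that marks completion are both durable, and the effect itself sits between them.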

Use reconciliation to close the gaps

Shopify recommends reconciliation because webhook delivery is not guaranteed. That single fact should permanently end any architecture discussion where webhooks are treated as an infallible event log.

Reconciliation means scheduled jobs that fetch recent changes from Shopify, identify drift, and push the repaired data through the same normalizers or processors your webhook path already uses. The critical phrase there is the same. If your repair job has its own business logic, you are building a second system that will eventually disagree with the first one.

“Your app shouldn't rely solely on receiving data from Shopify webhooks.”

A common pattern is a per-shop cursor plus a safety lookback window. For example, every ten minutes, ask Shopify for orders updated since the earlier of your last successful cursor and a rolling lookback such as thirty minutes ago. The overlap is intentional. Your dedupe keys and side effect ledgers are what make overlapping windows safe.

# app/jobs/reconcile_shopify_orders_job.rb
class ReconcileShopifyOrdersJob < ApplicationJob
  queue_as :reconciliation
 
  def perform(shop_id)
    shop = Shop.find(shop_id)
    cursor = shop.reconciliation_cursors.find_or_create_by!(stream: "orders")

    # Start from the earlier of the cursor and the safety lookback so
    # overlapping windows cover any gap; dedupe makes the overlap cheap.
    window_start = [cursor.last_seen_at || 2.hours.ago, 30.minutes.ago].min

    Orders::PullRecentChanges.call(shop:, updated_after: window_start)

    # Advance the cursor only after a successful pull, so a failed run
    # replays the same window on the next attempt.
    cursor.update!(last_seen_at: Time.current)
  end
end
module Orders
  class PullRecentChanges
    QUERY = <<~GRAPHQL
      query($after: String, $search: String!) {
        orders(first: 100, after: $after, query: $search, sortKey: UPDATED_AT) {
          pageInfo { hasNextPage endCursor }
          edges {
            node {
              id
              updatedAt
              name
              displayFinancialStatus
              displayFulfillmentStatus
            }
          }
        }
      }
    GRAPHQL
 
    def self.call(shop:, updated_after:)
      search = "updated_at:>='#{updated_after.utc.iso8601}'"
      after = nil
 
      loop do
        result = ShopifyAdminGraphql.call(shop:, query: QUERY, variables: { after:, search: })
        edges = result.dig("data", "orders", "edges") || []
 
        edges.each do |edge|
          node = edge.fetch("node")
          SyncShopifyOrderJob.perform_later(node.fetch("id"), shop.domain)
        end
 
        page_info = result.dig("data", "orders", "pageInfo") || {}
        break unless page_info["hasNextPage"]
 
        after = page_info["endCursor"]
      end
    end
  end
end

For high-volume backfills or broad repairs, Shopify Bulk Operations are often the better tool. They avoid pagination churn, respect Shopify’s async model, and in API versions 2026-01 and newer, Shopify allows up to five bulk query operations at a time per shop. Use offline tokens for long-running bulk work.
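
For reference, a bulk backfill starts with the `bulkOperationRunQuery` mutation; the inner orders query and its date are assumed examples. Shopify runs the operation asynchronously, and completion is typically observed via the `bulk_operations/finish` webhook topic rather than tight polling.

```ruby
# Sketch of a bulk backfill request for the GraphQL Admin API.
BULK_ORDERS_BACKFILL = <<~GRAPHQL
  mutation {
    bulkOperationRunQuery(
      query: """
      {
        orders(query: "updated_at:>='2026-03-01T00:00:00Z'") {
          edges { node { id updatedAt displayFinancialStatus } }
        }
      }
      """
    ) {
      bulkOperation { id status }
      userErrors { field message }
    }
  }
GRAPHQL
```

The result file Shopify produces should be fed through the same normalizers as the webhook path, for the same reason reconciliation reuses them: one business code path, however the data arrived.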

Reconciliation also gives you a clean outage recovery story. If your pipeline was degraded for two hours, widen the window, replay through the same business path, and let idempotency absorb the overlap. This is dramatically nicer than telling a merchant, “good news, we found the logs.”

Common production failure modes

Most webhook pipelines do not fail because JSON is hard. They fail because the design quietly assumed a guarantee the platform never made. These are the failure modes that show up again and again.

  • Returning 200 before durable persistence. This loses work when your process dies after acknowledgement.

  • Using queue uniqueness as the main dedupe strategy. This hides symptoms but does not make downstream effects safe.

  • Trusting stale payloads. Shopify retries with the original payload, and delayed deliveries can happen, so blindly applying a payload can move your state backward.

  • No subscription dimension in dedupe. If you run multiple subscriptions for the same topic, dedupe on event_id alone can drop legitimate deliveries.

  • No reconciliation loop. This turns one missed delivery into permanent drift.

  • Per-topic business logic duplication. This creates four slightly different order sync paths that all disagree by next quarter.

  • No stale-event guard. Compare X-Shopify-Triggered-At and resource timestamps against what you already know before applying a regressive mutation.

  • Webhook fanout causing API throttling. Shopify’s GraphQL Admin API uses a cost model with per-app, per-store buckets, so a retry burst that triggers eager follow-up reads can rate-limit itself into a second incident.
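
The stale-event guard from the list reduces to one comparison before any mutation (argument names here are illustrative): is the delivery's trigger time older than the state you already hold?

```ruby
require "time"

# Framework-free sketch of a stale-event guard. triggered_at_header is the
# raw X-Shopify-Triggered-At value; known_synced_at is whatever timestamp
# your local record last synced under (nil if never synced).
def stale_delivery?(triggered_at_header, known_synced_at)
  return false if known_synced_at.nil?
  Time.iso8601(triggered_at_header) < known_synced_at
end
```

A stale delivery should still be acknowledged and recorded; the guard only decides whether to apply its mutation.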

There are two especially sneaky production edges worth calling out.

Subscription migrations can leak retries to the old endpoint

Shopify documents that retried webhooks go to the address configured when the webhook was originally triggered. If you rotate endpoints, keep both live briefly. Otherwise your shiny new endpoint will be healthy while the old one quietly receives the retries you still needed.

Volume reduction features help, but they do not fix correctness

Shopify lets you use webhook filters and payload customization to reduce event volume and trim payload contents. That is excellent for throughput and cost. It is not a substitute for idempotent processing. A smaller broken pipeline is still broken, just with better bandwidth.

Testing the ugly cases

If your tests only verify the happy path, your pipeline is not tested. It is being complimented. Reliable event systems need failure-shape tests. The useful question is not “does this job run?” but “what happens when it runs twice, late, and after the external API already succeeded once?”

| Case | What to assert |
|---|---|
| Duplicate webhook delivery | The second receipt returns 200 and does not create new business work |
| Worker crash after external call | Replay does not duplicate the side effect because the ledger key already exists |
| Delayed stale retry | Current resource snapshot or timestamp guard prevents regressive updates |
| Missed webhook | Reconciliation rediscovers the resource and feeds it into the same handler |
| GraphQL throttling during burst | Backoff and retry behavior keep progress monotonic rather than cascading failure |
| Endpoint migration during retries | Both endpoints remain valid long enough that old-address retries are still accepted |
# spec/services/side_effects/create_helpdesk_ticket_spec.rb
RSpec.describe SideEffects::CreateHelpdeskTicket do
  let(:order) { create(:order, shopify_order_id: "gid://shopify/Order/123") }
 
  it "does not create the same external ticket twice" do
    allow(HelpdeskClient).to receive(:create_ticket).and_return({ "id" => "T-1" })
 
    described_class.call(order: order)
    described_class.call(order: order)
 
    expect(HelpdeskClient).to have_received(:create_ticket).once
    expect(SideEffectExecution.count).to eq(1)
    expect(SideEffectExecution.first).to be_applied
  end
end

Instrumentation matters too. Rails gives you Active Support instrumentation hooks, and Shopify gives you webhook delivery metrics in the Dev Dashboard for recent deliveries. Track receipt latency, queue latency, dedupe hit rates, processing failure counts, stale delivery counts, and reconciliation drift. When an incident happens, good metrics turn a witch hunt into a boring graph review, which is the nicest compliment you can pay an incident.


FAQ

Should I trust the webhook payload as the final source of truth?

Usually no. Treat the webhook as a trigger plus useful context. For critical state, especially after delays or retries, fetch current Shopify state or apply a version and timestamp rule before mutating your own records.

Is X-Shopify-Event-Id enough for dedupe?

It is the right starting point for duplicate delivery detection, but it is not the whole design. You still need idempotent business handlers and durable side-effect dedupe, because the same business object can legitimately arrive in many different webhook events.

Can I return 200 immediately and process later?

Yes, and in practice you usually should. Shopify expects a response within five seconds, so the safer pattern is verify, persist, enqueue, and acknowledge. That only works if your persistence layer is durable and your downstream processing is replay-safe.

When do reconciliation jobs become mandatory?

As soon as missing an event would create bad merchant state, lost automations, or duplicate side effects. Shopify explicitly recommends reconciliation because webhook delivery is not guaranteed.
