# Chord CDP: Event deduplication
## Delivery guarantee

The Chord CDP provides **at-least-once delivery** for every event. This means the CDP attempts to deliver each event at least once and may retry when failures occur, so in rare cases the same event may be delivered more than once.

Duplicates are typically caused by transient events in the underlying message pipeline:

- A processing instance restarting (deployment, autoscaling, crash) before its progress is acknowledged
- A network blip or pause that causes an in-flight message to be redelivered
- An upstream producer retry after an unconfirmed delivery
- Temporary failures in downstream services that trigger replays

This is the same delivery model used by the major customer data platforms (Segment, RudderStack, mParticle), and it is the standard guarantee for streaming event pipelines.

## Current deduplication policy

The Chord CDP does not perform message-level deduplication within the pipeline itself. Instead, deduplication is the responsibility of the downstream destination.

For every event the CDP processes, the source-provided `messageId` is preserved end to end and made available to every destination. Destinations should use the `messageId` (or a deterministic value derived from it; some destinations require a transformed identifier, e.g., a `messageId` with a suffix) to detect and reject duplicates on their side.

## How destinations should handle duplicates

| Destination type | Recommended approach |
| --- | --- |
| Data warehouses (Snowflake, BigQuery, Postgres, Redshift, ClickHouse, etc.) | Configure deduplication at the destination using `messageId` as the primary key. The Chord CDP supports per-connection deduplication options (`deduplicate` + `primaryKey`) that perform a merge/upsert on each load. ClickHouse destinations collapse duplicates on background merges via the `ReplacingMergeTree` table engine. |
| HTTP APIs (Braze, Klaviyo, Insider, Stripe, etc.) | Pass `event.messageId` (or a destination-specific derivation of it) as an idempotency key in the API request, typically via the `Idempotency-Key` HTTP header. Most modern APIs will reject or ignore requests with a previously seen idempotency key. |
| Reverse ETL / loopback destinations | Filter by `messageId` in the destination system, or rely on the destination's natural primary-key constraints. |

## Why deduplicate at the destination?

Destination-side deduplication is the conventional pattern for streaming data systems, for several reasons:

- **Destinations have authoritative state.** A warehouse already knows whether a row with a given primary key exists; an HTTP API already knows whether it has processed a given idempotency key. Asking these systems to detect duplicates is more reliable than maintaining a parallel record elsewhere.
- **Destinations are diverse.** Different destinations have different definitions of "duplicate": some merge by primary key, some upsert by composite key, some collapse duplicates in background processes. A pipeline-level dedup can't capture this nuance.
- **It avoids a single point of failure.** A pipeline-level dedup store would be a critical-path dependency; if it became slow or unavailable, the entire pipeline would degrade. Pushing dedup to destinations keeps the pipeline fast and stateless.
- **It aligns with how streaming pipelines work.** At-least-once is the default delivery semantic of the underlying streaming infrastructure; building exactly-once on top requires complex coordination that introduces its own failure modes.

## Implications for custom functions (UDFs)

UDFs run inside the CDP pipeline, before events are dispatched to destinations. Because the CDP is at-least-once, a UDF may execute more than once for the same source event in rare redelivery scenarios.

For UDFs that are pure transformations (enrichment, filtering, splitting), this is harmless: the destination will still deduplicate the result by `messageId`.

However, UDFs that perform external write operations (calling a third-party API that modifies state, incrementing a remote counter, sending an email, writing to a database) should be designed to be idempotent. The standard
approach is to pass `event.messageId` as an idempotency key to the external system:

```javascript
export default async function (event, { fetch }) {
  await fetch("https://api.example.com/orders", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Idempotency-Key": event.messageId, // ← prevents duplicate writes on redelivery
    },
    body: JSON.stringify({ /* ... */ }),
  });
  return event;
}
```

Without an idempotency key, a redelivered event could cause the UDF's external write to occur twice.

## Working with `messageId`

Every event flowing through the CDP carries a `messageId` field. If the source provides one, it is preserved end to end; if not, the CDP generates a unique identifier at ingest time.

The `messageId` is:

- **Stable**: the same value is forwarded to every destination.
- **Unique**: within a reasonable window (system-generated unique identifiers, or source-provided identifiers that the source guarantees unique).
- **Available everywhere**: exposed on the event object inside UDFs as `event.messageId`, included in the payloads sent to destinations, and visible in Live Events.

Use it as the canonical key for any deduplication, tracing, or correlation downstream.

## Related topics

- For details on configuring per-connection deduplication options for a warehouse destination, see your destination configuration page in the Chord console.
- For guidance on writing idempotent UDFs, see the CDP Functions documentation.

## Support

For questions about deduplication, delivery guarantees, or how to configure your destination, please contact [help@chord.co](mailto:help@chord.co).
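To make the destination-side behavior described above concrete, here is a minimal sketch of how an HTTP API can use an idempotency key to reject a redelivered request. This is not Chord code: the `seenKeys` store and `handleRequest` helper are hypothetical stand-ins for a real API's durable idempotency-key store and request handler.

```javascript
// Sketch of destination-side idempotency handling. A real API would
// persist seen keys durably (typically with a retention window);
// an in-memory Set stands in for that store here.
const seenKeys = new Set();

function handleRequest(idempotencyKey, applyWrite) {
  if (seenKeys.has(idempotencyKey)) {
    // Duplicate delivery: acknowledge without re-applying the write.
    return { status: "duplicate_ignored" };
  }
  seenKeys.add(idempotencyKey);
  applyWrite(); // the state-modifying side effect runs exactly once per key
  return { status: "applied" };
}
```

Because the CDP forwards the same `messageId` on every delivery attempt, a redelivered event presents the same idempotency key and the write is applied only once, no matter how many times the request arrives.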
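The warehouse-side merge/upsert keyed on `messageId` can be sketched the same way. This is an in-memory analogy, not actual warehouse SQL or Chord's loader: the `loadBatch` helper is hypothetical, and a `Map` keyed by `messageId` stands in for a table whose primary key is `messageId`.

```javascript
// Sketch of upsert-by-primary-key semantics: loading the same batch
// twice leaves exactly one row per messageId, so redelivered events
// collapse instead of duplicating.
const table = new Map();

function loadBatch(rows) {
  for (const row of rows) {
    // Merge/upsert: the row for an already-seen messageId is replaced,
    // not appended, so duplicates never accumulate.
    table.set(row.messageId, row);
  }
}
```

This is why configuring `deduplicate` with `messageId` as the primary key makes redelivery invisible downstream: a replayed load is absorbed by the same keys it wrote the first time.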