Predictive Marketing Attribution

Introduction

Marketing attribution is becoming increasingly complex as parts of the customer journey become obscured. Privacy regulations such as GDPR and CCPA, along with platform changes like Apple’s App Tracking Transparency and the phase-out of third-party cookies, have limited the ability to track users across apps, websites, and devices. As a result, some key interactions, such as ad views, referrals, and cross-device behavior, may go unrecorded, making it more challenging to see the full path to conversion.

Web sessions play a vital role in attribution, customer lifetime value (CLV) analysis, A/B testing, and spend optimization. However, more and more sessions lack recorded source information, resulting in dark data sessions. A dark data session is not invisible traffic; it is traffic with hidden origins.

Businesses that rely heavily on attribution and optimization can account for this growing blind spot by using methods like predictive or probabilistic attribution, identity resolution, and server-side tracking.

Dark Data Sessions

A dark data session refers to a website or app session where the source and attribution information are missing, incomplete, or untraceable. In other words, it's a visit where you can’t tell how or why the user arrived, making it difficult to link that session back to a specific marketing campaign, channel, or customer journey stage.

🔍 Common Causes of Dark Data Sessions

- Privacy restrictions (e.g., Apple’s App Tracking Transparency, browser settings)
- Cookie blocking or deletion
- Users coming from untrackable sources (e.g., SMS, native mobile apps, secure email clients)
- Cross-device behavior without identity stitching
- URL parameter stripping (e.g., missing UTM tags)

⚠️ Why It Matters

Dark sessions can represent a varying share of ecommerce
traffic. They obscure the effectiveness of marketing efforts. They make it harder to optimize spend, personalize experiences, or measure ROI.

Dark data sessions now account for a growing share of ecommerce activity, but with thoughtful modeling, enriched data, and identity resolution strategies, businesses can begin to illuminate these hidden touchpoints and make smarter, more informed decisions.

Predictive Marketing Attribution

Chord utilizes look-alike propensity modeling to assign a probabilistic source and channel to unknown sessions. While the solution can be technically complicated, the idea couldn’t be simpler: we shouldn’t throw away information that we already have. If sessions that come from similar sources have similar attributes (and the data strongly suggest that this is the case), then we can pipe our known data into a model that gives us a much better guess at how folks are getting to our media.

Probabilistic Matching

Using a series of feature engineering transformers, data modeling, and multi-class boosted modeling, we associate sessions that have unknown sources with their most likely sources. Notice that the output of the multi-class model is a probability for each of the potential sources, which reflects the underlying uncertainty.

Feature Engineering

The data used for this analysis came from CDP client-side eventing for the session portion and from the order management system for the orders. Observations are at the session or order level, depending on the model, and the features are broken into the following:

- Numerical transforms: attributes such as duration of session, number of associated sessions (by anonymous ID), and activity in the session
- Categorical transforms: date and time-of-day features, as well as session attributes such as device used

When we applied this model to orders, we were able to add all of the above (where a session was observed), as well as numerical and categorical transformers on order data and line-item transforms such as multiclass
transforms (products purchased, discount codes applied).

All in all, this yields a rich dataset of session and order observations for their respective models.

Model Outputs

The multi-class model for source prediction estimates Pr(source) = f(features), where the probability of each potential source is estimated for each session with an unknown source. For instance, based on its input features, a single session that has an unknown source will get a prediction such as {‘google’: 60%, ‘facebook’: 20%, ‘ig’: 15%, …}, and so on for all potential source targets, with the probabilities summing to 1 for that session.

The above graph shows the relationship between the sessions guessed in-class correctly (true positives) and those guessed in-class incorrectly (false positives) in a one-vs-all framework. Without getting into the weeds, this graph indicates that the rank model is getting clear separation (making good guesses) compared to a naive guess across all potential source targets. These metrics indicate that this is a strong model whose features have reproducible underlying relationships with the targets, and that we can have high confidence in the rankings of predictions.

Evaluation of Model Inputs

A deep dive is out of the scope of this analysis, but it is worth calling out that the features of this model are themselves useful for correlation analysis that can be leveraged in root-cause studies. For instance, we can take a look at the correlations across the entire population of sessions. Here we see that, compared to other classes, duration of activity and activity count have strong correlations with a session being labeled as ‘google’. Again, this is just a hint to go further into root-cause analysis, but it is mainly presented to show that the model itself has additive value beyond pure source prediction.

Business Impact

Predictive marketing attribution takes our “dark data” and probabilistically pours it into its best-guess bucket. In the above graph, we see the unknown (null) bucket on the far
right, and the blue bars on the left show how much we have reallocated into each bucket based on the sessions' underlying features. Looking at the impact on the top five sources, we see economically significant changes: nearly 2M sessions (16% of total sessions) and an associated $12M of “dark data” revenue (28% of total revenue) have been probabilistically attributed to Google, with similar percentage-wise impacts across the board.

In Conclusion

Web sessions are becoming increasingly opaque, leaving companies blind to a significant portion of session attribution. To address this, we developed a predictive attribution model that uses behavioral and contextual features to estimate the likely source of sessions and conversions. The end result added best-guess attribution to over 50% of previously unknown conversions and provided greater visibility into source inflow for non-converted sessions. These probabilistic attributions now power more accurate segmentation campaigns, CLV and ROI analyses, and retargeting strategies. The model is fully productionalized, running as a daily batch process across millions of web sessions and hundreds of thousands of online orders, enabling clearer insight into incremental marketing impact and more efficient spend optimization.
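
To make the reallocation step more concrete, here is a minimal Python sketch of how per-session source probabilities could be rolled up into per-source session and revenue totals. It is an illustration under assumptions, not Chord's implementation: the `predictions` data, the source names, and the helper functions `allocate_hard` and `allocate_soft` are all hypothetical, and the article does not say whether each unknown session is assigned wholly to its most likely source or split fractionally across sources, so both variants are shown.

```python
from collections import defaultdict

# Hypothetical model output for three unknown-source sessions. Each entry holds
# the session's revenue and one probability per candidate source (summing to 1).
# All numbers and source names are made up for illustration.
predictions = [
    {"revenue": 100.0, "probs": {"google": 0.60, "facebook": 0.25, "ig": 0.15}},
    {"revenue": 0.0, "probs": {"google": 0.20, "facebook": 0.50, "ig": 0.30}},
    {"revenue": 40.0, "probs": {"google": 0.10, "facebook": 0.15, "ig": 0.75}},
]


def allocate_hard(sessions):
    """Assign each session, and all of its revenue, to its most likely source."""
    totals = defaultdict(lambda: {"sessions": 0.0, "revenue": 0.0})
    for s in sessions:
        best = max(s["probs"], key=s["probs"].get)
        totals[best]["sessions"] += 1
        totals[best]["revenue"] += s["revenue"]
    return dict(totals)


def allocate_soft(sessions):
    """Split each session and its revenue across sources by probability."""
    totals = defaultdict(lambda: {"sessions": 0.0, "revenue": 0.0})
    for s in sessions:
        for source, p in s["probs"].items():
            totals[source]["sessions"] += p
            totals[source]["revenue"] += p * s["revenue"]
    return dict(totals)


hard = allocate_hard(predictions)
soft = allocate_soft(predictions)

for name, result in (("hard", hard), ("soft", soft)):
    for source, t in sorted(result.items()):
        print(f"{name:4s} {source:8s} sessions={t['sessions']:.2f} revenue={t['revenue']:.2f}")
```

Both variants preserve the total number of sessions and the total revenue. The fractional split keeps the model's uncertainty in the aggregate numbers, while the most-likely-source assignment gives each session a single label that is easier to use in downstream segmentation.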