Note: This project involves a fictional e-commerce brand and uses a publicly available digital marketing dataset.
An e-commerce brand that sells luxury golf apparel is struggling to identify its target demographic. The company lacks a clear picture of its customer avatar (its ideal customer profile) and is therefore having a hard time directing its marketing efforts. Are men or women its most valuable customers? What is the age range of those customers? How much do they earn, and what are their buying characteristics? These are the questions the marketing director has been asking himself. Ultimately, he wants to know what kind of person the company should target to run the most profitable and effective campaigns moving forward.
What is the problem that can be solved? Given the available data, we can determine who the company should target in their marketing efforts. In other words, we can determine the kind of people who spend the most (i.e. have the most conversions) and use this to create campaigns that target the right kind of avatar. By solidifying the best avatar for the company to pursue we can help them achieve the highest ROI for their existing marketing efforts within their existing client base.
In terms of defining success metrics, we will consider conversion rate to be the key metric of success. We will use characteristics of the customer (such as age, income, and gender) and wider behavioral patterns of the customer (such as engagement with the website, past history with the company, response to marketing efforts) to determine the best avatar for the company to target.
This is where we'll look at the relationships in the data and determine a hypothesis as to who the best avatar is for the company to target. First we'll look at demographics to understand who converts the most (by looking at gender, age, and income). Next we’ll look at behavioral traits and understand how high converters behave (by looking at time on site, engagement, loyalty, etc.).
The problem is centered around identifying a high-converting avatar that the brand can use to craft future marketing efforts more effectively. This type of problem suggests the need for a classification model for our experiment. Classification is a supervised machine learning method where the model tries to predict the correct label of the provided input data. Classification models are trained using the training data and then evaluated on existing testing data before being used on new data. For this particular classification problem, we will use a logistic regression model to help predict the probability of whether a customer will convert or not.
Example of a logistic regression function
The logistic regression model predicts a probability between 0 and 1. With the default cutoff, values >= 0.5 are classified as a conversion and values < 0.5 as a non-conversion. The predicted probabilities can then be used to identify the characteristics, or avatar, of the highest-converting customers. The features that the model will use to predict probability of conversion are the following:
The logistic regression model is useful because it indicates which customers are likely to convert, after which we can analyze the traits those customers share. Its benefits include interpretability, scalability to many features, simplicity of training, and the ability to rank features by importance.
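To make the mechanics concrete, here is a minimal sketch of how a logistic regression turns a weighted sum of features into a probability and then into a yes/no prediction. The weights, bias, and feature values below are made up for illustration and are not the fitted values from this project.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_conversion(features, weights, bias, threshold=0.5):
    """Return (probability, predicted_label) for a single customer."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)

# Hypothetical weights and inputs, for illustration only.
weights = [0.8, 0.5]   # e.g. one weight per feature
bias = -0.2
customer = [1.0, 0.5]  # e.g. two scaled feature values

probability, label = predict_conversion(customer, weights, bias)
print(f"probability={probability:.3f}, predicted_label={label}")
```

A score of zero maps to exactly 0.5, which is why 0.5 is the natural default cutoff; the cutoff is a parameter here because it can be tuned.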
This dataset was already clean when provided so there was little to do from a data cleaning perspective. For data manipulation and feature engineering, the data needed to be sorted into the proper segmentation. Using Pandas, NumPy, and scikit-learn, the raw behavioral data was converted into model-ready variables. Several new behavioral features were engineered to better capture user intent and marketing engagement:
Categorical data such as Gender and age_group were converted to numerical formats. The data were split into training and testing sets.
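The encoding and split step might look roughly like the sketch below, using pandas and scikit-learn. The toy rows and the `Income` and `Conversion` column names are assumptions for illustration; only `Gender` and `age_group` are named in the write-up, and one-hot encoding is one reasonable choice rather than a confirmed detail of the original pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, illustrative only: 16 rows with 12 converters and 4 non-converters.
df = pd.DataFrame({
    "Gender": ["Male", "Female"] * 8,
    "age_group": ["25-34", "35-44", "45-54", "35-44"] * 4,
    "Income": [52, 88, 61, 97, 45, 110, 73, 84] * 2,   # hypothetical, in $1,000s
    "Conversion": [1, 1, 1, 0] * 4,                      # hypothetical target column
})

# One-hot encode the categorical columns; drop_first avoids a redundant column per category.
encoded = pd.get_dummies(df, columns=["Gender", "age_group"], drop_first=True, dtype=int)

X = encoded.drop(columns="Conversion")
y = encoded["Conversion"]

# stratify=y keeps the converter / non-converter ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(list(X.columns))
print(len(X_train), len(X_test))
```

Stratifying matters here because the target is imbalanced: without it, a small test set could end up with almost no examples of the rarer class.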
Training and evaluating the logistic regression model was an iterative process focused on addressing data quality challenges and improving the reliability of the results. The first baseline model, trained on the original dataset, showed that logistic regression was a strong choice for interpretability—but the results highlighted a major limitation in the data itself: the conversion variable was extremely imbalanced, with the vast majority of customers labeled as converters. This caused the baseline model to predict almost everything as the majority class, creating artificially high accuracy while producing nearly zero recall for the minority class.
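The effect described above is easy to reproduce with a tiny sketch. The 90/10 class split below is hypothetical and is not the actual distribution in the dataset; it only shows how a model that always predicts the majority class can look accurate while never finding the minority class.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, label):
    """Share of the rows that truly have `label` which were predicted as `label`."""
    predictions_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions_for_label) / len(predictions_for_label)

# Hypothetical imbalance: 90 converters (1) and 10 non-converters (0).
y_true = [1] * 90 + [0] * 10
# A degenerate model that predicts "converts" for everyone.
y_pred = [1] * 100

print("accuracy:", accuracy(y_true, y_pred))
print("recall, converters:", recall(y_true, y_pred, label=1))
print("recall, non-converters:", recall(y_true, y_pred, label=0))
```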
To correct this, I used SMOTE oversampling to balance the dataset during training. This ensured the model learned meaningful patterns for both converters and non-converters, significantly improving minority-class recall and providing more stable model behavior. Even after balancing, there remained considerable overlap between the two groups’ features, which capped overall performance, but the model became much more informative and actionable.
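For intuition, the core idea behind SMOTE can be sketched in a few lines: each synthetic row is placed at a random point on the line between a minority-class row and one of its nearest minority-class neighbours. This is a simplified illustration in plain Python, not the imbalanced-learn implementation used in practice, and the feature values are made up.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical (engagement, loyalty) rows for the minority class.
non_converters = [(0.2, 0.1), (0.3, 0.2), (0.25, 0.4), (0.1, 0.3), (0.4, 0.35)]
new_rows = smote_like(non_converters, n_new=10)
balanced_minority = non_converters + new_rows
print(len(balanced_minority))  # 5 original rows + 10 synthetic rows
```

Oversampling of this kind should be applied to the training split only, so that the test set still reflects the real class balance.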
Next, I optimized the decision threshold using F1-score rather than relying on the default 0.50 cutoff. This step had a major impact: the optimal threshold was approximately 0.27, which substantially improved the balance of precision and recall. This demonstrated that the model’s predictive value could be improved not just by adjusting the data, but by interpreting probabilities more strategically—an important consideration in marketing use cases where probability sensitivity often matters more than raw accuracy.
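A threshold search of this kind can be sketched as a simple sweep: compute the F1-score at each candidate cutoff and keep the best one. The labels and predicted probabilities below are hypothetical, so the winning threshold in this toy example is not the 0.27 found in the project.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1-score for the positive class when predicting 1 if prob >= threshold."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, probs, candidates):
    """Return the candidate cutoff with the highest F1-score (first one on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, probs, t))

# Hypothetical labels and model probabilities, for illustration only.
y_true = [1, 1, 1, 1, 0, 1, 0, 0]
probs = [0.9, 0.7, 0.45, 0.35, 0.4, 0.3, 0.2, 0.1]
candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95

best = best_threshold(y_true, probs, candidates)
print("best threshold:", best)
print("F1 at best:", round(f1_at_threshold(y_true, probs, best), 3))
print("F1 at 0.50:", round(f1_at_threshold(y_true, probs, 0.5), 3))
```

In a real workflow the sweep should run on a validation split rather than the final test set, so that the chosen cutoff is not tuned to the data used for reporting.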
Confusion matrix of the model
I also tested a nonlinear alternative—a Random Forest classifier—to verify whether a more flexible model could extract additional signal. While Random Forest improved certain recall metrics and reinforced the importance of specific features, it did not outperform logistic regression in terms of interpretability or overall consistency. This confirmed that logistic regression, once optimized, was the better tool for generating insights about customer behavior and characteristics.
Ultimately, the final modeling workflow combined logistic regression with SMOTE oversampling and a tuned probability threshold. This approach delivered the best balance of interpretability, performance, and practical actionability. Most importantly, it produced stable, transparent insights into which demographic and behavioral factors correlate with higher conversion likelihood—directly supporting the client’s goal of identifying their most effective customer avatar.
In this case, there was no need to operationalize or deploy the model, since the project used a static dataset of past customer data. The goal was to analyze historical data to inform future marketing efforts rather than to monitor ongoing customer behavior.
This section contains several visualizations illustrating the results of the experiment.
Graph illustrates the features that have the biggest impact on conversion
Graph shows the conversion rate by age group
Graph shows the conversion rate by gender
Graph shows the overlap of engagement and conversion across the dataset
Graph shows the overlap of income and conversion across the dataset
Graph shows the overlap of loyalty and conversion across the dataset
The results of the experiment showed that engagement score and loyalty score have the largest direct effects on conversion. Engagement score had a coefficient of 0.37, indicating that more engaged visitors are more likely to convert. Loyalty score had a coefficient of 0.29, indicating that loyal, returning customers have a high conversion probability. Marketing responsiveness had a coefficient of -0.07, which suggests that over-targeted or ad-fatigued customers may convert less; this could reflect lower trust or oversaturation with these marketing methods. Among the age groups, 35-44 had the largest coefficient (0.07), making mid-career adults the highest-converting age demographic. Income had a coefficient of 0.06, indicating that higher-income customers are somewhat more likely to convert. Social promoter had a coefficient of -0.05, which suggests that the people who share are not necessarily the people who ultimately buy. The male gender had a coefficient of 0.04, a small tilt toward men converting at a higher rate than women. Older age groups had mixed negative coefficients, which suggests that conversions decline slightly above age 45.
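One way to read these coefficients is to convert them to odds ratios: in a logistic regression, a one-unit increase in a feature multiplies the odds of conversion by e raised to the coefficient, holding the other features fixed. The sketch below applies that to the coefficients reported above. The dictionary keys are illustrative labels rather than the dataset's actual column names, and what "one unit" means depends on how each feature was scaled, which is an assumption here.

```python
import math

# Coefficients as reported in the results; keys are illustrative labels.
coefficients = {
    "engagement_score": 0.37,
    "loyalty_score": 0.29,
    "marketing_responsiveness": -0.07,
    "age_group_35_44": 0.07,
    "income": 0.06,
    "social_promoter": -0.05,
    "gender_male": 0.04,
}

# exp(coefficient) is the multiplicative change in the odds of conversion
# for a one-unit increase in the feature, all else equal.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, coef in sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:<26} coef={coef:+.2f}  odds ratio={odds_ratios[name]:.2f}")
```

An odds ratio above 1 means the feature is associated with higher odds of converting, and below 1 with lower odds; values close to 1, such as those for income or gender, indicate weak effects.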
Thus, the results of the experiment show the highest converting avatar of the brand.
Avatar: Highly engaged, loyal repeat customer, typically male, aged 35-44, with above average income. They are not necessarily heavy social sharers or frequent responders to ads or email blasts, but they are active on the website and already familiar with the brand.
Note: Those with similar traits but in the 25-34 age range could potentially represent an emerging buyer market.
Avatar least likely to convert: Older visitors (45+), social sharers, or overly-marketed-to audiences who engage superficially but don’t purchase.
Practical applications and ideas: Retarget by engagement. Focusing ads or emails on users with high engagement scores is likely to yield more conversions than targeting cold prospects. Consider building or expanding loyalty programs. Repeat customers who engage in loyalty programs tend to have high conversion rates. Optimize content, visuals, and ads for the 35-44 male segment. Language, presentation, and offers should feel relevant to this mid-career premium demographic. Reduce ad fatigue. Try limiting repetitive marketing touchpoints, as responsive customers might already be overserved. Place less emphasis on social metrics. Shares on social media don’t equate to sales; engagement on the website is more predictive of conversion.
The model’s accuracy appears high at ~88%, but given the imbalanced class distribution and a ROC-AUC of ≈0.63, accuracy overstates its performance. The model is good at identifying customers who will convert and much less reliable at identifying those who won’t.
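The gap between accuracy and ROC-AUC can be illustrated with a small sketch. ROC-AUC is the probability that a randomly chosen converter receives a higher score than a randomly chosen non-converter, so it measures how well the model ranks customers regardless of class balance. The labels and scores below are hypothetical and chosen to exaggerate the effect.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy_at(y_true, scores, threshold=0.5):
    """Accuracy when predicting 1 for every score at or above the threshold."""
    preds = [int(s >= threshold) for s in scores]
    return sum(t == p for t, p in zip(y_true, preds)) / len(y_true)

# Hypothetical data: 8 converters and 2 non-converters with overlapping scores.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.8, 0.7, 0.7, 0.6, 0.6, 0.55, 0.75, 0.65]

print("accuracy:", accuracy_at(y_true, scores))
print("ROC-AUC:", roc_auc(y_true, scores))
```

In this toy case every customer is predicted to convert, so accuracy simply equals the share of converters, while the ROC-AUC shows that the scores do not separate the two groups at all.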
The behavioral traits of the highest-converting customers are frequent interaction with the website (visits, clicks, pages per visit), familiarity with the brand (as shown by past purchases and engagement with loyalty programs), and self-motivation to purchase (they are less influenced by email campaigns or external promotions). Marketing efforts should target this demographic by emphasizing loyalty programs, personalized on-site experiences, and premium value propositions tailored to them.