Clustering Data and Exact Matches

One record to rule them all…

In my full-time job I’m both a Customer Insight Manager and a data developer.

We have recently developed a “Single Customer View” (SCV), or “Single Customer Record”.

Our customer data is UK business data. It is possible for businesses to have records within our two different billing systems with variations of their name, different locations, different people responsible for paying our business, different accounts etc. So despite having a master Customer reference in each billing system, the reality is that business may be represented across multiples of these.

This makes marketing and analytics hard.

At it’s heart, our SCV clusters customer records from two different systems, and produces one master record that represents the customer. This produces a table like so:

SCV grouped data

Now our SCV uses several rules of matching data, in combination with each other.

The process can basically be described as:

  • Key matching fields are converted into match codes at various sensitivities.
  • Match codes are then joined together to create clusters.
  • Several passes are made of the matching rules, so super-clusters can be created.

With this top-level SCV we can now refer to a single customer entity, despite it’s data being spread across different systems and in multiple top-level records.

That’s great for marketing, but can we use for X… ?

Operationally, we have a scenario where our spread of customer data over multiple business IDs causes us a problem. An online system can only do things on a per-business ID basis and so customers cannot administer their whole online accounts with us. They end up needing multiple logins.

We have a merging facility, but this requires an awful lot of human effort to do all the necessary checks.

I was asked to quickly estimate how many records linked together by our SCV we could just merge automatically, based on them having a 100% match across all key fields. This included things such as the business name, their HQ address’s post code.

Getting a view of this quickly, could easily have been a pain. The data looks a little like this:

SCV grouped with key fields

In this example, SCV 1 has an exact match across both records. SCV 2 does not.

In order to make this assessment, I brought the data into SAS and started processing it.

The Business Name and Post Code fields were ran through an MD5 checksum routine, and converted to numbers. In SAS base, I used this code to do this.

input(put(md5(BusinessName),$hex12.),hex12

The data now looks a little like this:

SCV data MD5 numeric checksums

Now it’s a matter of grouping the data by the SCV Cluster ID, and taking the MIN/MAX values of each of the numeric fields, to assess whether the cluster has a 100% similarity across it’s source records.

In SAS this is like so:

PROC SQL;
CREATE TABLE SCVAssess1 AS
SELECT 
SCVClusterID
,MIN(BusinessNameNumeric) as MinBusinessNameNum
,MAX(BusinessNameNumeric) as MaxBusinessNameNum
,MIN(PostCodeNumeric) as MinPostCodeNum
,MAX(PostCodeNumeric) as MaxPostCodeNum
FROM SCVNumericFields;
QUIT;

/* Join the Max/Min values back on to source data */ 
PROC SQL;
CREATE TABLE SCVAccess2 AS
SELECT DISTINCT
A.*
,B.MinBusinessNameNum
,B.MaxBusinessNameNum
,B.MinPostCodeNum
,B.MaxPostCodeNum
FROM SCVNumericFields as A
LEFT JOIN SCVAccess1 as B ON A.SCVClusterID = B.SCVClusterID;
QUIT;

/* Create assessment fields */
DATA SCVAccess3;
SET SCVAccess2;
format BusinessNameSame 1.; BusinessNameSame = 0;
format PostCodeSame 1.; PostCodeSame = 0;
format AllSame 1.; AllSame = 0;
IF MinBusinessNameNum = MaxBusinessNameNum THEN BusinessNameSame = 1;
IF MinPostCodeNum = MaxPostCodeNum THEN PostCodeSame = 1;
IF BusinessNameSame = 1 AND PostCodeSame = 1 THEN AllSame = 1;
RUN;

Now I can filter my data based on whether the whole cluster is the same or not. I could even check for partial similarities of individual fields, using the fields BusinessNameSame or PostCodeNameSame separately.

The data looks like so:

Final SCV data with filtering fields

You can see here that the first cluster shows it’s eligible for merging all the records, and the second does not.

Finishing Up

I don’t think any of this is really rocket science. A lot of the manipulation here is data-dev  / SAS base 101 really.

What this gave me though was the ability to answer a business question pretty quickly, and give some good estimates back to senior management.

It turns out about 20% of records could be automatically merged with no real detriment to the customer. This is a sizeable win. These customers will see a true benefit when they go to use our services that rely on this data being all together.

Although largely unnecessary, a human is still in the loop for assessing the data before it goes through bulk processing.

I don’t really intend to write so much about my data dev work here, but this was something that happened that was fresh in my mind and it seemed worth writing up!