Our 10-Step Process to Clean, Standardize and Augment Data

Okay, we get it. You’re one of those people who get excited about data. You want to know how we take millions of pieces of raw data and turn them into valuable, searchable sales leads.

We LOVE to talk about information, too, so get ready to “geek out” for a bit while we explain the what and how of the work our data analysts and database architects do to give you the best possible results, user experience and product.

Step 1: 

On the first business day of the month, our data analysts download the latest Form 5500 data sets from the DOL. We download the last 5 years of data because companies are continually filing new Form 5500s, amending existing filings and sometimes submitting past-due 5500s. We also check whether new columns have been added to the data sets, which happens from time to time.
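For the curious, here is a minimal sketch of how a new-column check could look in Python. It is an illustration only, not the tooling described above: the function name `new_columns` and the sample column names are hypothetical, and the file is assumed to be CSV text with a header row.

```python
import csv
import io


def new_columns(known_columns, csv_text):
    """Return header names found in csv_text that are not in known_columns.

    Comparison ignores case and surrounding whitespace; the order of the
    returned names follows the order of the file's header row.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty file -> no header -> no new columns
    known = {name.strip().upper() for name in known_columns}
    return [name.strip() for name in header if name.strip().upper() not in known]


# Hypothetical column names, used only for illustration.
known = ["FILING_ID", "PLAN_NAME", "SPONSOR_NAME"]
sample = "FILING_ID,PLAN_NAME,SPONSOR_NAME,NEW_FIELD\n1,Plan A,Acme Inc,x\n"
print(new_columns(known, sample))  # ['NEW_FIELD']
```

A check like this can run before the import step, so an unexpected column is flagged for review instead of being silently dropped.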

Step 2: 

During the monthly download process, we typically download over 40 raw database files from the DOL. Our data architects then import each of these raw files using a special program that our software development team created to process each file.

 

Step 3: 

It is time to scrub and clean the raw data we just imported. We run a series of SQL scripts (a set of SQL commands saved as a file) to correct and filter the data.
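To give a feel for what a cleanup script can look like, here is a small sketch that runs a saved set of SQL commands against an in-memory SQLite database from Python. The table name `raw_filings`, its columns and the three cleanup rules are assumptions made for the example; the real scripts and database engine are not described here.

```python
import sqlite3

# A hypothetical cleanup script: trim stray whitespace, normalize state
# codes to upper case, and filter out rows with no sponsor name.
CLEANUP_SCRIPT = """
UPDATE raw_filings SET sponsor_name = TRIM(sponsor_name);
UPDATE raw_filings SET state = UPPER(TRIM(state));
DELETE FROM raw_filings WHERE sponsor_name IS NULL OR sponsor_name = '';
"""


def clean(conn):
    """Run the cleanup script against an open connection."""
    conn.executescript(CLEANUP_SCRIPT)
    conn.commit()


def demo():
    """Load three sample rows, clean them, and return what survives."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_filings (filing_id TEXT, sponsor_name TEXT, state TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_filings VALUES (?, ?, ?)",
        [("1", "  Acme Inc ", " ny"), ("2", "", "CA"), ("3", None, "tx")],
    )
    clean(conn)
    rows = conn.execute(
        "SELECT filing_id, sponsor_name, state FROM raw_filings ORDER BY filing_id"
    ).fetchall()
    conn.close()
    return rows


print(demo())  # [('1', 'Acme Inc', 'NY')]
```

Keeping the rules in a script file means the same corrections run in the same order every month.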

Step 4: 

Now that we’ve cleaned up the raw data, it is time to start analyzing it. We have a program that identifies new filings, existing filings, modified filings and, finally, deleted filings. We then merge all of that data into our existing databases.
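The idea of sorting filings into new, existing, modified and deleted can be sketched in a few lines of Python. This is a simplified illustration, not the program mentioned above: it assumes each filing has a unique ID and that a filing counts as modified when any of its fields differ. The function name and record layout are hypothetical.

```python
def classify_filings(existing, incoming):
    """Compare two snapshots of filings keyed by filing ID.

    existing: dict of filing_id -> record already in the database
    incoming: dict of filing_id -> record from the latest download
    Returns a dict with sorted ID lists for each category.
    """
    result = {"new": [], "modified": [], "unchanged": [], "deleted": []}
    for filing_id, record in incoming.items():
        if filing_id not in existing:
            result["new"].append(filing_id)
        elif existing[filing_id] != record:
            result["modified"].append(filing_id)
        else:
            result["unchanged"].append(filing_id)
    # Anything in the database that is missing from the download was deleted.
    result["deleted"] = [fid for fid in existing if fid not in incoming]
    for ids in result.values():
        ids.sort()
    return result


# Hypothetical records, used only for illustration.
existing = {"A1": {"plan": "Plan A"}, "B2": {"plan": "Plan B"}, "C3": {"plan": "Plan C"}}
incoming = {"A1": {"plan": "Plan A"}, "B2": {"plan": "Plan B (amended)"}, "D4": {"plan": "Plan D"}}
print(classify_filings(existing, incoming))
```

In this example A1 is unchanged, B2 is modified, D4 is new and C3 is deleted, which tells the merge step whether to insert, update, skip or remove each record.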

Step 5:

It’s time to clean up the data and start transforming it into information. We have created a program we call ‘Matchup’. Matchup employs state-of-the-art fuzzy matching algorithms that do three things: identify records in our databases that belong to the same business (like the example we showed you of Priceline with two different names); consolidate insurance carriers that appear under different spellings or abbreviations; and recognize all of the different ways the BOR (broker of record) likes to submit details. It also aggregates multiple records, such as BOR data, into a single record. We have built many matching rules into Matchup that help us identify duplicate information, from the obvious to the not-so-obvious, so we can prune and merge the database.
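For a taste of what fuzzy matching means in practice, here is a minimal sketch using Python’s standard `difflib` module. It is a stand-in technique chosen for illustration, not the Matchup program or its rules: the suffix list, the 0.85 threshold and the function names are all assumptions, and the company names are made up.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing names.
SUFFIXES = {"inc", "incorporated", "llc", "corp", "corporation", "co", "company", "ltd"}


def normalize(name):
    """Lower-case a name, strip punctuation, and drop legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(word for word in words if word not in SUFFIXES)


def similarity(a, b):
    """Return a score from 0.0 (nothing in common) to 1.0 (identical)."""
    norm_a, norm_b = normalize(a), normalize(b)
    if not norm_a or not norm_b:
        return 0.0  # nothing meaningful left to compare
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def is_probable_match(a, b, threshold=0.85):
    """Treat two names as the same business when they score above threshold."""
    return similarity(a, b) >= threshold


print(is_probable_match("Acme Widgets, Inc.", "ACME WIDGETS INCORPORATED"))  # True
print(is_probable_match("Acme Widgts", "Acme Widgets Inc"))                  # True
print(is_probable_match("Acme Widgets Inc", "Zenith Power LLC"))             # False
```

Normalizing before scoring handles the obvious cases (case, punctuation, "Inc" versus "Incorporated"), and the similarity score catches the not-so-obvious ones such as typos. A production matcher would typically combine several rules and fields, such as address and phone, rather than rely on the name alone.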

Step 6:

We now want to standardize and validate company names, addresses, emails and phone numbers. We use a third-party service that we consider best in class. We pass it the information we have from the filing, and the service returns standardized and validated information.


The service checks data quality by comparing company name, address, phone and email information against multi-sourced datasets. It also enriches the data by updating addresses and adding latitude/longitude coordinates and comprehensive demographics. Basically, the service is adding more awesomeness to our database.

 

Step 7: 

We need to compute, or pre-calculate, a lot of the most useful search fields. Examples include the ‘total revenue’ (commission + fees paid) that the BOR made, and which benefits are fully insured and which are self-insured. We pre-calculate dozens of data points to enrich your search experience.
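The ‘total revenue’ field is a good example of a pre-calculated value: it is simply commission plus fees, computed once and stored so searches do not have to add the numbers every time. Here is a minimal sketch in Python. The field names `commissions`, `fees` and `total_revenue` are hypothetical, and missing amounts are assumed to count as zero.

```python
from decimal import Decimal


def total_revenue(commissions, fees):
    """Add commissions and fees using exact decimal arithmetic."""
    return Decimal(str(commissions)) + Decimal(str(fees))


def add_computed_fields(record):
    """Return a copy of the record with a pre-calculated total_revenue field."""
    enriched = dict(record)
    enriched["total_revenue"] = total_revenue(
        record.get("commissions") or 0,  # missing or None counts as zero
        record.get("fees") or 0,
    )
    return enriched


# Hypothetical broker record, used only for illustration.
row = {"broker": "Example Brokerage", "commissions": "1200.50", "fees": 300}
print(add_computed_fields(row)["total_revenue"])  # 1500.50
```

`Decimal` is used instead of floating point so that dollar amounts add up exactly.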

Step 8: 

Time for our QA team to swing into action. They run through a series of quality assurance tests to confirm that the new data files are all correct and cleaned. Once our QA team gives the new data the thumbs up, our production team pushes all the new information into 5500Leads.

And, Here’s Where We Go the Extra Mile for YOU

At this point we’ve completed 3 of the 4 things that make data good and usable. We have:

1. Data Cleansing – we cleaned up all the raw DOL data
2. Data Hygiene – we improved the data by fixing case and abbreviation issues, validating contact details and fixing accuracy and quality issues
3. Data Standardization – we have standardized, normalized and matched up the data

This is where our competitors stop … but not us. These next 2 steps are what really separate us from the competition.

Step 9:

Now it is time to enrich and enhance the company information. We leverage machine learning and primary sources of information to extract additional company details. We enhance the data by adding a company description, type of industry, specialties, website address information and more. When a company is publicly traded, we add stock information and company news.
Step 10: 

If you thought the enhanced Company information was good, wait until you learn about all the Contacts we have. By scouring a Company’s website, press releases and other primary sources of information, we find every person who works at that company, add them as a ‘Company Contact’ and create a profile record for them.

While scouring the web we often find their email address (over 11 million email addresses to date). Using our Matchup program, we match the person to their email and then to their LinkedIn profile. We also evaluate the person’s job title from press releases, regulatory filings and other primary sources, and for some job titles we mark the person as a ‘Primary Contact’. These are people with C-level or HR-related job titles, basically all the people that you want to pitch your services to. We have a ‘search within’ feature for the Company Contacts database so you can find a specific person or a specific job title. We are continuously enhancing each and every person’s ‘Contacts Profile’ within the 5500Leads program.
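As a rough illustration of flagging a ‘Primary Contact’ from a job title, here is a small Python sketch. The keyword patterns are assumptions chosen for the example, not the rules actually used in 5500Leads, and real job titles are messier than this handles.

```python
import re

# Hypothetical patterns: "Chief ..." titles, three-letter C-suite
# abbreviations such as CEO or CFO, and HR-related wording.
C_LEVEL = re.compile(r"\bchief\b|\bc[a-z]o\b", re.IGNORECASE)
HR_RELATED = re.compile(r"\bhr\b|\bhuman resources\b", re.IGNORECASE)


def is_primary_contact(job_title):
    """Return True when a job title looks C-level or HR-related."""
    return bool(C_LEVEL.search(job_title) or HR_RELATED.search(job_title))


for title in ["Chief Financial Officer", "CEO", "VP, Human Resources",
              "HR Manager", "Software Engineer", "Project Coordinator"]:
    print(title, "->", is_primary_contact(title))
```

The word-boundary markers (`\b`) keep the patterns from firing inside longer words, so "Coordinator" is not mistaken for "COO".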

 

Get The Data Partner You Deserve

So why did we take you through all this? Because we care about getting it right … and delivering what we promise, and that separates us from the competition. We’ve just shown you what it takes to transform raw DOL data into actionable, searchable information. It’s a lot of work … and we do it EVERY SINGLE MONTH. Get the data partner you deserve … 5500Leads. Your business is too valuable to settle for less.

                                                                 

 
Call Us: 800-552-8211