De-duping Software?

Latest post 01-28-2009 11:23 AM by jciccone. 6 replies.

De-duping Software?

06-25-2007 8:03 AM

Does anyone have a "tried and true" program that they use for deduping Excel and Access files?
Thanks for your response.

RE: De-duping Software?

06-25-2007 10:10 AM

Whatcha mean by "deduping"?

RE: De-duping Software?

06-25-2007 10:46 AM

By deduping I mean taking out duplicate entries from a combined database. Let's say I have three different member or contact databases/spreadsheets and I want to combine them. A number of the same entries appear in all three spreadsheets, though. How could I consolidate the files and clear out the duplicates?

RE: De-duping Software?

06-25-2007 11:08 AM

If this is a one-time project (and you are in the US), look at postal mailing list standardization and dedupe software, such as that from http://www.semaphorecorp.com/

For $99.00 you get a two-month license to standardize your addresses so that duplicate finding will be more effective.

This would be an all-manual process: creating DBF databases to import, running the software against them, and then importing your corrected data back into your database.

I've done it on our database a couple of times and need to do it again. It improves your address quality and lets duplicate searches find the same address (like 123 Main Street, 123Main, and 123 Main St, which look like separate addresses to conventional deduping).
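To illustrate why standardizing first helps, here is a rough Python sketch that reduces those three spellings to one comparison key. It is not what the commercial software does: the `SUFFIXES` table and the `normalize_address` function are made up for this example, and real postal standardization covers far more cases.

```python
import re

# Hypothetical, deliberately tiny abbreviation table; the real USPS list is much longer.
SUFFIXES = {"street": "st", "avenue": "ave", "road": "rd", "drive": "dr"}

def normalize_address(raw: str) -> str:
    """Return a crude comparison key for a street address."""
    text = raw.lower()
    # Put a space between a house number and a street name run together ("123Main").
    text = re.sub(r"(\d)([a-z])", r"\1 \2", text)
    # Replace punctuation with spaces, then split on whitespace.
    words = re.sub(r"[^a-z0-9\s]", " ", text).split()
    # Abbreviate known suffixes so "street" and "st" look the same.
    words = [SUFFIXES.get(w, w) for w in words]
    # Drop a trailing suffix so "123 Main" and "123 Main St" compare equal.
    if len(words) > 1 and words[-1] in SUFFIXES.values():
        words = words[:-1]
    return " ".join(words)

for sample in ["123 Main Street", "123Main", "123 Main St"]:
    print(repr(sample), "->", repr(normalize_address(sample)))
```

Because the key drops the suffix, it would also treat "123 Main St" and "123 Main Ave" as the same, and it splits ordinals like "1st". Use something like this only to flag candidates for a person to review, not to delete records automatically.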

Dave

RE: De-duping Software?

06-25-2007 3:23 PM

I doubt there is a program that will take several Excel and Access files and dedupe them the way you are asking for.

I won't get into all of the technical aspects, but I will give you a quick outline on how I think you can solve what you are asking.

1) Decide which program you are going to work with the data in. You can use Excel if you have fewer than 64k records total. Access is really the best thing to work in, but not if you don't have the experience for something like this.

2) Make sure all of the files have the same standard layout. If you have extra fields for some, make sure they are at the end and that your master "template" file includes the fields needed.

3) Make sure each original file has the data formatted in the same way. Set a standard (USPS guidelines are recommended) so that it will later be easy to find the dups.

4) Add one field that you can use to identify which file the data came from, a "source code".

5) Export the data into a format like ASCII delimited.

6) Import the data into your template file. At this point replace all the fields for file identification (source) with the proper information. It may be easier to do this in the original file before you export.

7) Import the remaining files.

8) Sort the file on address, phone, lastname+firstname, or other fields and each time go through the file looking for duplicates.

Another option would be to take this master file to a mail house or a place that does data processing. They can do NCOA/CASS address correction and standardization if you need it, and deduplication with some commercial software.

9) Export the deduped data using your source code to ASCII delimited and re-import into the original files, after you have made backups.

Or, see if you can use the combined database in a shared manner and avoid this kind of problem in the future. :-)
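If you would rather script it than sort and eyeball, steps 4 through 8 could be sketched in Python roughly as follows. This is only an illustration under assumptions: the field names (`firstname`, `lastname`), the source codes, and the sample records are all made up, and the match key is just the lastname+firstname idea from step 8.

```python
from collections import defaultdict

def combine(sources):
    """Merge records from several files, tagging each with its source code (step 4)."""
    master = []
    for source_code, records in sources.items():
        for record in records:
            master.append({**record, "source": source_code})
    return master

def match_key(record):
    """Key for step 8: lastname+firstname, ignoring case and surrounding spaces."""
    return (record["lastname"].strip().lower(), record["firstname"].strip().lower())

def find_duplicates(master):
    """Group records by match key and return only the groups with more than one record."""
    groups = defaultdict(list)
    for record in master:
        groups[match_key(record)].append(record)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}

# Made-up sample data standing in for three imported files.
sources = {
    "MEMBERS": [{"firstname": "Ann", "lastname": "Lee"}, {"firstname": "Bob", "lastname": "Ray"}],
    "DONORS": [{"firstname": "ann ", "lastname": "LEE"}],
    "EVENTS": [{"firstname": "Cy", "lastname": "Fox"}],
}
master = combine(sources)
dupes = find_duplicates(master)
for key, recs in dupes.items():
    print(key, [r["source"] for r in recs])
```

Keeping the source code on every record is what makes step 9 possible: once you decide which copy to keep, you still know which original file each surviving record belongs to.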

RE: De-duping Software?

06-26-2007 4:26 PM

Fixing the problem is one thing, and it can be extremely "painful". Preventing it in the future is another. I recommend that you give a lot of serious thought to that; otherwise you will keep repeating the "pain" on a regular basis!

If you can get your data into an Access DB, then there are a number of different query-based ways of finding duplicate data. There are no particularly easy ways in Excel that I am aware of.
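One common query-based approach is to group on the fields that define "the same person" and keep only the groups with more than one row. The sketch below runs that query from Python against an in-memory SQLite database as a stand-in for Access. The `contacts` table, its columns, and the sample rows are assumptions for illustration, and the SQL may need small adjustments for Access.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for an Access table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (firstname TEXT, lastname TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Ann", "Lee", "555-0100"),
        ("Ann", "Lee", "555-0100"),  # exact duplicate of the row above
        ("Bob", "Ray", "555-0101"),
    ],
)

# Group on the matching fields and keep groups that occur more than once.
FIND_DUPES = """
    SELECT firstname, lastname, phone, COUNT(*) AS copies
    FROM contacts
    GROUP BY firstname, lastname, phone
    HAVING COUNT(*) > 1
"""
dupes = conn.execute(FIND_DUPES).fetchall()
print(dupes)
```

A query like this only catches rows whose grouped fields match exactly, so it works best after the data has been standardized.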

Re: De-duping Software?

01-28-2009 11:23 AM

There is a great product called DataMatch, made by a company called Data Ladder, that solves your problem.

www.dataladder.com

We use it all the time when we have multiple data sources (Excel, Access, MySQL, etc.) and need to combine/deduplicate them.

Very customizable, and the best software we tried. We deduplicated our first data set within 15 minutes of getting the free trial.

-JC

 
