I’ve got a data analysis project.
In an excel (.xlsx) file, I have 2 sets of data, each set in a separate tab of the file:
Data set 1. “DB Units” contains 41,786 unique people names, that’s the client base of a Business chain, and there is a field (column) that identifies to which unit of the chain each client belongs.
Data set 2. “DB Vendor” contains 2,022 people names, they are visitors from a vendor who may have visited one or more units of the chain.
This Project consists in identifying which of the vendor visitors are already within the client base of the chain, and to which unit those found belong.
Be aware of the following:
- Both data sets come from different sources and the name of the same person may not be written identically. That is the challenge of this project; we have already identified the exact matches, but odds are that there are more matching visitors that have not yet been identified since one or more characters may differ between sets.
- The names are in Spanish and some contain the Spanish character “ñ”. We have taken out accents (á, é, í, ó, ú) from both sets.
- We have taken out special characters and double or triple spaces, leaving a single space between words on both sets.
- Complete names are usually formed by first names (first, middle and probably a third name) and 2 last names. Examples:
- Maria Antonieta Jimenez Martinez (Maria and Antonieta are first and middle names, Jimenez Martinez are both last names)
- Juan Roberto Lucas Vazquez Ramos (Juan Roberto Lucas are all names, while Vazquez Ramos are both last names)
We need someone who can go further than a vlookup search and can identify variations of the same name. We need you to identify and tag different types of matches:
- Type 1: Exact matches
- Type 2: High certainty matches: in which one or two letters may be missing from a typo but the rest of the name is the same
- Type 3: Medium certainty matches:
- One of two names may be omitted or abbreviated
E.g. Maria Rosa Jimenez Juarez vs. Ma. Rosa Jimenez Juarez, or Juan E. Lopez Vazquez vs. Juan Enrique Lopez Vazquez, or Juan Lopez Vazquez
- One last name may be missing. E.g. Juan Jose Rodriguez vs. Juan Jose Rodriguez Villanueva
- Any combination of the above that could lead to a match but there is not 100% certainty.
The end product is the list on the “DB Vendor” tab, having filled two fields for each match:
1. Match type
2. The corresponding “Consecutive unit” value for the Name Vendor that matched. (See excel file).