My first "real-life" artificial intelligence project

Recently I've been reading and watching a lot about AI. I did some trials and had a lot of fun with it. So, I was looking forward to having a real-life project to try out my newly acquired skills. Due to some internal changes at work, we got a list of users without user IDs and we had to match those users to Active Directory accounts.

The "fun" part of the whole thing was that there is no easy way to match the list of users to the list of users in the Active Directory. To make things even more interesting, the whole matching had to be done in 2 days for about 3 000 users.

The Challenge

We had to face a couple of challenges:

  • 2 days to complete the entire matching job for about 3 000 names
  • Lots of misspelled names in all systems. E.g. there might be a John Smith in one system, but only a John Smythe in the list. The same goes for writing double-names with dashes, spaces, etc.
  • There might be people with identical names, and our list gave us no way to tell which of them to match, at least not algorithmically

The Solution

My idea was to perform the matching based on text similarity. I was hoping that this would reduce the number of names that needed to be checked manually to a handful.

Text similarity is a tricky thing. There are several algorithms to choose from. The first one I tried was cosine similarity, using the character frequency of each name as the vector. However, the method I chose did not consider the order of the letters in the names, so I got lots of false matches.
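
To illustrate why this approach ignores letter order, here is a minimal sketch of cosine similarity over character frequencies. The function name and the lower-casing step are my own assumptions for illustration, not the code from the project.

```python
from collections import Counter
from math import sqrt


def char_cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of the character-frequency vectors of two strings."""
    # Each string becomes a bag of characters, so letter order is lost here
    va, vb = Counter(a.lower()), Counter(b.lower())
    dot = sum(va[ch] * vb[ch] for ch in va)
    norm_a = sqrt(sum(n * n for n in va.values()))
    norm_b = sqrt(sum(n * n for n in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# "Brian" and "Brain" contain exactly the same letters, so their vectors
# are identical and the score is (up to rounding) a perfect 1.0
score_swapped = char_cosine_similarity("Brian", "Brain")
score_typo = char_cosine_similarity("John Smith", "John Smythe")
```

Two different names made of the same letters are indistinguishable to this measure, which is the kind of false match described above.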

As I dug deeper, I came across the Ratcliff-Obershelp algorithm. It defines the similarity of two strings as twice the number of matching characters divided by the total number of characters in both strings. Running this algorithm on a test batch of 300 names, I found a false match ratio of 4.3 %.
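
The formula above can be tried out with Python's standard library. The sketch below uses `difflib.SequenceMatcher`, whose `ratio()` is computed as 2*M/T (M matching characters, T total characters) and whose matching strategy is closely related to Ratcliff-Obershelp. Whether the project used `difflib` or another implementation is an assumption on my part, as is the lower-casing.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: twice the matched characters over total length."""
    # First argument is the optional "junk" filter; None disables it
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


identical = name_similarity("John Smith", "John Smith")
misspelled = name_similarity("John Smith", "John Smythe")
reordered = name_similarity("Brian", "Brain")
```

Unlike a pure character count, this measure is sensitive to order: "Brian" and "Brain" no longer score as identical, while a small misspelling such as "Smythe" still scores high.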

My colleague proposed speeding up the matching process by skipping the similarity calculation for names that are 100 % matches in the first place. Doing so let us shave 27 % off the matching time.
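
A rough sketch of that shortcut, assuming the directory names are also kept in a set for fast lookups. `best_match`, `directory` and `lookup` are hypothetical names, and `difflib` stands in for whatever similarity function the tool used.

```python
from difflib import SequenceMatcher


def best_match(name, directory_names, directory_lookup):
    """Return (candidate, score) for one name, skipping scoring on exact hits."""
    # Cheap set lookup first: an exact hit needs no similarity calculation
    if name in directory_lookup:
        return name, 1.0
    best, best_score = None, 0.0
    for candidate in directory_names:
        score = SequenceMatcher(None, name, candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


directory = ["John Smith", "Jane Doe"]
lookup = set(directory)

exact = best_match("Jane Doe", directory, lookup)      # no scoring needed
fuzzy = best_match("John Smythe", directory, lookup)   # falls back to scoring
```

The saving comes from the fact that a set lookup is far cheaper than comparing one name against every directory entry.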

While going through the list, we noticed that some Active Directory users were matched to several people, even though they had already received a 100 % match. To eliminate this, we decided to remove a user from the list of Active Directory users as soon as a match was found.
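
One way to sketch this is a shrinking pool of directory names. The two-pass order (exact matches first, fuzzy matching afterwards) and the choice to remove only exact matches from the pool are my assumptions; the function and variable names are hypothetical.

```python
from difflib import SequenceMatcher


def match_without_reuse(names, directory_names):
    """Return a list of (name, candidate, score) tuples."""
    remaining = list(directory_names)  # directory entries not yet claimed
    results, unmatched = [], []

    # Pass 1: an exact match claims its directory entry and removes it
    for name in names:
        if name in remaining:
            remaining.remove(name)
            results.append((name, name, 1.0))
        else:
            unmatched.append(name)

    # Pass 2: fuzzy matching only sees the entries that are still unclaimed
    for name in unmatched:
        best, best_score = None, 0.0
        for candidate in remaining:
            score = SequenceMatcher(None, name, candidate).ratio()
            if score > best_score:
                best, best_score = candidate, score
        results.append((name, best, best_score))
    return results


results = match_without_reuse(
    ["John Smith", "John Smythe"],
    ["John Smith", "Jon Smythe"],
)
```

In this example "John Smith" is claimed by its exact match, so "John Smythe" can only be paired with the remaining entry "Jon Smythe".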

The test results were already quite satisfactory, but there was still one issue left. Some people in our name list had their names written as "Smith, John" instead of "John Smith". So, I added a dictionary of conditions with lambda functions that reverse the name if a condition is met.
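
Such a rule table could look roughly like this. The rule label, the exact condition and the `normalize_name` helper are hypothetical; the original dictionary may well have contained other conditions.

```python
# Hypothetical rule table: label -> (condition, transform), both lambdas
NORMALIZATION_RULES = {
    "surname_comma_given": (
        # Condition: exactly one comma, as in "Smith, John"
        lambda name: name.count(",") == 1,
        # Transform: swap the two parts and join them with a space
        lambda name: " ".join(part.strip() for part in reversed(name.split(","))),
    ),
}


def normalize_name(name: str) -> str:
    """Apply the first rule whose condition holds; otherwise return the name."""
    name = name.strip()
    for condition, transform in NORMALIZATION_RULES.values():
        if condition(name):
            return transform(name)
    return name


reversed_name = normalize_name("Smith, John")
untouched_name = normalize_name("John Smith")
```

Keeping the rules in a dictionary means a new spelling convention can be handled by adding one entry, without touching the matching loop.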

Finally, I wanted to give another method a try as well. So, I added the Levenshtein distance algorithm to the tool to see if I got different (and hopefully even better) matches. This wasn't the case, so in the end, I decided to remove it from the final version.
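
For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. Below is a minimal dynamic-programming sketch; the normalisation into a 0 to 1 similarity score is my assumption, added only so the result is comparable with the other measure.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


distance = levenshtein("Smith", "Smythe")  # one substitution, one insertion
```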

The end result:

  • The whole matching task was completed within 1.5 days
  • We had a false match rate of about 4.1 % of the total list
  • We only had to review about 25 % of the entire list to find real matches

Gabor Schulz

In love with tech, especially with Python and Machine Learning
