Recently I've been reading and watching a lot about AI. I did some trials and had a lot of fun with it. So, I was looking forward to having a real-life project to try out my newly acquired skills. Due to some internal changes at work, we got a list of users without user IDs and we had to match those users to Active Directory accounts.
The "fun" part of the whole thing was that there was no easy way to match our list of users against the users in Active Directory. To make things even more interesting, the whole matching had to be done in 2 days for about 3 000 users.
We had to face a couple of challenges:
My idea was to perform the matching based on text similarity. I was hoping that this would reduce the number of names that need to be checked manually to a handful.
Text similarity is a tricky thing. There are several algorithms to choose from. The first one I tried was cosine similarity, using the character frequency of each name as the vector. However, the method I chose did not consider the order of the letters in the names, so I got lots of false matches.
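To illustrate why character-frequency vectors ignore letter order, here is a minimal sketch of that kind of cosine similarity. It is not the code from the original tool; the function name and the sample names are made up for illustration.

```python
import math
from collections import Counter


def char_cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of the character-frequency vectors of two strings."""
    # Each string becomes a vector of character counts (case-insensitive).
    va, vb = Counter(a.lower()), Counter(b.lower())
    # Dot product over the characters of the first string; a Counter
    # returns 0 for characters that do not occur in the second string.
    dot = sum(va[ch] * vb[ch] for ch in va)
    norm_a = math.sqrt(sum(n * n for n in va.values()))
    norm_b = math.sqrt(sum(n * n for n in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example: the two names are anagrams of each other, so their
# frequency vectors are identical and the score is (numerically) 1.0,
# even though they are clearly different people.
score = char_cosine_similarity("Brian Lee", "Blaine Er")
```

Because only the counts matter, any two names built from the same letters look identical to this measure, which is exactly where the false matches come from.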
As I dug deeper, I came across the Ratcliff-Obershelp algorithm. Here, the similarity of two strings is defined as twice the number of matching characters divided by the total number of characters in both strings. Running this algorithm on a test batch of 300 names, I found a false match ratio of 4.3 %.
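In Python, a close relative of this algorithm ships with the standard library: difflib.SequenceMatcher, whose ratio() method returns 2 * M / T, where M is the number of matching characters and T the total length of both strings. The sketch below assumes that class is a good enough stand-in; the helper names and the lower-casing step are my own choices for illustration, not necessarily what the original tool did.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 2 * matching characters / total characters."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name, candidates):
    """Return (candidate, score) for the most similar candidate."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = name_similarity(name, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Hypothetical example: a misspelled name against two directory entries.
match, score = best_match("Jon Smith", ["Jane Smyth", "John Smith"])
```

Unlike the character-frequency approach, this measure looks for common runs of characters, so two names made of the same letters in a different order no longer score as identical.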
My colleague proposed the idea that we should speed up the matching process by not calculating the similarity of names which are 100 % matches in the first place. By doing so, we could shave off 27 % of the matching time.
While going through the list, we noticed that some people from the Active Directory were matched to several people on our list, even though they had already been a 100 % match for someone else. To eliminate this, we decided to remove a person from the list of Active Directory users as soon as they were matched.
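Both optimizations, skipping the similarity calculation for exact matches and removing Active Directory users once they are matched, can be sketched roughly as follows. This is an assumed reconstruction: the two-pass structure, the function name and the sample data are hypothetical, and a real run would still need a manual check of the fuzzy matches.

```python
from difflib import SequenceMatcher


def match_users(names, ad_users):
    """Map each name to an AD name; every AD name is used at most once."""
    # A dict keeps insertion order and allows cheap removal of matched users.
    remaining = dict.fromkeys(ad_users)
    matches = {}
    unmatched = []

    # Pass 1: exact matches need no similarity calculation at all.
    for name in names:
        if name in remaining:
            matches[name] = name
            del remaining[name]
        else:
            unmatched.append(name)

    # Pass 2: fuzzy matching against the AD users that are still available.
    for name in unmatched:
        if not remaining:
            break
        best = max(
            remaining,
            key=lambda candidate: SequenceMatcher(None, name, candidate).ratio(),
        )
        matches[name] = best
        del remaining[best]

    return matches


# Hypothetical example: "John Smith" is taken by the exact match, so
# "Jon Smith" can only be paired with the remaining directory entry.
result = match_users(["John Smith", "Jon Smith"], ["John Smith", "Jonas Smyth"])
```

Running the exact matches first also means they cannot be "stolen" by a similar-looking name that happens to come earlier in the list.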
The test results were already quite satisfactory, but there was still one issue left. We had some people in our name list whose names were written like Smith, John instead of John Smith. So, I added a dictionary of conditions, implemented as lambda functions, to reverse the names whenever a condition was met.
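One way such a dictionary of conditions could look is shown below. The rule name, the pairing of a condition lambda with a transformation lambda, and the helper function are all assumptions for illustration; the original tool may have been structured differently.

```python
# Hypothetical rule table: each entry pairs a condition with a transformation.
NAME_RULES = {
    "comma_separated": (
        # Condition: exactly one comma, as in "Smith, John".
        lambda name: name.count(",") == 1,
        # Transformation: swap the two parts and join them with a space.
        lambda name: " ".join(part.strip() for part in reversed(name.split(","))),
    ),
}


def normalize_name(name: str) -> str:
    """Apply the first rule whose condition holds; otherwise just trim."""
    for condition, transform in NAME_RULES.values():
        if condition(name):
            return transform(name)
    return name.strip()


normalized = normalize_name("Smith, John")
```

Keeping the rules in a dictionary makes it easy to add further special cases later without touching the matching loop itself.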
Finally, I wanted to give another method a try as well. So, I added the Levenshtein distance algorithm to the tool to see if I got different (and hopefully even better) matches. This wasn't the case, so in the end, I decided to remove it from the final version.
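For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A minimal pure-Python sketch follows; the normalization into a 0 to 1 similarity score is my own assumption to make it comparable with the other measure, not something taken from the original tool.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop to save memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

As a quick sanity check, levenshtein("kitten", "sitting") gives 3: two substitutions and one insertion.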
The end result:
Today, I wanted to run conda update --all on my computer. This is normally a very straightforward process. This time, however, it wouldn't work.
I'm a huge fan of Jupyter Notebooks. Whenever I'm experimenting with something or have to do something quick and dirty, I always use notebooks.