Open Source to E-discovery
As a data scientist working in the E-discovery industry, I am glad to have many good tools for working with E-discovery data, e.g. Nuix for processing, Relativity for review and some analytics, and so on. However, the limitation of any hard-coded software is that it does not always work well when the data is messy and unorganized. Coupled with high license and server fees, those tools are not always a good option. So, is there anything free that you can rely on to perform all the core functions that those big tools provide? Well, the answer is yes and no. Yes, there are numerous open source tools that can do what you are looking for; and no, you still need to pay for a data scientist, just not for the software. To demonstrate an open source approach to E-discovery, I have decided to start blogging about this topic.
When it comes to data, Python is hands down a very good option (the other equivalent is R). Python has a very complete and robust standard library, plus many awesome data-related libraries such as NumPy, pandas, scikit-learn, TensorFlow, PyTorch and so on. It covers everything from data cleaning, data mining and data manipulation to advanced areas like machine learning and deep learning. As a Python guy, I will be using Python as my primary tool, but I might throw in C/C++, Java or C# when necessary.
For the first blog post, I will show how to use Python to analyze two documents and display the textual difference. You might have already seen something similar in Relativity: after you run textual near duplicate identification, you can compare two documents within a near duplicate set. Python has no such limitation; you can compare any two documents. In the Python world, there are two popular options:
1) the difflib library
2) the Google Diff Match Patch library
The latter is very likely what Relativity uses ( : However, what I will show today is option 1. difflib is part of the Python 3 standard library, so if you have installed Python 3.5+, you already have it. difflib is already very powerful.
Let me introduce difflib a little bit. Its core is the SequenceMatcher class, which is based on the Ratcliff-Obershelp algorithm. If you are interested, you can google it and learn more; I am not going to expand on it here. I would just like to point out its performance: according to the Python documentation, the basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case, while SequenceMatcher itself is quadratic time in the worst case and linear time in the best case. As you can see, it is not very fast, but it is acceptable. If you compare two very long texts that differ a lot, it will take quite a long time. That is likely also the main reason why Relativity only allows you to compare two documents that are considered near duplicates.
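To get a feel for SequenceMatcher before moving on, here is a minimal sketch. The two sample strings are made up for illustration and are not taken from the sample files; `ratio()` and `get_opcodes()` are standard difflib methods.

```python
from difflib import SequenceMatcher

# Hypothetical sample strings, made up for this illustration.
a = "The quick brown fox jumps over the lazy dog."
b = "The quick brown cat jumps over the lazy dog!"

# The first argument is an optional "junk" filter; None disables it.
# autojunk=False switches off the heuristic that ignores very frequent
# items in long sequences.
matcher = SequenceMatcher(None, a, b, autojunk=False)

# ratio() is a similarity score between 0.0 (nothing in common) and 1.0 (identical).
print(f"similarity: {matcher.ratio():.3f}")

# get_opcodes() lists the steps that turn a into b; print only the changes.
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(f"{tag:8} a[{i1}:{i2}]={a[i1:i2]!r} -> b[{j1}:{j2}]={b[j1:j2]!r}")
```

Because the inputs are plain strings, the matcher works character by character here. Passing lists of lines or words instead changes the unit of comparison.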
Now it is time to code ( : As with most Python snippets, a few lines are enough to get the job done. In difflib, there are basically two options to analyze and display the textual difference:
1) the difflib.HtmlDiff class, which generates an HTML file with a table and coloring to show the difference
2) the difflib.context_diff function, which shows the difference in text format
I will demonstrate both. I have uploaded all the source code and sample text files to GitHub. Feel free to check it out. The first snippet is below.
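A minimal sketch of the HtmlDiff approach, assuming two short in-memory documents in place of the sample text files. The document contents, the "Document A"/"Document B" labels and the report file name are all made up for illustration.

```python
import difflib
import tempfile
from pathlib import Path

# Hypothetical sample documents standing in for the real text files.
doc_a = """Dear John,
Please review the attached contract before Friday.
Regards,
Mary"""

doc_b = """Dear John,
Please review the attached agreement before Monday.
Best regards,
Mary"""

# HtmlDiff expects two lists of lines.
lines_a = doc_a.splitlines()
lines_b = doc_b.splitlines()

# make_file() returns a complete HTML page as a string; wrapcolumn keeps
# long lines from stretching the table.
html = difflib.HtmlDiff(wrapcolumn=80).make_file(
    lines_a, lines_b, fromdesc="Document A", todesc="Document B"
)

# Written to the system temp directory here; any path would do.
report_path = Path(tempfile.gettempdir()) / "diff_report.html"
report_path.write_text(html, encoding="utf-8")
print(f"report written to {report_path}")
```

Opening the generated file in a browser shows the two documents side by side, with changed, added and deleted parts in different colors.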
The result is shown below:
As for context_diff, it prints out a list of differences. Below is part of it, just for demonstration.
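As a rough illustration of what such a printout looks like, here is a minimal context_diff sketch. The line lists and the `a.txt`/`b.txt` labels are hypothetical and only serve as headers in the output.

```python
import difflib

# Hypothetical documents, already split into lines.
lines_a = [
    "Dear John,",
    "Please review the attached contract before Friday.",
    "Regards,",
    "Mary",
]
lines_b = [
    "Dear John,",
    "Please review the attached agreement before Monday.",
    "Regards,",
    "Mary",
]

# context_diff() yields the diff one line at a time; lineterm="" is used
# because the input lines carry no trailing newlines.
diff = list(
    difflib.context_diff(
        lines_a, lines_b, fromfile="a.txt", tofile="b.txt", lineterm=""
    )
)
print("\n".join(diff))
```

In the output, lines prefixed with `! ` were changed, `- ` marks lines only in the first document, and `+ ` marks lines only in the second.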
It is much easier to see the difference in the report than in the console printout. However, we notice that the comparison is made on characters. It works, but it is somewhat harder for a human to inspect. If we make the comparison on words rather than characters, the differences will be much easier to spot, right? How do we do that? Well, we need to break the text down into words (this is called tokenization). Fortunately, there are several ways to do that in Python. I chose the NLTK library. Below is the new snippet; we simply tokenize the text into a word list before passing it in for comparison:
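A minimal sketch of the word-level idea. To keep it runnable with the standard library alone, a simple regular expression stands in for NLTK here; the `tokenize` helper and the sample sentences are hypothetical, and an NLTK tokenizer could be swapped in for the helper.

```python
import difflib
import re


def tokenize(text):
    """Hypothetical stand-in for an NLTK tokenizer: split into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Made-up sample sentences for illustration.
text_a = "The contract was signed on 5 March by both parties."
text_b = "The agreement was signed on 7 March by both parties."

words_a = tokenize(text_a)
words_b = tokenize(text_b)

# Each word is now one item, so difflib compares word by word.
word_diff = list(
    difflib.context_diff(words_a, words_b, fromfile="a", tofile="b", n=1, lineterm="")
)
print("\n".join(word_diff))

# The same word lists can be fed to HtmlDiff for a colored report.
html = difflib.HtmlDiff().make_file(words_a, words_b, "Document A", "Document B")
```

Since every token becomes its own row, a report built this way is long and narrow, but each changed word stands out on its own.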
It now displays the differences word by word.
To summarize, textual comparison in Python is a piece of cake. We can compare any two documents and have Python generate a nice HTML report for us. There is much, much more that open source tools like Python can do, so stay tuned.