Why you should consider using asynchronous programming for your eDiscovery work

To most programmers in the eDiscovery world, asynchronous programming might sound like something new. However, the concept has existed for decades, and there are implementations in pretty much every mainstream language. Even if you are not familiar with it, you have probably heard of multi-threading or multi-processing as ways to achieve concurrency. Async programming is a way of achieving concurrency using only one thread! Because there is no need to start a new thread or a new process, the OS does not have to schedule anything extra. The switching between tasks is done within the programming language itself.

A simple analogy for how asynchronous programming works

Let's say you are working on 3 tasks. You can start each task immediately, and each one then runs by itself without needing you to intervene until it finishes. Let's assume task A takes 5 mins, task B takes 3 mins, and task C takes 2 mins. The traditional way of finishing them is to do them one by one. The order doesn't matter: you will wait 10 mins for all of them to finish. In the asynchronous world, you would start task A first, and then start tasks B and C right after it. Since task A takes the longest, by the time task A is done, tasks B and C are already done too. In this case, you need only slightly more than 5 mins to finish everything. Now you see how this asynchronous thing works? In a nutshell, it allows you to do other things while waiting for long-running tasks.
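The analogy can be sketched with Python's asyncio module. This is only an illustration under assumptions: the names run_task, run_sequentially, run_concurrently and compare are made up for this sketch, asyncio.sleep stands in for a task that runs by itself, and each minute is scaled down to 0.02 seconds so the script finishes quickly.

```python
import asyncio
import time

# Assumption for the demo: 1 "minute" in the analogy = 0.02 seconds here.
SCALE = 0.02
DURATIONS = {"A": 5, "B": 3, "C": 2}  # minutes, as in the analogy


async def run_task(name, minutes):
    # asyncio.sleep stands in for a task that progresses without our attention.
    await asyncio.sleep(minutes * SCALE)
    return name


async def run_sequentially():
    # Start a task, wait for it to finish, then start the next one.
    start = time.perf_counter()
    for name, minutes in DURATIONS.items():
        await run_task(name, minutes)
    return time.perf_counter() - start


async def run_concurrently():
    # Start all three tasks, then wait until every one of them is done.
    start = time.perf_counter()
    await asyncio.gather(*(run_task(n, m) for n, m in DURATIONS.items()))
    return time.perf_counter() - start


def compare():
    sequential = asyncio.run(run_sequentially())
    concurrent = asyncio.run(run_concurrently())
    return sequential, concurrent


if __name__ == "__main__":
    sequential, concurrent = compare()
    print(f"one by one: {sequential:.3f}s, all at once: {concurrent:.3f}s")
```

The one-by-one run should take roughly the sum of the three durations, while the concurrent run should take roughly as long as the longest task, all on a single thread.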

In our eDiscovery world, we deal with a lot of tasks that are mostly waiting: waiting for a web server to respond, waiting for the network to send a file, waiting for a database to return a result, and so on. If we do everything synchronously, we can waste a lot of time waiting. Asynchronous programming is here to the rescue.

An example implemented in Python

This is one of many things that I do quite often in my work. I need to write a script to interact with a RESTful server. You could imagine it is a Relativity instance, and you are querying it via RESTful APIs to get some data. In the real world, when we do eDiscovery work, we never work with just a few documents, right? We work with hundreds of thousands, sometimes millions, of documents. In this example, we are querying a dummy RESTful API, http://dummy.restapiexample.com/api/v1/employees. It returns a JSON string containing all the dummy employees. We are curious to know how many employees we have. To simulate a real-world workload, we make 1000 calls to this endpoint and time the process. Let us see the time difference between sync and async programming.
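The comparison can be sketched roughly as follows. This is a minimal sketch rather than the original script: to stay self-contained and runnable offline it does not call the endpoint at all, but simulates each request with a fixed delay. The helpers fake_request_sync and fake_request_async, the LATENCY value, the reduced call count, and the assumed payload shape (a list of employees under a "data" key) are all assumptions made for illustration. In a real script the two fake functions would be replaced by an HTTP call made with a synchronous client and an asynchronous client respectively.

```python
import asyncio
import time

LATENCY = 0.01    # assumed wait per request in seconds (hypothetical)
NUM_CALLS = 50    # scaled down from the 1000 calls described in the text
NUM_WORKERS = 5


def fake_request_sync():
    # Stand-in for a blocking HTTP GET; the payload shape is an assumption.
    time.sleep(LATENCY)
    return {"data": [{"id": 1}, {"id": 2}]}


async def fake_request_async():
    # Stand-in for a non-blocking HTTP GET made with an async client.
    await asyncio.sleep(LATENCY)
    return {"data": [{"id": 1}, {"id": 2}]}


def run_sync(num_calls=NUM_CALLS):
    # One request at a time: total time grows with num_calls * LATENCY.
    start = time.perf_counter()
    counts = [len(fake_request_sync()["data"]) for _ in range(num_calls)]
    return counts, time.perf_counter() - start


async def worker(queue, counts):
    # Each worker keeps taking jobs until the queue is empty.
    while True:
        try:
            queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        payload = await fake_request_async()
        counts.append(len(payload["data"]))


async def _run_async(num_calls, num_workers):
    queue = asyncio.Queue()
    for i in range(num_calls):
        queue.put_nowait(i)
    counts = []
    start = time.perf_counter()
    # All workers share one thread; they take turns while others are waiting.
    await asyncio.gather(*(worker(queue, counts) for _ in range(num_workers)))
    return counts, time.perf_counter() - start


def run_async(num_calls=NUM_CALLS, num_workers=NUM_WORKERS):
    return asyncio.run(_run_async(num_calls, num_workers))


if __name__ == "__main__":
    _, sync_time = run_sync()
    _, async_time = run_async()
    print(f"sync: {sync_time:.2f}s, async with {NUM_WORKERS} workers: {async_time:.2f}s")
```

The timings printed by this sketch come from simulated latency only, so they will not match measurements taken against a real server.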

You can see that with only 5 workers, the async version is already 10 times faster than the sync version.

Conclusion

Asynchronous programming has already been adopted by the Relativity Kepler Framework and the Relativity Service API. I will not be surprised to see a broader shift toward async programming in the world of eDiscovery.
