Late last year, Google announced the release of Google Refine, a free data-cleaning tool described by Google as “…a power tool for working with messy data.” In the litigation support industry, data often needs to be converted, filtered, or modified, and at first glance Google Refine seems a perfect fit for that kind of processing. Upon closer inspection, however, slow load times, system limitations, and technical-jargon roadblocks keep Google Refine from being a useful tool for large-scale data processing.
When I began to explore Google Refine, my first concern was security, since Google Refine is installed on your local computer but runs through a web browser. After determining that Google Refine also works offline, meaning no data is transferred across the internet, I felt confident there were no security issues and decided to load some test data into Google’s tool. The data consisted of approximately 30 metadata fields for about 50,000 documents. This is where the first hiccup occurred. Google Refine can import only CSV files or delimited text files, and that’s it. If the data is in any other format, it will need to be converted prior to importing. With large data sets, converting all documents to CSV or delimited text files could take valuable time.
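To give a sense of what that conversion step involves, here is a minimal Python sketch that flattens metadata records into the CSV format Google Refine expects. The field names and sample values are invented for illustration; real load files would come from your review platform’s export.

```python
import csv
import io

# Hypothetical metadata records, e.g. parsed from a review-platform export.
records = [
    {"DOCID": "DOC0001", "CUSTODIAN": "Smith, J.", "TAGS": "Privileged"},
    {"DOCID": "DOC0002", "CUSTODIAN": "Doe, A.", "TAGS": "privileged "},
]

def to_csv(rows):
    """Write dict records out as CSV text suitable for Google Refine import."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

csv_text = to_csv(records)
```

For tens of thousands of documents a script like this runs quickly; the time cost the article describes comes from doing this for every source format in a production collection.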
Filetype conversion aside, as I continued my exploration, I discovered that the import tool was very simple and user friendly. Unfortunately, more troubles were right around the corner. Once importing began, I waited and waited…and waited. The import of 50,000 records (a relatively small amount of data) took around 5 minutes to complete. To further test Google Refine’s limitations, I tried a larger import of around 400,000 records. After about 15 minutes, I started receiving “Out of Memory” errors. I decided to give Google Refine a break, since 400,000 records is a very large data set, and continued testing; if the tool performed well enough, I could work around the import size limitations.
At first glance, one of the more promising features of Google Refine was the Facet tool, which allows a user to summarize the unique values in a column. Users can then bulk modify these values to standardize the data. Working with the same 50,000 record test data, I took the “TAGS” field and used the Facet tool to list all the unique TAGS in the data, hoping to correct discrepancies in spelling, capitalization and punctuation. In my data, there were approximately 175 different TAGS, with about 20 variations in spelling, case and punctuation. When I attempted to enable faceting on the TAGS column, I was disappointed to see the message “Too Many Choices,” which rendered the feature useless on my data set.
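The idea behind faceting can be sketched outside the tool. The snippet below (with invented tag values) counts how many raw variants collapse into each normalized value, which is essentially the summary a facet provides before bulk editing; the normalization rules shown are one possible choice, not Google Refine’s own.

```python
from collections import Counter

# Invented TAGS values showing the spelling, case, and punctuation
# variations described above.
tags = ["Hot Doc", "hot doc", "Hot Doc.", "Privileged", "privileged ", "Responsive"]

def normalize(tag):
    """Collapse case, surrounding whitespace, and trailing punctuation."""
    return tag.strip().rstrip(".,;").lower()

# Map each cleaned value to how many raw variants collapsed into it.
facet = Counter(normalize(t) for t in tags)
```

In this toy example, six raw values reduce to three clean tags; with 175 tags and roughly 20 variations, this is exactly the clean-up the Facet tool promises but could not deliver on my data.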
One final roadblock I discovered during my testing was the programming knowledge required. While Google Refine offers many features beyond those I tested, most require knowledge of Clojure, Jython or the Google Refine Expression Language (GREL). Given that Google’s search engine is used by ordinary computer users worldwide, it is disappointing to find that the majority of Google Refine’s functions can be utilized only by a select group of individuals with programming expertise.
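For readers curious what those expressions look like: a common GREL transform such as `value.trim().toLowercase()` trims whitespace from a cell and lower-cases it. A rough Python equivalent (the sample value is invented) would be:

```python
def grel_style_transform(value):
    """Mimic the GREL transform value.trim().toLowercase():
    strip surrounding whitespace, then lower-case the cell value."""
    return value.strip().lower()

cleaned = grel_style_transform("  Attorney Work Product ")
```

The transform itself is trivial; the barrier the article describes is that writing even this one-liner assumes a comfort with expression languages that most litigation support users do not have.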
Google Refine allows only relatively small sets of data to be imported and can standardize only limited discrepancies within them. It seems the tool, created to group and standardize data, becomes less useful precisely when it is needed most. Prior to testing, I had high expectations for the many features offered by Google Refine. However, I quickly realized it was not designed to handle the volumes of data we encounter on a regular basis in the ESI industry.