PERSPECTIVE: Liberating Public Data – Easier Said Than Done
by Sasha Cuerda
For the past few years we have been working with the Connecticut Secretary of the State (SOTS) on initiatives related to the business registration data that the office processes and manages. These data are the official record for businesses that are required to register to operate within Connecticut.
However, accessing these data has historically been difficult. The office’s CONCORD system was built and optimized for the important work of managing transactions around business filings and registration. It was not designed as a system to be searched or analyzed for data. The data are also old, with records going back centuries, which is both good and bad.
When we started on this project we were faced with a significant initial obstacle. We were given raw data on a CD-ROM and a "data dictionary" that consisted of a photocopied database schema. Variable names were not particularly readable, and while some of the relationships could be inferred from the table names, a lot of information was missing. Another technical challenge was that most of the values were stored as character strings. This meant that validation of incoming data took place in the application that staff used for data entry, not in the database itself. We couldn't rely on the database to have done much work to catch "mistakes" that made it through the business systems. As a result, many input errors, such as spelling mistakes and typos, remained in the data.
For example, we had to determine a solution for the many instances where a business has both a "primary" address and also a mailing address. If you've ever had to fill out a form online where your address is required, you've probably encountered a case where you are presented with a check box that gives you the option to use your primary address as your mailing address.
Checking that box usually results in all of your address data being copied, or sometimes the system will grey out the form fields and keep you from editing them. Regardless, something is happening that captures the fact that both addresses are the same. Most likely this means that when the data are saved to a database, the same data are entered in both fields. This didn't seem to be the case with these data.
Additionally, we encountered businesses with a primary address and no mailing address, which was easy enough to reconcile (the assumption is that they are the same). But we also encountered the opposite, where a mailing address was present but no primary address. This poses a challenge. Does this mean that the business does not have a physical location? Is it a business on paper only? Or was there a data entry error resulting in the wrong set of address fields being updated? If the registering agent filled out a paper form, did they leave primary address blank? These inconsistencies make seemingly trivial questions, such as how many businesses were formed in town X in time period Y, hard to answer.
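As a rough illustration of the reconciliation rule described above, the following Python sketch labels each record by which address fields are populated, copies the primary address into the mailing address when only the primary is present, and flags the ambiguous cases for review. The function names and the "neither" category are assumptions made for illustration; they are not taken from CONCORD or from our actual pipeline.

```python
from typing import Optional


def _present(value: Optional[str]) -> bool:
    """Treat None, empty strings, and whitespace-only strings as missing."""
    return bool(value and value.strip())


def classify_addresses(primary: Optional[str], mailing: Optional[str]) -> str:
    """Label a record by which address fields are populated."""
    if _present(primary) and _present(mailing):
        return "both"
    if _present(primary):
        return "primary_only"
    if _present(mailing):
        return "mailing_only"
    return "neither"


def resolve_addresses(primary: Optional[str], mailing: Optional[str]) -> dict:
    """Apply the 'mailing defaults to primary' assumption and flag ambiguity."""
    status = classify_addresses(primary, mailing)
    if status == "primary_only":
        # Assumption from the text: a missing mailing address equals the primary.
        mailing = primary
    return {
        "primary": primary if _present(primary) else None,
        "mailing": mailing if _present(mailing) else None,
        "status": status,
        # A mailing-only record cannot be resolved automatically; flagging
        # "neither" as well is an extra assumption of this sketch.
        "needs_review": status in ("mailing_only", "neither"),
    }
```

Keeping the original status next to the filled-in value means that a later count of businesses per town can include or exclude the ambiguous records explicitly.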
Given these issues, the first step in bringing these data into a modern, flexible search context was to build out the data dictionary and develop an understanding of the values that were present in the data. We conducted informational interviews with staff and contractors, explored the range of values in the data, and bit by bit built a model of how the data flowed, what they meant, and how they were structured in the context of the regulations and procedures that business owners go through when submitting information to the Secretary of the State. Once we were confident that our model was accurate and complete enough, we set about building the search system (http://searchctbusiness.ctdata.org/).
The search interface offered by the CONCORD system is quite fast, but suffers from a few significant limitations, primarily that there is limited support for wildcard searches (e.g. a search for pizza will not find pizzeria) and that the search is very sensitive to the exact characters and words entered.
For example, say you wanted to search for businesses that contain the term "New Haven" in their name. This probably indicates something about where they are located, but it might also be used as a name by housing developers or in a variety of other circumstances. With CONCORD, you'd get roughly 26 pages of results, all for businesses starting with "New Haven". However, you won't find the Advanced Nursing & Rehabilitation Center of New Haven, LLC, Peoples Church of New Haven, League of Women Voters of New Haven, or any number of other businesses whose name features but does not start with New Haven. Moreover, if you want to search for a business whose address is in New Haven, you cannot. These were two problems we wanted to resolve.
The Secretary of the State also wanted to support searching by place and by date of formation. We handled this by building a search index table that would allow us to very quickly find matches and return enough information to the user for them to decide if they wanted to explore a given business in more detail.
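To make the idea of a search index table more concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the schema or database we actually used; the table name, column names, and the `build_index` and `search` helpers are hypothetical. It shows a flat table holding a lowercased copy of each name and city, so that a term can match anywhere in a name, not just at the start, and so that place and formation date can be combined with the name search.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory index table from (name, city, formed_on) tuples.

    formed_on is assumed to be an ISO date string (YYYY-MM-DD), so plain
    string comparison orders dates correctly.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE search_index ("
        " business_id INTEGER PRIMARY KEY,"
        " name TEXT, name_lower TEXT, city_lower TEXT, formed_on TEXT)"
    )
    conn.executemany(
        "INSERT INTO search_index (name, name_lower, city_lower, formed_on)"
        " VALUES (?, ?, ?, ?)",
        [(name, name.lower(), city.lower(), formed_on)
         for name, city, formed_on in rows],
    )
    return conn


def search(conn, term=None, city=None, formed_since=None):
    """Return matching business names, sorted alphabetically."""
    clauses, params = [], []
    if term:
        # instr() finds the term anywhere in the name, not only as a prefix.
        clauses.append("instr(name_lower, ?) > 0")
        params.append(term.lower())
    if city:
        clauses.append("city_lower = ?")
        params.append(city.lower())
    if formed_since:
        # Matches businesses formed on or after the given date.
        clauses.append("formed_on >= ?")
        params.append(formed_since)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = "SELECT name FROM search_index" + where + " ORDER BY name"
    return [row[0] for row in conn.execute(sql, params)]
```

With an index like this, `search(conn, term="new haven")` would return names that contain the phrase anywhere, and `search(conn, city="New Haven", formed_since="2010-01-01")` would answer the place-and-date question. A substring scan of this kind does not scale to very large tables; a production system would more likely rely on a dedicated full-text index.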
In the course of doing this work, we've worked with folks from other state agencies. Given the importance of business activity, a number of state agencies place high value on the business registration data. However, the structure of the live database limited how data could be extracted and analyzed. Our workflow enabled us to support more free-form exploration of the data on their behalf and led to a number of additional projects where we linked these data with other internal datasets.
Recently we worked with a state agency to develop a methodology to link businesses registered with SOTS with internal and third-party data. Linking business names is hard to do manually. In some cases, it is possible to handle lookups on an ad hoc basis, but bulk work is very time-consuming. Moreover, it is quite common for businesses, particularly small businesses, to use slight variations of their names in different contexts. They might be required to add something to their name when registering to avoid name conflicts, but in practice they may advertise themselves with a simpler name.
Businesses can also formally change their name with the Secretary of the State, but those changes are almost certainly not reflected in other datasets and databases. Larger companies are often structured in complex ways for legal purposes; we may attach one name to a company when discussing it in policy terms, but from a legal perspective, that business might exist as a group of distinct entities. All of these issues make matching lists of businesses challenging, so much so that there is a technical term for the problem: entity resolution.
Entity resolution takes two forms: grouping entities that can be functionally considered one "business" for a given context, and linking different representations of the same underlying entity. We were able to conduct a number of matching runs using business name, address, city, and the names of principals to build out a list of linked entities, eliminating the need to manually link hundreds of thousands of entities.
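The linking form of entity resolution can be sketched in a few lines of Python. This is an illustrative sketch only, not the matching method we used: it normalizes names by stripping punctuation and common legal suffixes, restricts candidates to the same city, and scores name similarity with the standard library's `difflib.SequenceMatcher`. The suffix list, the 0.85 threshold, and the record layout (dicts with `name` and `city` keys) are all assumptions.

```python
import re
from difflib import SequenceMatcher

# Assumed list of tokens to ignore when comparing names.
LEGAL_SUFFIXES = {"llc", "inc", "corp", "corporation", "co", "company", "ltd"}


def normalize_name(name: str) -> str:
    """Lowercase, drop apostrophes and punctuation, remove legal suffixes."""
    text = name.lower().replace("'", "").replace("\u2019", "")
    tokens = re.sub(r"[^a-z0-9 ]", " ", text).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized business names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def link_records(registry, external, threshold=0.85):
    """For each external record, find the best same-city registry match."""
    links = []
    for ext in external:
        best, best_score = None, 0.0
        for reg in registry:
            # Blocking step: only compare records in the same city.
            if reg["city"].lower() != ext["city"].lower():
                continue
            score = name_similarity(reg["name"], ext["name"])
            if score > best_score:
                best, best_score = reg, score
        if best is not None and best_score >= threshold:
            links.append((ext["name"], best["name"], round(best_score, 2)))
    return links
```

Comparing every external record against every registry record is quadratic, so at the scale of hundreds of thousands of entities the blocking step (here, matching on city) matters as much as the similarity score. Adding address and principal names as further evidence would follow the same pattern.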
We have more to do with these technologies and with these data. We want to add a map interface to our business search portal, which would enable users to search for businesses within a certain geographic area, making it possible to ask questions like: How many businesses were formed in our main street district since we altered the zoning? Further, since we’ve undertaken this work, more powerful open source solutions have been developed and we would like to optimize our search backend with these new offerings.
Liberating these data has been a fun challenge. It has demonstrated to us the value of investing in open source solutions and approaches. By making these data open, we have also learned how valuable they are to many users across the state: not only other state agencies but also economic development organizations, regional planning associations, and chambers of commerce.
_________________________
Sasha Cuerda is Director of Technology at the Connecticut Data Collaborative. He is a software developer with experience building data visualization tools, developing database systems for managing spatial data, and developing data processing workflows. He is also a trained urban planner and geographer. This article first appeared, in a somewhat lengthier version, on the website ctdata.org, the Connecticut Data Collaborative.