PERSPECTIVE: Liberating Public Data – Easier Said Than Done
by Sasha Cuerda
For the past few years we have been working with the Connecticut Secretary of the State (SOTS) on initiatives related to the business registration data that the office processes and manages. These data are the official record for businesses that are required to register to operate within Connecticut.
However, accessing these data has historically been difficult. The office’s CONCORD system was built and optimized for the important work of managing transactions around business filings and registration. It was not designed to be searched or analyzed. The data also go back centuries, which is both good and bad.
When we started on this project we were faced with a significant initial obstacle. We were given raw data on a CD-ROM and a "data dictionary" that consisted of a photocopied database schema. Variable names were not particularly readable and while some of the relationships could be inferred from the table names, there was a lot of information that was missing. Another technical challenge was that most of the values were stored as character strings. This meant that the validation of data being input was taking place in the application that staff used to input data. We couldn't rely on the database to have done much work to catch "mistakes" that made it through the business systems. Therefore, many of the input errors such as spelling mistakes or typos remained in the data.
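One way to get a feel for this kind of problem is to profile a string-typed column and count how many values actually parse as the type they are supposed to represent. The sketch below is illustrative only: the column name `formation_date`, the date format, and the sample rows are assumptions, not the actual CONCORD schema or data.

```python
from datetime import datetime

def parse_date(raw, fmt="%Y-%m-%d"):
    """Return a date if the string parses in the assumed format, else None."""
    try:
        return datetime.strptime(raw.strip(), fmt).date()
    except ValueError:
        return None

def profile_column(rows, column, parser):
    """Count blank, valid, and invalid values in a string-typed column."""
    counts = {"blank": 0, "valid": 0, "invalid": 0}
    for row in rows:
        raw = row.get(column)
        if raw is None or not raw.strip():
            counts["blank"] += 1
        elif parser(raw) is None:
            counts["invalid"] += 1
        else:
            counts["valid"] += 1
    return counts

# Made-up rows standing in for records where every value is a string.
rows = [
    {"formation_date": "1987-03-14"},
    {"formation_date": "1987-13-14"},  # impossible month that a typed column would reject
    {"formation_date": "   "},
    {"formation_date": "unknown"},
]

print(profile_column(rows, "formation_date", parse_date))
```

A summary like this does not fix anything by itself, but it shows quickly which columns need cleaning rules before they can be searched or aggregated.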
For example, we had to determine a solution for the many instances where a business has both a "primary" address and also a mailing address. If you've ever had to fill out a form online where your address is required, you've probably encountered a case where you are presented with a check box that gives you the option to use your primary address as your mailing address.
Checking that box usually results in all of your address data being copied, or sometimes the system will grey out the form fields and keep you from editing them. Regardless, something is happening that captures the fact that both addresses are the same. Most likely this means that when the data are saved to a database, the same data are entered in both fields. This didn't seem to be the case with these data.
Additionally, we encountered businesses with a primary address and no mailing address, which was easy enough to reconcile (the assumption is that they are the same). But we also encountered the opposite, where a mailing address was present but no primary address. This poses a challenge. Does this mean that the business does not have a physical location? Is it a business on paper only? Or was there a data entry error resulting in the wrong set of address fields being updated? If the registering agent filled out a paper form, did they leave primary address blank? These inconsistencies make seemingly trivial questions, such as how many businesses were formed in town X in time period Y, hard to answer.
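The address cases described above can be made explicit with a small classification step, so that the ambiguous records are flagged rather than silently guessed at. This is a minimal sketch under assumed inputs (two free-text address strings per business); the category labels are hypothetical and are not taken from the actual project.

```python
def classify_addresses(primary, mailing):
    """Label a business record by which address fields are populated."""
    p = (primary or "").strip()
    m = (mailing or "").strip()
    if p and m:
        # Compare case-insensitively, since the values were typed by hand.
        return "same" if p.upper() == m.upper() else "different"
    if p:
        # Easy case: assume the mailing address matches the primary one.
        return "primary_only"
    if m:
        # Ambiguous case: no physical location, or a data entry error?
        return "mailing_only_needs_review"
    return "missing"

print(classify_addresses("10 Elm St, Hartford", "10 ELM ST, HARTFORD"))
print(classify_addresses("", "PO Box 12, Hartford"))
```

Counting records per label gives a rough sense of how many businesses cannot be placed in a town with confidence, which is exactly what makes the "how many businesses in town X" question hard.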
Given these issues, the first step in bringing these data into a modern, flexible search context was to build out the data dictionary and develop an understanding of the values that were present in the data. We conducted informational interviews with staff and contractors, explored the range of values in the data, and bit by bit built a model of how the data flowed, what they meant, and how they were structured in the context of the regulations and procedures that business owners go through when submitting information to the Secretary of the State. Once we were confident that our model was accurate and complete enough, we set about building the search system (http://searchctbusiness.ctdata.org/).
The search interface offered by the CONCORD system is quite fast, but suffers from a few significant limitations, primarily that there is limited support for wild card searches (e.g. a search for pizza will not find pizzeria) and that the search is very character/word sensitive.
For example, say you wanted to search for businesses that contain the term "New Haven" in their name. This probably indicates something about where they are located, but it might also be used as a name by housing developers or in a variety of other circumstances. With CONCORD, you'd get roughly 26 pages of results, all for businesses starting with "New Haven". However, you won't find the Advanced Nursing & Rehabilitation Center of New Haven, LLC, Peoples Church of New Haven, League of Women Voters of New Haven, or any number of other businesses whose name features but does not start with New Haven. Moreover, if you want to search for a business whose address is in New Haven, you cannot. These were two problems we wanted to resolve.
The Secretary of the State also wanted to support searching by place and by date of formation. We handled this by building a search index table that would allow us to very quickly find matches and return enough information to the user for them to decide if they wanted to explore a given business in more detail.
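To illustrate the idea of a search index table, here is a minimal sketch using SQLite from Python. The table name, column names, and sample businesses are all invented for the example, and the article does not say which database or search technology the real system uses. The point is only that a flat, denormalized table can support name-contains, place, and date-of-formation searches in one query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE search_index (
           business_id    INTEGER PRIMARY KEY,
           name           TEXT NOT NULL,
           name_lower     TEXT NOT NULL,   -- lowercased copy for matching
           city           TEXT,
           formation_date TEXT             -- ISO dates sort correctly as text
       )"""
)
conn.execute("CREATE INDEX idx_city ON search_index (city)")
conn.execute("CREATE INDEX idx_formed ON search_index (formation_date)")

# Invented sample records, not real registrations.
sample = [
    (1, "New Haven Widgets LLC", "New Haven", "2009-05-01"),
    (2, "Widget Makers of New Haven Inc", "Hamden", "1998-02-11"),
    (3, "Elm City Pizzeria LLC", "New Haven", "2015-07-19"),
    (4, "Hartford Pizza Co", "Hartford", "2012-03-03"),
]
conn.executemany(
    "INSERT INTO search_index VALUES (?, ?, ?, ?, ?)",
    [(i, n, n.lower(), c, d) for i, n, c, d in sample],
)

def search(conn, term=None, city=None, formed_after=None):
    """Return (business_id, name) rows matching every filter that is given."""
    clauses, params = [], []
    if term:
        clauses.append("name_lower LIKE ?")   # matches anywhere in the name
        params.append(f"%{term.lower()}%")
    if city:
        clauses.append("city = ?")
        params.append(city)
    if formed_after:
        clauses.append("formation_date >= ?")
        params.append(formed_after)
    where = " AND ".join(clauses) or "1=1"
    sql = f"SELECT business_id, name FROM search_index WHERE {where} ORDER BY business_id"
    return conn.execute(sql, params).fetchall()

print(search(conn, term="new haven"))   # names containing, not just starting with, the term
print(search(conn, term="pizz"))        # partial term covers both pizza and pizzeria
print(search(conn, city="New Haven", formed_after="2010-01-01"))
```

A substring match with a leading wildcard cannot use an ordinary index, so on a table with hundreds of thousands of rows a full-text index would likely be a better fit; the sketch keeps to plain SQL for readability.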
In the course of doing this work, we've worked with folks from other state agencies. Given the importance of business activity, a number of state agencies place high value on the business registration data. However, the structure of the live database limited how data could be extracted and analyzed. Our workflow enabled us to support more free-form exploration of the data on their behalf and led to a number of additional projects where we linked these data with other internal datasets.
Recently we worked with a state agency to develop a methodology to link businesses registered with SOTS with internal and third-party data. Linking business names is hard to do manually. In some cases, it is possible to handle lookups on an ad hoc basis, but bulk work is very time-consuming. Moreover, it is quite common for businesses, particularly small ones, to use slight variations of their names in different contexts. They might be required to add something to their name when registering to avoid name conflicts, but in practice they may advertise themselves with a simpler name.
Businesses can also formally change their name with the Secretary of the State, but those changes are almost certainly not reflected in other datasets and databases. Larger companies are often structured in complex ways for legal purposes; we may attach one name to a company when discussing it in policy terms, but from a legal perspective, that business might exist as a group of distinct entities. All of these issues make matching lists of businesses challenging, so much so that there is a technical term for the problem: entity resolution.
Entity resolution takes two forms: grouping entities that can be functionally considered one "business" for a given context, and linking different representations of the same underlying entity. We were able to conduct a number of matching runs using business name, address, city, and the names of principals to build out a list of linked entities, eliminating the need to manually link hundreds of thousands of entities.
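As a rough illustration of the linking form of entity resolution, the sketch below normalizes business names, scores candidate pairs with the Python standard library's `difflib.SequenceMatcher`, and accepts a link only when the city also agrees. This is a stand-in technique chosen for brevity; the article does not describe the actual matching method, and the suffix list, the 0.85 threshold, and the sample records are all assumptions.

```python
import re
from difflib import SequenceMatcher

# Legal suffixes that often differ between a registration and everyday use.
SUFFIXES = {"llc", "inc", "corp", "co", "ltd", "company", "incorporated", "corporation"}

def normalize(name):
    """Lowercase, drop punctuation, and remove common legal suffixes."""
    cleaned = name.lower().replace(".", "")
    tokens = re.sub(r"[^a-z0-9 ]", " ", cleaned).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

def similarity(a, b):
    """Similarity of two normalized names, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def link(registered, external, threshold=0.85):
    """Link each external record to its best registered match, if good enough."""
    links = []
    for ext in external:
        best = max(registered, key=lambda r: similarity(r["name"], ext["name"]))
        score = similarity(best["name"], ext["name"])
        same_city = best["city"].lower() == ext["city"].lower()
        if score >= threshold and same_city:
            links.append((ext["name"], best["name"], round(score, 2)))
    return links

# Invented records standing in for a registration list and a third-party list.
registered = [
    {"name": "Shoreline Auto Repair, Inc.", "city": "Milford"},
    {"name": "Acme Widgets, L.L.C.", "city": "Hartford"},
]
external = [
    {"name": "SHORELINE AUTO REPAIR", "city": "milford"},
    {"name": "Harbor View Dental", "city": "Milford"},
]

print(link(registered, external))
```

Comparing every external record against every registered one is fine for a demonstration but scales poorly; at hundreds of thousands of entities one would typically first narrow candidates (for example by city or by a shared name token) and add further fields such as address and principals to the score.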
We have more to do with these technologies and with these data. We want to add a map interface to our business search portal, which would enable users to search for businesses within a certain geographic area, making it possible to ask questions like: How many businesses were formed in our main street district since we altered the zoning? Further, since we’ve undertaken this work, more powerful open source solutions have been developed and we would like to optimize our search backend with these new offerings.
Liberating these data has been a fun challenge. It has demonstrated to us the value of investing in open source solutions and approaches. By making these data open, we have also learned how valuable they are to many users across the state: not only other state agencies but also economic development organizations, regional planning associations, and chambers of commerce.
_________________________
Sasha Cuerda is Director of Technology at the Connecticut Data Collaborative. He is a software developer with experience building data visualization tools, developing database systems for managing spatial data, and developing data processing workflows. He is also a trained urban planner and geographer. This article first appeared, in a somewhat lengthier version, on the website ctdata.org, the Connecticut Data Collaborative.

