Any competent spreadsheet user can construct custom database

When an organization needs a new database, it typically hires a contractor to build it or buys a heavily supported product customized to its industry sector.

Usually, the organization already owns all the data it wants to put in the database. But writing complex queries in SQL or some other database scripting language to pull data from many different sources; to filter, sort, combine, and otherwise manipulate it; and to display it in an easy-to-read format requires expertise that few organizations have in-house.

New software from researchers at MIT’s Computer Science and Artificial Intelligence Laboratory could make databases much easier for laypeople to work with. The program’s home screen looks like a spreadsheet, but it lets users build their own database queries and reports by combining functions familiar to any spreadsheet user.

Simple drop-down menus let the user pull data into the tool from multiple sources. The user can then sort and filter the data, recombine it using algebraic functions, and hide unneeded columns and rows, and the tool will automatically generate the corresponding database queries.

The researchers also conducted a usability study that suggests that even in its prototype form, their tool could be easier to use than existing commercial database systems that represent thousands, if not tens of thousands, of programmer-hours of work.

“Organizations spend about $35 billion a year on relational databases,” says Eirik Bakke, an MIT graduate student in electrical engineering and computer science who led the development of the new tool. “They provide the software to store the data and to do efficient computation on the data, but they do not provide a user interface. So what inevitably ends up happening when you have something extremely industry-specific is, you have to hire a programmer who spends about a year of work to build a user interface for your particular domain.”

Familiar face

Bakke’s tool, which he developed with the help of his thesis advisor, MIT Professor of Electrical Engineering David Karger, could allow organizations to get up and running with a new database without having to wait for a custom interface. Bakke and Karger presented the tool at the Association for Computing Machinery’s International Conference on Management of Data last week.

The tool’s main drop-down menu has 17 entries, most of which — such as “hide,” “sort,” “filter,” and “delete” — will look familiar to spreadsheet users. In the conference paper, Bakke and Karger prove that those apparently simple functions are enough to construct any database query possible in SQL-92, which is the core of the version of SQL taught in most database classes.

Some database queries are simple: A company might, for instance, want a printout of the names and phone numbers of all of its customers. But it might also want a printout of the names and phone numbers of just those customers in a given zip code whose purchase totals exceeded some threshold amount over a particular time span. If each purchase has its own record in the database, the query will need to include code for summing up the purchase totals and comparing them to the threshold quantity.

What makes things even more complicated is that a database will generally store related data in different tables. For demonstration purposes, Bakke loaded several existing databases into his system. One of them, a database used at MIT to track research grants, has 35 separate tables; another, which records all the information in a university course catalogue, has 15.

Likewise, a company might store customers’ names and contact information in one table, lists of their purchase orders in another, and the items constituting each purchase order in a third. A relatively simple query that pulls up the phone numbers of everyone who bought a particular product in a particular date range could require tracking data across all three tables.

Bakke and Karger’s tool lets the user pull in individual columns from any table — say, name and phone number from the first, purchase orders and dates from the second, and products from the third. (The tool will automatically group the products associated with each purchase order together in a single spreadsheet “cell.”)

A filter function just like that found in most spreadsheet programs can restrict the date range and limit the results to those that include a particular product. The user can then hide any unnecessary columns, and the report is complete.