
High precision under adverse weather conditions

During a mid-March snowstorm, researchers from MIT Lincoln Laboratory achieved real-time, nighttime, centimeter-level vehicle localization while driving a test vehicle at highway speeds over roads whose lane markings were hidden by the snow. The sport utility vehicle used in the demonstration was equipped with a system that employs a novel ground-penetrating radar technique to localize the vehicle to within centimeters of the boundaries of its lane. The technique could solve one of the problems limiting the development and adoption of self-driving vehicles: How can a vehicle navigate to stay within its lane when bad weather obscures road markings?

“Most self-localizing vehicles use optical systems to determine their position,” explains Byron Stanley, the lead researcher on Lincoln Laboratory’s localizing ground-penetrating radar (LGPR) program. “They rely on optics to ‘see’ lane markings, road surface maps, and surrounding infrastructure to orient themselves. Optical systems work well in fair weather conditions, but it is challenging, even impossible, for them to work when snow covers the markings and surfaces or precipitation obscures points of reference.”

The laboratory has developed a sensor that uses very high frequency (VHF) radar reflections of underground features to generate a baseline map of a road’s subsurface. This map is generated during an LGPR-equipped vehicle’s drive along a roadway and becomes the reference for future travels over that stretch of road. On a revisit, the LGPR mounted beneath the vehicle measures the current reflections of the road’s subsurface features, and its algorithm estimates the vehicle’s location by comparing those current GPR readings to the baseline map stored in the system’s memory. An article titled “Localizing Ground Penetrating RADAR: A Step Toward Robust Autonomous Ground Vehicle Localization” in a recent issue of the Journal of Field Robotics describes Lincoln Laboratory’s demonstration of the use of LGPR to achieve 4-cm (1.6-inch) in-lane localization accuracy at speeds up to 60 miles per hour.
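In spirit (though this is not the Laboratory's published implementation), the matching step resembles a normalized cross-correlation search: score the current scan against the stored, GPS-tagged baseline profiles and take the best-scoring position. The one-dimensional simplification and all names below are illustrative.

```python
import numpy as np

def localize(current_scan: np.ndarray, baseline_map: np.ndarray) -> int:
    """Return the index of the baseline profile that best matches the current scan.

    baseline_map: profiles recorded on the mapping pass, one row per along-track
                  position (positions x depth bins); each row carries a GPS tag.
    current_scan: the profile measured right now (depth bins).
    """
    scan = (current_scan - current_scan.mean()) / (current_scan.std() + 1e-12)
    scores = []
    for profile in baseline_map:
        p = (profile - profile.mean()) / (profile.std() + 1e-12)
        scores.append(float(np.dot(scan, p)) / scan.size)  # normalized correlation
    return int(np.argmax(scores))  # maps back to the GPS tag stored with that profile
```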

“The LGPR uses relatively deep subsurface features as points of reference because these features are inherently stable and less susceptible to the dynamics of the world above,” Stanley says. Surface characteristics, such as lane striping, signs, trees, or the surrounding landscape, are dynamic; they change, or are changed, over time. The natural inhomogeneity of the highly static subterranean geology, for example differences in soil layers or rock beds, dominates the GPR reflection profiles. The GPR-produced map of the subsurface environment is a fairly complete picture, capturing every distinct object and soil feature that is not significantly smaller than a quarter of a wavelength.
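The article does not give the radar's operating frequency or the soil properties, but a rough back-of-the-envelope calculation with assumed numbers shows the scale that “a quarter of a wavelength” implies:

```python
# Illustrative numbers only; the LGPR's actual frequency and the soil permittivity are assumptions.
c = 3.0e8                    # speed of light in vacuum, m/s
f = 200e6                    # assumed VHF operating frequency, Hz
eps_r = 9.0                  # assumed relative permittivity of moist soil
v = c / eps_r ** 0.5         # propagation velocity in the soil, about 1e8 m/s
wavelength = v / f           # about 0.5 m
print(wavelength / 4)        # about 0.125 m: the smallest features the map resolves
```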

The LGPR’s main component is a waterproof, closely spaced 12-element antenna array that uses a very high frequency, stepped-frequency continuous wave to penetrate deeper beneath the ground than can typical GPR systems. This deeper look allows the LGPR to map the most stable, thus reliable, subsurface characteristics. “The GPR array is designed so that measurements from each element look similar and thus can be compared with others on subsequent passes,” Stanley says. A single-board computer is used to correlate the GPR data into a three-dimensional GPS-tagged GPR map and extract a corrected GPS position estimate.

The researchers who conducted the March test runs — Stanley, Jeffrey Koechling, and Henry Wegiel — alternated driving over a route in the Boston area during a snowstorm from midnight to 9 a.m. Correlation and overlap estimates demonstrated that the LGPR technique could accurately, and repeatedly, determine the vehicle’s position as it traveled. Data from the vehicle’s sensor suite were then post-processed to provide estimates of the system’s accuracy. “We thought the radar would see through snow, but it was wonderful to finally get the data proving it,” Koechling says.

Any competent spreadsheet user can construct a custom database

When an organization needs a new database, it typically hires a contractor to build it or buys a heavily supported product customized to its industry sector.

Usually, the organization already owns all the data it wants to put in the database. But writing complex queries in SQL or some other database scripting language to pull data from many different sources; to filter, sort, combine, and otherwise manipulate it; and to display it in an easy-to-read format requires expertise that few organizations have in-house.

New software from researchers at MIT’s Computer Science and Artificial Intelligence Laboratory could make databases much easier for laypeople to work with. The program’s home screen looks like a spreadsheet, but it lets users build their own database queries and reports by combining functions familiar to any spreadsheet user.

Simple drop-down menus let the user pull data into the tool from multiple sources. The user can then sort and filter the data, recombine it using algebraic functions, and hide unneeded columns and rows; the tool automatically generates the corresponding database queries.
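The article does not describe the tool's internals, but the flavor of that translation can be sketched as follows: each spreadsheet-style operation contributes a clause to a query the system maintains behind the scenes. Everything below (the table, columns, and operations) is hypothetical.

```python
# Hypothetical sketch: spreadsheet-style operations accumulating into one SQL query.
operations = [
    ("filter", "zip_code = '02139'"),   # like a spreadsheet filter
    ("sort",   "last_name ASC"),        # like sorting on a column
    ("hide",   "internal_id"),          # hidden columns drop out of the projection
]

columns = ["last_name", "phone", "internal_id", "zip_code"]
where, order = [], []
for op, arg in operations:
    if op == "filter":
        where.append(arg)
    elif op == "sort":
        order.append(arg)
    elif op == "hide":
        columns.remove(arg)

query = f"SELECT {', '.join(columns)} FROM customers"
if where:
    query += " WHERE " + " AND ".join(where)
if order:
    query += " ORDER BY " + ", ".join(order)
print(query)
# SELECT last_name, phone, zip_code FROM customers WHERE zip_code = '02139' ORDER BY last_name ASC
```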

The researchers also conducted a usability study that suggests that even in its prototype form, their tool could be easier to use than existing commercial database systems that represent thousands, if not tens of thousands, of programmer-hours of work.

“Organizations spend about $35 billion a year on relational databases,” says Eirik Bakke, an MIT graduate student in electrical engineering and computer science who led the development of the new tool. “They provide the software to store the data and to do efficient computation on the data, but they do not provide a user interface. So what inevitably ends up happening when you have something extremely industry-specific is, you have to hire a programmer who spends about a year of work to build a user interface for your particular domain.”

Familiar face

Bakke’s tool, which he developed with the help of his thesis advisor, MIT Professor of Electrical Engineering David Karger, could allow organizations to get up and running with a new database without having to wait for a custom interface. Bakke and Karger presented the tool at the Association for Computing Machinery’s International Conference on Management of Data last week.

The tool’s main drop-down menu has 17 entries, most of which — such as “hide,” “sort,” “filter,” and “delete” — will look familiar to spreadsheet users. In the conference paper, Bakke and Karger prove that those apparently simple functions are enough to construct any database query possible in SQL-92, which is the core of the version of SQL taught in most database classes.

Some database queries are simple: A company might, for instance, want a printout of the names and phone numbers of all of its customers. But it might also want a printout of the names and phone numbers of just those customers in a given zip code whose purchase totals exceeded some threshold amount over a particular time span. If each purchase has its own record in the database, the query will need to include code for summing up the purchase totals and comparing them to the threshold quantity.
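Written by hand, that kind of query needs a join, a grouping, and a threshold test. The sketch below uses Python's built-in sqlite3 module with invented table and column names; the point of Bakke and Karger's tool is that a user never has to write this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, phone TEXT, zip TEXT);
    CREATE TABLE purchases (customer_id INTEGER, amount REAL, purchased_on TEXT);
    INSERT INTO customers VALUES (1, 'Alice', '555-0100', '02139');
    INSERT INTO purchases VALUES (1, 80.0, '2016-03-01'), (1, 45.0, '2016-03-20');
""")

# Customers in one zip code whose purchases in a date range exceed a threshold.
rows = conn.execute("""
    SELECT c.name, c.phone, SUM(p.amount) AS total
    FROM customers c JOIN purchases p ON p.customer_id = c.id
    WHERE c.zip = '02139' AND p.purchased_on BETWEEN '2016-01-01' AND '2016-06-30'
    GROUP BY c.id
    HAVING SUM(p.amount) > 100
""").fetchall()
print(rows)  # [('Alice', '555-0100', 125.0)]
```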

What makes things even more complicated is that a database will generally store related data in different tables. For demonstration purposes, Bakke loaded several existing databases into his system. One of them, a database used at MIT to track research grants, has 35 separate tables; another, which records all the information in a university course catalogue, has 15.

Likewise, a company might store customers’ names and contact information in one table, lists of their purchase orders in another, and the items constituting each purchase order in a third. A relatively simple query that pulls up the phone numbers of everyone who bought a particular product in a particular date range could require tracking data across all three tables.

Bakke and Karger’s tool lets the user pull in individual columns from any table — say, name and phone number from the first, purchase orders and dates from the second, and products from the third. (The tool will automatically group the products associated with each purchase order together in a single spreadsheet “cell.”)

A filter function just like that found in most spreadsheet programs can restrict the date range and limit the results to those that include a particular product. The user can then hide any unnecessary columns, and the report is complete.
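For comparison, a hand-written equivalent of that three-table report might look like the query below (table and column names are again invented); in the tool, the same result comes from pulling in columns, applying a filter, and hiding what is not needed.

```python
# Hypothetical hand-written counterpart of the point-and-click report described above.
query = """
    SELECT c.name, c.phone, o.order_date, i.product
    FROM customers c
    JOIN orders o      ON o.customer_id = c.id
    JOIN order_items i ON i.order_id = o.id
    WHERE i.product = 'widget'
      AND o.order_date BETWEEN '2016-01-01' AND '2016-03-31'
"""
```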

Anonymity even if all but one of its servers are compromised

Anonymity networks protect people living under repressive regimes from surveillance of their Internet use. But the recent discovery of vulnerabilities in the most popular of these networks — Tor — has prompted computer scientists to try to come up with more secure anonymity schemes.

At the Privacy Enhancing Technologies Symposium in July, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory and the École Polytechnique Fédérale de Lausanne will present a new anonymity scheme that provides strong security guarantees but uses bandwidth much more efficiently than its predecessors. In experiments, the researchers’ system required only one-tenth as much time as similarly secure experimental systems to transfer a large file between anonymous users.

“The initial use case that we thought of was to do anonymous file-sharing, where the receiving end and sending end don’t know each other,” says Albert Kwon, a graduate student in electrical engineering and computer science and first author on the new paper. “The reason is that things like honeypotting” — in which spies offer services through an anonymity network in order to entrap its users — “are a real issue. But we also studied applications in microblogging, something like Twitter, where you want to anonymously broadcast your messages to everyone.”

The system devised by Kwon and his coauthors — his advisor, Srini Devadas, the Edwin Sibley Webster Professor of Electrical Engineering and Computer Science at MIT; David Lazar, also a graduate student in electrical engineering and computer science; and Bryan Ford SM ’02 PhD ’08, an associate professor of computer and communication sciences at the École Polytechnique Fédérale de Lausanne — employs several existing cryptographic techniques but combines them in a novel manner.

Shell game

The heart of the system is a series of servers called a mixnet. Each server permutes the order in which it receives messages before passing them on to the next. If, for instance, messages from senders Alice, Bob, and Carol reach the first server in the order A, B, C, that server would send them to the second server in a different order — say, C, B, A. The second server would permute them before sending them to the third, and so on.

An adversary that had tracked the messages’ points of origin would have no idea which was which by the time they exited the last server. It’s this reshuffling of the messages that gives the new system its name: Riffle.

Like many anonymity systems, Riffle also uses a technique known as onion encryption; “Tor,” for instance, is an acronym for “the onion router.” With onion encryption, the sending computer wraps each message in several layers of encryption, using a public-key encryption system like those that safeguard most financial transactions online. Each server in the mixnet removes only one layer of encryption, so that only the last server knows a message’s ultimate destination.
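A toy sketch of the two ideas together is shown below. It uses symmetric Fernet keys from the third-party cryptography package as a stand-in for the public-key onion encryption described above; the server count, message contents, and in-process “servers” are all illustrative.

```python
import random
from cryptography.fernet import Fernet

# One symmetric key per mixnet server; a stand-in for each server's public key.
server_keys = [Fernet.generate_key() for _ in range(3)]

def onion_wrap(message):
    """Wrap a message in one encryption layer per server."""
    for key in reversed(server_keys):           # apply the last server's layer first,
        message = Fernet(key).encrypt(message)  # so the first server peels the outermost
    return message

def run_mixnet(batch):
    """Each server peels off exactly one layer and reshuffles the batch."""
    for key in server_keys:
        batch = [Fernet(key).decrypt(m) for m in batch]
        random.shuffle(batch)   # the permutation that unlinks senders from messages
    return batch

ciphertexts = [onion_wrap(m) for m in (b"from Alice", b"from Bob", b"from Carol")]
print(run_mixnet(ciphertexts))  # plaintexts emerge, but in a scrambled order
```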

A mixnet with onion encryption is effective against a passive adversary, which can only observe network traffic. But it’s vulnerable to active adversaries, which can infiltrate servers with their own code. That is not improbable in anonymity networks, where the servers are frequently just volunteers’ Internet-connected computers loaded with special software.

If, for instance, an adversary that has commandeered a mixnet router wants to determine the destination of a particular message, it could simply replace all the other messages it receives with its own, bound for a single destination. Then it would passively track the one message that doesn’t follow its own prespecified route.

Public proof

To thwart message tampering, Riffle uses a technique called a verifiable shuffle. Because of the onion encryption, the messages that each server forwards look nothing like the ones it receives; it has peeled off a layer of encryption. But the encryption can be done in such a way that the server can generate a mathematical proof that the messages it sends are valid manipulations of the ones it receives.

Ensuring privacy in databases used in medical research

Genome-wide association studies, which try to find correlations between particular genetic variations and disease diagnoses, are a staple of modern medical research.

But because they depend on databases that contain people’s medical histories, they carry privacy risks. An attacker armed with genetic information about someone — from, say, a skin sample — could query a database for that person’s medical data. Even without the skin sample, an attacker who was permitted to make repeated queries, each informed by the results of the last, could, in principle, extract private data from the database.

In the latest issue of the journal Cell Systems, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory and Indiana University at Bloomington describe a new system that permits database queries for genome-wide association studies but reduces the chances of privacy compromises to almost zero.

It does that by adding a little bit of misinformation to the query results it returns. That means that researchers using the system could begin looking for drug targets with slightly inaccurate data. But in most cases, the answers returned by the system will be close enough to be useful.

And an instantly searchable online database of genetic data, even one that returned slightly inaccurate information, could make biomedical research much more efficient.

“Right now, what a lot of people do, including the NIH, for a long time, is take all their data — including, often, aggregate data, the statistics we’re interested in protecting — and put them into repositories,” says Sean Simmons, an MIT postdoc in mathematics and first author on the new paper. “And you have to go through a time-consuming process to get access to them.”

That process involves a raft of paperwork, including explanations of how the research enabled by the repositories will contribute to the public good, which requires careful review. “We’ve waited months to get access to various repositories,” says Bonnie Berger, the Simons Professor of Mathematics at MIT, who was Simmons’s thesis advisor and is the corresponding author on the paper. “Months.”

Bring the noise

Genome-wide association studies generally rely on genetic variations called single-nucleotide polymorphisms, or SNPs (pronounced “snips”). A SNP is a variation of one nucleotide, or DNA “letter,” at a specified location in the genome. Millions of SNPs have been identified in the human population, and certain combinations of SNPs can serve as proxies for larger stretches of DNA that tend to be conserved among individuals.

The new system, which Berger and Simmons developed together with Cenk Sahinalp, a professor of computer science at Indiana University, implements a technique called “differential privacy,” which has been a major area of cryptographic research in recent years. Differential-privacy techniques add a little bit of noise, or random variation, to the results of database searches, to confound algorithms that would seek to extract private information from the results of several, tailored, sequential searches.

The amount of noise required depends on the strength of the privacy guarantee — how low you want to set the likelihood of leaking private information — and the type and volume of data. The more people whose data a SNP database contains, the less noise the system needs to add; essentially, it’s easier to get lost in a crowd. But the more SNPs the system records, the more flexibility an attacker has in constructing privacy-compromising searches, which increases the noise requirements.
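The paper's exact mechanism and parameters are not reproduced here, but the standard Laplace flavor of differential privacy captures the trade-off: noise is drawn with a scale equal to the query's sensitivity divided by the privacy parameter epsilon, and one person's influence, hence the required noise, shrinks as the cohort grows. The statistic, names, and numbers below are illustrative.

```python
import numpy as np

def private_carrier_frequency(carrier_count, cohort_size, epsilon):
    """Differentially private estimate of the fraction of a cohort carrying a given SNP variant.

    Changing one person's record shifts this fraction by at most 1/cohort_size, so that
    is the query's sensitivity; Laplace noise is drawn with scale sensitivity / epsilon.
    """
    true_frequency = carrier_count / cohort_size
    sensitivity = 1.0 / cohort_size        # larger cohorts therefore need less noise
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_frequency + noise

# Illustrative numbers only: 412 carriers among 10,000 participants, privacy budget 0.5.
print(private_carrier_frequency(412, 10_000, epsilon=0.5))
```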

The researchers considered two types of common queries. In one, the user asks for the statistical correlation between a particular SNP and a particular disease. In the other, the user asks for a list of the SNPs in a particular region of the genome that correlate best with a particular disease.