2013, October 21

Relations as First-Class Citizen - A Paradigm Shift for Software/Database Interoperability

I'm happy to announce that Alf 0.15.0 has just been released and with it, this web site! I've been thinking about all of this for many years, often as a cross-cutting concern in my (other) research work. I've been hacking on Alf in particular during my free time for more than two years now. I think it is time to share it in a slightly more official way than as an (almost invisible) open-source research prototype on GitHub. Recent personal events gave it a serious boost and a few people convinced me to give it more visibility. So here we go.

Alf is a modern, powerful implementation of relational algebra. It brings relational algebra where you don't necessarily expect it: in shell, in scripting and for building complex software. Alf has a rich set of features. Among them, it allows you to:

Query .json, .csv, .yaml files and convert from one format to the other with ease,
Query SQL databases with a sounder and more powerful query language than SQL itself,
Export structured and so-called "semi-structured" query results in various exchange formats,
Query multiple data sources as if they were one and only one database,
Create database viewpoints (mostly read-only viewpoints for now), to provide your users with a true database interface while keeping them away from data they may not have access to,
Enjoy a rich set of relational operators and even define your own high-level and domain-specific ones.

Alf is very young and not all of the advanced features are stable and/or documented. I plan to spend some time in the next weeks and months working on them, so stay tuned. In the meantime, you can play with Alf on this website, or install Alf 0.15.0 and start playing with it on your own datasets and databases, in shell or in Ruby. I'll come back with advanced material on this blog as soon as possible, I promise.

The rest of this post explains the context of this work and why it exists in the first place, in the form of a (very accessible) scientific paper (this writing style is also a test, let me know what you think). The next section provides a short overview of the proposed approach, explaining the title of this blog post. We then detail Alf's proposal, first with a short example illustrating the advantages compared to existing solutions, then with a more theoretical presentation covering three main questions: why true relational algebra, what type system to expose, and why not classes and objects. Alf's limitations and features to come are then briefly discussed, before concluding.

Yet another database connectivity library?

We already have ARel, Sequel, SQLAlchemy, Korma, jOOQ and probably hundreds of similar projects for connecting to databases from code. Do we really need one more?

Well, Alf is a database connectivity library, but it is first and foremost a proposal for a new kind of software/database interoperability, or a paradigm shift if you want. This paradigm is called Relations as First-Class Citizen and it makes Alf different from existing approaches. The difference lies in the kind of data abstraction exposed to the software developer:

Call-level interfaces (e.g. JDBC) expose SQL query strings and database cursors (e.g. java.sql.ResultSet),
Higher-level SQL libraries, such as ARel, Sequel and jOOQ, expose SQL queries as well. However, they abstract them behind abstract syntax trees (ASTs) and algebra-inspired manipulation operators,
Object-Relational Mappers (ORMs) expose classes and objects together with the SQL/AST interface they generally rely on (e.g. the symbiosis between ARel and ActiveRecord),
Alf and Axiom expose Relations (i.e. sets of tuples) and relational algebra. For those interested, I'll discuss some differences between Alf and Axiom later in this blog post. In the meantime and unless stated otherwise, what is said about Alf applies to Axiom too.

In this blog post, I'm going to compare Alf with the second category above, i.e. high-level SQL-driven libraries. Not because the Relations as First-Class Citizen paradigm cannot be compared to, say, Object-Relational Mapping but because, at first glance, Alf shares a lot more with those libraries than with ORMs. First things first, then: let's start by looking at those similarities and (sometimes subtle) differences. We start with a motivating example in the next section before moving to more theoretical arguments in the one immediately following.

Motivating example

This might appear rude or offensive, but I need to start by complaining about existing approaches and libraries (why would I work on Alf in the first place otherwise?). Sequel is used in this blog post but the situation is similar with all the libraries I mentioned previously. I've chosen Sequel because I commonly use and actually love it. No offense is intended, therefore, even if I claim, in essence, that things could be improved.

My main complaint is that, despite providing closure under operations, existing libraries fail at providing a truly composable way of tackling data requirements. To understand why, let me take a concrete software engineering example on (a slightly modified version of) the suppliers and parts exemplar. We'll use the following suppliers and cities relations:

suppliers:                                     cities:
+------+-------+---------+--------+            +----------+----------+
| :sid | :name | :status | :city  |            | :city    | :country |
+------+-------+---------+--------+            +----------+----------+
| S1   | Smith |      20 | London |            | London   | England  |
| S2   | Jones |      10 | Paris  |            | Paris    | France   |
| S3   | Blake |      30 | Paris  |            | Athens   | Greece   |
| S4   | Clark |      20 | London |            | Brussels | Belgium  |
| S5   | Adams |      30 | Athens |            +----------+----------+
+------+-------+---------+--------+

Let's suppose that the suppliers themselves are the software users and that the following requirements must be met by the particular interface showing the list of suppliers to the current user:

A supplier may only see information about the suppliers located in the same city as himself.
The supplier's status is sensitive and should not be displayed.
The country name must be displayed together with the supplier's city.

In terms of the query to be built, those requirements involve a restriction (same city as), a projection (no status) and a join (with country name). If you are supplier S3, the list of suppliers you see looks like this:

+------+-------+-------+----------+
| :sid | :name | :city | :country |
+------+-------+-------+----------+
| S2   | Jones | Paris | France   |
| S3   | Blake | Paris | France   |
+------+-------+-------+----------+
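To make the target concrete, here is a minimal plain-Ruby sketch of the two relations and of the expected result. It assumes relations are modelled as arrays of hashes and uses neither Alf nor a database; `visible_suppliers` is a hypothetical helper introduced for illustration only.

```ruby
# Plain-Ruby sketch (no Alf, no database): relations as arrays of hashes.
SUPPLIERS = [
  { sid: "S1", name: "Smith", status: 20, city: "London" },
  { sid: "S2", name: "Jones", status: 10, city: "Paris"  },
  { sid: "S3", name: "Blake", status: 30, city: "Paris"  },
  { sid: "S4", name: "Clark", status: 20, city: "London" },
  { sid: "S5", name: "Adams", status: 30, city: "Athens" }
].freeze

CITIES = [
  { city: "London",   country: "England" },
  { city: "Paris",    country: "France"  },
  { city: "Athens",   country: "Greece"  },
  { city: "Brussels", country: "Belgium" }
].freeze

# Hypothetical helper: the list shown to the supplier identified by `sid`.
def visible_suppliers(sid)
  requester = SUPPLIERS.find { |s| s[:sid] == sid }
  SUPPLIERS
    .select { |s| s[:city] == requester[:city] }      # requirement 1: same city
    .map    { |s| s.reject { |k, _| k == :status } }  # requirement 2: hide status
    .flat_map do |s|                                  # requirement 3: add country
      CITIES.select { |c| c[:city] == s[:city] }.map { |c| s.merge(c) }
    end
end

visible_suppliers("S3").each { |tuple| puts tuple.inspect }
```

Called for supplier S3, this yields the two Paris tuples of the table above. Note that it is still a monolithic computation: the three requirements are hard-wired into a single method.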

Struggling with reuse and separation of concerns

Writing a monolithic query is rather straightforward. Using Sequel for instance:

requester_city = ... # from context (authenticated user)

DB[:suppliers]
  .natural_join(:cities)
  .select(:sid, :name, :city, :country)
  .where(:city => requester_city)

# => SELECT sid, name, city, country
#    FROM suppliers NATURAL JOIN cities
#    WHERE (city = ...)

In software involving complex requirements, relying on monolithic queries is unfortunately not always possible and/or desirable (otherwise, creating database views would simply be enough). Two main reasons explain this:

The same requirements tend to apply to various and independent software features. For instance, the first two requirements above might apply every time a list of suppliers is shown, while the third one might not. Complex requirements generally call for a design that achieves both separation of concerns and reuse.
Complex software also involves context-dependent requirements. For instance, the first requirement above might be relaxed for administrators (say, suppliers with status greater than 30).

This explains why connectivity libraries and their SQL utilities exist in the first place: because of the need to build queries, often at runtime and according to some context. There is a desperate need for more support for this in DBMSs themselves. In the meantime, developers rely on the abilities of host programming languages and third-party libraries.

Back to our example above, what about the following "design"?

# Meet 1) and 2) together as a utility method: separation of concerns
def suppliers_in(city)
  DB[:suppliers]
    .select(:sid, :name, :city)
    .where(:city => city)
end

# Meet 3) as a utility method: separation of concerns
def with_country(operand)
  operand.natural_join(:cities)
end

# Meet them all: composition and reuse
requester_city = ... # from context
with_country(suppliers_in(requester_city))

Wrong. The original, and correct, SQL query was:

-- Give the id, name, city and country of every supplier located in city ...
SELECT sid, name, city, country
FROM suppliers NATURAL JOIN cities
WHERE (city = ...)

The new one seems similar, but is wrong. As shown below, we lost the country in the process:

-- Give the id, name and city of every supplier located in city ..., provided
-- the city is known in `cities`
SELECT sid, name, city
FROM suppliers NATURAL JOIN cities
WHERE (city = ...)

What happened? In short, Sequel's join does not correspond to an algebraic join of its operands. Instead, its specification looks like "adds a term to the SQL query's FROM clause", whose data semantics is far from obvious (here you can blame SQL itself). Observe in particular that the following algebraic equivalence does not hold in Sequel, preventing us from using the design above:

suppliers
  .natural_join(cities)
  .select(:sid, :name, :city, :country)
<=!=>
suppliers
  .select(:sid, :name, :city)
  .natural_join(cities.select(:city, :country))
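For contrast, the sketch below checks the same equivalence when projection and join are implemented as genuine set operations. It is a toy model under stated assumptions: a relation is a Ruby `Set` of hashes, and `project` and `natural_join` are illustrative helpers of my own, not part of Sequel or Alf. The equivalence relies on both projections keeping the join attribute (`:city`).

```ruby
require "set"

# Assumption: a relation is modelled as a Set of Hashes (one Hash per tuple).
# Keep only the listed attributes; duplicates vanish because the result is a Set.
def project(relation, attrs)
  relation.map { |t| t.select { |k, _| attrs.include?(k) } }.to_set
end

# Pair up tuples that agree on all attributes they have in common.
def natural_join(left, right)
  left.flat_map { |l|
    right.select { |r| (l.keys & r.keys).all? { |k| l[k] == r[k] } }
         .map    { |r| l.merge(r) }
  }.to_set
end

suppliers = Set[
  { sid: "S1", name: "Smith", status: 20, city: "London" },
  { sid: "S2", name: "Jones", status: 10, city: "Paris"  },
  { sid: "S3", name: "Blake", status: 30, city: "Paris"  },
  { sid: "S4", name: "Clark", status: 20, city: "London" },
  { sid: "S5", name: "Adams", status: 30, city: "Athens" }
]
cities = Set[
  { city: "London",   country: "England" },
  { city: "Paris",    country: "France"  },
  { city: "Athens",   country: "Greece"  },
  { city: "Brussels", country: "Belgium" }
]

# Left-hand side: join, then project.
join_first = project(natural_join(suppliers, cities),
                     [:sid, :name, :city, :country])

# Right-hand side: project each operand, then join.
project_first = natural_join(project(suppliers, [:sid, :name, :city]),
                             project(cities, [:city, :country]))

puts join_first == project_first
```

With these definitions both sides denote the same relation, country included, which is precisely what the utility-method design above silently assumed.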

Join is a striking example of the problem at hand, but others exist that involve different operators. Let me insist on something: the same is true with ARel, Sequel, SQLAlchemy, Korma and jOOQ, to name a few. The fact is:

SQL has not been designed with composition and separation of concerns in mind,
Avoiding strong coupling between subqueries tends to be very difficult in practice,
Coupling hurts separation of concerns and software design.

To be fair... There is a way to use SQL (and, sometimes, those libraries) so as to avoid the problem described here. It amounts to using SQL in a purely algebraic way. Unfortunately, that way is not idiomatic and leads to complex SQL queries that may have bad execution plans (at least in major open-source DBMSs). In the example at hand, using Sequel's from_self in a systematic way (e.g. on every reusable piece) is safe from the point of view of composition and reuse:

def suppliers_in(city)
  DB[:suppliers]
    .select(:sid, :name, :city)
    .where(:city => city)
    .from_self
end

def with_country(operand)
  operand
    .natural_join(:cities)
    .from_self
end

requester_city = ... # from context
with_country(suppliers_in(requester_city))

# SELECT * FROM (
#   SELECT * FROM (
#     SELECT sid, name, city FROM suppliers
#     WHERE (city = ...)
#   ) AS 't1'
#   NATURAL JOIN cities
# ) AS 't1'

The complete recipe for using SQL in such a "safe" way is more complex, of course, but possible. I won't provide the details in this blog post; let me know if a dedicated one is welcome. For now, let's see how our new paradigm helps.

Relational Algebra to the rescue...

Libraries like Sequel and ARel offer closure under operations, meaning that you can chain operator invocations (e.g. operand.select(...).where(...).where(...)). Subtly enough, that does not mean they expose an algebra, because SQL is not itself a pure relational algebra (see later) and these libraries do espouse SQL in a rather faithful way.

In contrast, the Relations as First-Class Citizen paradigm aims at providing an interface that is designed for composition and reuse. To achieve this, Alf takes some distance from SQL and exposes a true relational algebra instead, inspired by Tutorial D. This makes a real difference, even if subtle. To convince yourself, I invite you to use Alf's Try console to check that the example below works as expected. As shown, the three requirements of our case study can be incorporated incrementally thanks to the true composition mechanism offered by an algebra. Commenting out a line amounts to ignoring the corresponding requirement:

requester_city = 'Paris'
solution = suppliers

# 1) A supplier may only see information about the suppliers located
# in the same city as himself.
solution = restrict(solution, city: requester_city)

# 2) The supplier's `status` is sensitive and should not be displayed.
solution = allbut(solution, [:status])

# 3) The country name must be displayed together with the supplier's city.
solution = join(solution, cities)
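The composition property can also be checked offline with toy stand-ins for the three operators. The sketch below is an assumption-laden illustration, not Alf's implementation: relations are Ruby `Set`s of hashes, and `restrict`, `allbut` and `join` merely borrow the operator names used above. It applies the three requirements in every possible order, then drops one of them.

```ruby
require "set"

# Toy stand-ins named after the operators above (illustration only, not Alf).
def restrict(rel, cond)
  rel.select { |t| cond.all? { |k, v| t[k] == v } }.to_set
end

def allbut(rel, attrs)
  rel.map { |t| t.reject { |k, _| attrs.include?(k) } }.to_set
end

def join(left, right)
  left.flat_map { |l|
    right.select { |r| (l.keys & r.keys).all? { |k| l[k] == r[k] } }
         .map    { |r| l.merge(r) }
  }.to_set
end

suppliers = Set[
  { sid: "S1", name: "Smith", status: 20, city: "London" },
  { sid: "S2", name: "Jones", status: 10, city: "Paris"  },
  { sid: "S3", name: "Blake", status: 30, city: "Paris"  },
  { sid: "S4", name: "Clark", status: 20, city: "London" },
  { sid: "S5", name: "Adams", status: 30, city: "Athens" }
]
cities = Set[
  { city: "London",   country: "England" },
  { city: "Paris",    country: "France"  },
  { city: "Athens",   country: "Greece"  },
  { city: "Brussels", country: "Belgium" }
]

# Each requirement is a function from relation to relation.
steps = [
  ->(r) { restrict(r, city: "Paris") },  # requirement 1
  ->(r) { allbut(r, [:status]) },        # requirement 2
  ->(r) { join(r, cities) }              # requirement 3
]

# Apply the three requirements in all six possible orders.
results = steps.permutation.map do |order|
  order.reduce(suppliers) { |rel, step| step.call(rel) }
end
puts results.all? { |rel| rel == results.first }

# "Commenting out" requirement 2 still yields a well-formed relation.
with_status = [steps[0], steps[2]].reduce(suppliers) { |rel, step| step.call(rel) }
puts with_status.size
```

In this toy model the order of the steps does not affect the result, and leaving a step out simply removes its effect. That independence is what makes each requirement reusable on its own.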
