Why I love PostgreSQL

Now, I’ve written some ranting about why I hate MySQL, now it is time to write about another Open Source RDBMS, PostgreSQL.

In this blog I am going through why I think PostgreSQL is a great database – comparable with the most advanced commercial database products.

I have worked mostly with DB2 and Oracle in my professional life. Although DB2 and Oracle are quite different, there are a lot of features they have in common, that you get to expect from a professional database.
Then when I started looking at MySQL some years ago, I got quite surprised to find that it lacked most of the data-integrity features that you would expect any decent RDBMS to have, and that the SQL dialect and supported data types were quite different from the ordinary.

As a freelance developer and an OSS enthusiast, I wanted a serious Open Source RDBMS and somehow found PostgreSQL. This was back when it was a version 8-beta something. The first PostgreSQL version to have a native windows install.

I will start with repeating some of the perhaps-obvious features that PostgreSQL has:

  • Lots of native data-types: integer types, arbitrary precision numeric types, time-, date- and timestamp types (with or without time zone), network types, geometric types, Booleans, arrays and user defined types. With no “surprises” in the implementation.
  • Views.
  • Triggers, stored procedures, user defined functions.
  • Transactions; 2 out of 4 standard isolation levels supported.
  • Indexes; on any column type and functional indexes…
  • Sequences…
  • Table spaces.

And worth mentioning, most of the features have been core features since the beginning.

Generally, compared to most other non-OS RDBMS in the market, PostgreSQL looks most like Oracle – if you are using Oracle already, switching to PostgreSQL is almost a no-brainer, unless you use partitioning or other Oracle specific features.

Now, I’ve written an entry called “Why I Hate MySQL” and I will start this entry trying to do a 1-1 comparison between the entries…

Foreign keys

First I will tell about foreign keys. A foreign key is when one entity (the child) has a reference to a key in another entity (the parent). The “reference to a key” in the child entity is called the foreign key, as it is a key to a foreign entity.

An example could be the Order – Order-Line relation in an order-entry system.

The Order (parent) entity contains all the details about the order in general: reference to the Customer, possible Delivery-Address, payment details, entry timestamp, status etc.

The Order-Line (child) contains details about a Product being ordered: reference to the Product, quantity, Size, Color, packing options, pricing etc. And a reference to the Order entity.

This is pretty basic stuff and you can implement it in just about any RDBMS. Even MySQL…

The fun starts when the RDBMS actually knows about the relationship. The RDBMS can then help you ensure that no Order-Line exists without a corresponding Order and that you cannot delete the Order without also deleting the corresponding Order-Lines. This is about data integrity; that you cannot have orphan children dangling in the system.

Only problem with foreign keys are that they can hurt performance and even be a source of hard to debug deadlock problems, as the RDBMS will sometimes have to lookup an Order when operations are done on an Order-Line and vice versa.

With PostgreSQL, RDBMS managed foreign keys are possible, and you have all the extended options with cascade-deletes etc. And PostgreSQL has had these for yearsÒ€¦

Transactions

Transactions are important; very important. If you are going to ensure some kind of referential integrity and try to maintain “controlled redundancy”, transactions are strictly necessary.

Take as example the classical accounting scenario: we have an Account entity with, among other things, a calculated balance. And we have a corresponding Account-Posting entity, which contains account history. Let us have some (PostgreSQL specific) DDL:

CREATE TABLE account (
  id serial NOT NULL,
  balance NUMERIC(10,2) NOT NULL DEFAULT 0,
  created TIMESTAMP NOT NULL DEFAULT now(),
  updated TIMESTAMP,
  CONSTRAINT account_pkey PRIMARY KEY (id)
);
CREATE TABLE account_posting (
  id serial NOT NULL,
  account_id INTEGER NOT NULL,
  amount NUMERIC(10,2) NOT NULL,
  created TIMESTAMP NOT NULL DEFAULT now(),
  CONSTRAINT account_posting_pkey PRIMARY KEY (id),
  CONSTRAINT account_posting_account_id_fkey FOREIGN KEY (account_id)
    REFERENCES account (id)
);

Lots of details are left out for clarity, but I think everyone gets the picture (and we have a nice example of a foreign key constraint as well πŸ˜‰ ).

Now, take the following sequence of events, trying to transfer “:amount” from “:src_account_id” to “:dst_account_id”:

INSERT INTO account_posting (account_id, amount) VALUES (:src_account_id, -(:amount));
INSERT INTO account_posting (account_id, amount) VALUES (:dst_account_id, :amount);
UPDATE account SET balance = balance - amount, updated = now() WHERE id = :src_account_id;
UPDATE account SET balance = balance + amount, updated = now() WHERE id = :dst_account_id;

This all looks good on paper… Now, just for the sake of discussion, assume that just one of the 4 statements fails! And we are in big trouble – at least one account will have a bad balance! (E.g. assume we have an RDBMS without foreign key constraints and that either src_account_id or dst_account_id refers to a non-existing account…)

This is where transactions come into the picture:

BEGIN;
INSERT INTO account_posting (account_id, amount) VALUES (:src_account_id, -(:amount));
INSERT INTO account_posting (account_id, amount) VALUES (:dst_account_id, :amount);
UPDATE account SET balance = balance - amount, updated = now() WHERE id = :src_account_id;
UPDATE account SET balance = balance + amount, updated = now() WHERE id = :dst_account_id;
commit;

Now, if anything goes wrong, all 4 statements are rolled back, and no accounts are modified.

This is possible with PostgreSQL…

Functions, Stored Procedures and Triggers

The ability to create “user defined functions” is an integral part of most modern RDBMS’. Basically it enables developers to place important logic as close to the data as possible.

This can be misused, but can also be a very important feature for ensuring data integrity. And for solutions where there is no “middle tier” or where the middle tier has poor transaction management, putting business logic into functions and stored procedures actually makes sense.

I don’t actually fancy neither too much – I prefer to work with an “Enterprise Ready” middle tier – but if you are forced to use PHP or the like, they can be a great help.

I have seen triggers misused greatly, where most business logic was placed in triggers – magic happens whenever you touch anything in the database, but no one can figure out how or why.

Triggers should never update anything but “its own” entity and perhaps its parent or children. Or to maintain historical or statistical information.

A typical example of a trigger is one that maintains “update” timestamps on entities, implementing “soft deletes” or performing non-trivial checks on data.

One of my favourite PostgreSQL triggers is this (7.4 dialect):

CREATE FUNCTION ts_update() RETURNS "trigger" AS '
BEGIN
 NEW.updated := ''now'';
 RETURN NEW;
END;
' LANGUAGE plpgsql;

It can be used on any table having a timestamp column named “updated”, e.g:

CREATE TRIGGER trig_update
 BEFORE UPDATE ON any_table FOR EACH ROW
 EXECUTE PROCEDURE ts_update();

PostgreSQL gives the developer the ability to create functions, stored procedures and triggers. PostgreSQL is born with SQL and its own “plpgsql” language. Other languages can be installed, e.g. tcl, Perl, JAVA, PHP, Ruby and Python. Even C can be used if you insist πŸ™‚

In PostgreSQL, a stored procedure can return a result-set, which can be used in e.g. a join. The following example returns all dates within the given period (PostgreSQL-8 dialect):

CREATE FUNCTION period(from_date DATE, to_date DATE) RETURNS SETOF DATE AS $BODY$
DECLARE
 CURRENT DATE := from_date;
BEGIN
 while CURRENT <= to_date loop
  RETURN NEXT CURRENT;
  CURRENT := CURRENT + 1;
 END loop;
 RETURN;
END;$BODY$ LANGUAGE 'plpgsql' IMMUTABLE;

Views

Views are like named sub-selects. They are great for a large number of uses:

  • They can help de-normalize and “humanize” heavily normalized data.
  • They can help hide soft-deleted entities.
  • They can give limited, even read-only, access to selected rows and columns to low-security users.
  • They can help generate aggregated and other “view-related” information for presentation purposes.
  • They are good for capturing often-used queries that are reused in different places of the application.

So, I really like views a lot. And I actually prefer to create views from queries that I have spent time creating and optimizing – they are also a great documentation tool.

And, in applications developed with a “weak” middle-tier, they can help reduce the amount of complex SQL hardcoded into the application.

PostgreSQL supports views. Almost any possible query can be turned into a view. Even joins with results from stored procedures can be used in views.
PostgreSQL also supports updateable views, further enhancing the soft-delete scenarios and making “time travel” implementations relatively easy.

Subselects

Okay, I’m not going into details here. Need I say the PostgreSQL naturally supports all the different kinds of sub-selects you will ever need?

Conclusion

There are lots of nice features in modern “Enterprise Ready” relational databases that I have not covered here. And there are lots of nice features in PostgreSQL as well.

Now, having worked mostly with Oracle the last couple of years, there are only 2 features I sometimes miss:

  1. Advanced partitioning. It is possible to implement something that looks like it in PostgreSQL, but the implementation leaves a lot to be desired, and there is far too much manual work involved.
  2. XML processing – having the database know about the structure of XML content, enabling you to do XPath searches, transformations etc.

As far as I know, XML support is planned in the coming version 8.3 release of PostgreSQL.

So, are you already an Oracle user and are looking for a good Open Source alternative, you should definitely consider PostgreSQL. It comes with native Windows installer that also gives you the absolutely amazing graphical pgAdmin tool that is far better than any GUI tool I have ever seen from Oracle πŸ™‚
PostgreSQL has a far smaller memory footprint and takes a lot less space on your harddrive (not that it matters too much these days). The windows installation is a few-minutes no-brainer!

If you need a web-based GUI tool, there is a great PHP based tool called phpPgAdmin.

PostgreSQL has only one thing against it – its name… It is a little hard to pronounce – at least for a Danish guy.

Update 2007-05-23: People start complaining that I compare PostgreSQL version 8 with MySQL version 4 and me not being fair. But I compare PostgreSQL in general with pre-5 versions of MySQL. And this is what is offered to me by hosting companies per se.

And, I’m not a marketing or product person. I’m a blogger and I blog about what I like and dislike. I like PostgreSQL (any supported version) and I dislike MySQL (any version at all).

About Jesper Udby

I'm a freelance computer Geek living in Denmark with my wife and 3 kids. I've done professional software development since 1994 and JAVA development since 1998.
This entry was posted in Databases, Open Source and tagged , , , . Bookmark the permalink.

7 Responses to Why I love PostgreSQL

  1. Stephen says:

    You complained about MySQL version less than 5 because “that’s what hosting companies offer”. Many hosting companies don’t offer PostgreSQL, either. I would like to see an article where you talk about the pros and cons of the most recent version of MySQL compared to the most recent version of PostgreSQL. All the MySQL vs. PostgreSQL debate seems to be based on old versions or unfounded assertions about new versions of MySQL. I just want to know honestly which one is better and why.

  2. admin says:

    Hello Stephen πŸ™‚

    I actually found one hosting company giving me PostgreSQL support. It is version 7.4.16 but still. Their MySQL installation is version 4.1.21…

    The company is “surftown” – http://www.surftown.com.

    I will blog about “newer versions” of MySQL when I’ve had the pleasure to actually use them in “production”.

  3. John says:

    MySQL – even the latest version – is a complete piece of shit folks. Not only does the implementation suck boar tits, but the license sucks worse. PostgreSQL is truly free, unlike MySQL. So all you MySQL fanbois need to STFU.

  4. mluttrell says:

    As i currently work with all of the db’s being discussed, it is worth mentioning that MySQL 4 and Postgres should be purposed differently. MySQL 4 is very fast and is great at serving db’s which support one user such as a website.

    PostGres may be too complicated for some users and doesn’t have the same performance.

    Weird enough I like MySQL, Postgres, Oracle, Hsql, etc all equally when they are used correctly.

  5. admin says:

    I agree – MySQL definately has its uses. But pre-5 versions are IMHO not “true” RDBMS’ nor “enterprise ready”.

  6. Mike McNally says:

    Does PostgreSQL support true clustered indexes yet? I know that there’s a “cluster” operation that can be invoked on an index, but that’s really not the same thing. For large scale apps (particularly web applications), where there are “account” entities and large growing populations of “owned” entities for each account, clustering on the “account” index effectively gives you separate tables per account as far as query optimization goes. When tables get really big, having a “clustered index seek” be the cornerstone of the queries helps out a lot.

    Seems minor, but to me that’s a deal-breaker. However maybe there’s some other way to achieve similar scalability, and I’m just ignorant.

  7. Robert C says:

    @Make NcNally:

    Interesting, I was wondering the same thing myself, which is how I stumbled on this article. Clustered indexes aren’t actually the fix-all that a lot of people think they are.

    I work a lot with M$ SQL Server, where of course they are used. For certain operations, they’re absolutely great, for others, less so.

    There’s two things to watch out for here: firstly, having a clustered index on a table actually increases the number of page reads needed to access data via a non-clustered (i.e. any other) index on that table, since the NCI contains the cluster key rather than a direct row pointer, as is the case if you’re based on a heap table.

    Secondly, if you cluster on a key that isn’t nearly always ascending over time, you can end up in serious pain. Because the clustered index keeps data in index order, inserting into the middle of the index range can lead to page splits, requiring a potentially large amount of data to be moved around.

    I take your point that if you’re largely retrieving data based on only one factor, clustering does speed things up (and can really help the cache locality), but it’s no silver bullet, and in your example, inserts could suffer as a result.

    Cheers,
    Rob

Comments are closed.