Data Access Objects (DAO)

From PKP Wiki
Revision as of 17:01, 1 March 2011 by (Talk | contribs) (Added a rule for methods that change data.)

Jump to: navigation, search


Before anything else you should make sure that you have a good understanding of data normalization. The DAO pattern works best for normalized data which also is very good in avoiding data corruption.

Some good introductory articles are (in that order):

The DAO pattern

Typical properties of implementations of the Data Access Object (DAO) pattern in the context of a Model-View-Controller (MVC) application as ours are:

  • The DAO maps the relational database model to an OO model thereby bridging the "impedance mismatch" of these two data modeling approaches. It also sometimes hides implementation details of specific storage technologies (e.g. different database vendors). The latter is often, at least partially, delegated to a special data access abstraction layer, though (i.e. ADOdb in our case or maybe PDO in the future). DAOs are designed to decouple the controller from the relational data model and from database implementation details.
  • The DAO usually implements a basic "Create-Read-Update-Delete" (CRUD) pattern to handle single instances of entity objects.
  • Most of the time additional use-case specific methods are required that batch-retrieve objects to improve data access performance, populate objects differently, e.g. with or without dependent objects loaded and deal with polymorphism.
  • The DAO returns domain objects (DO) that represent instances of data entities (e.g. articles, users, etc.).
  • DOs contain either only values or in more recent design paradigms (domain driven design, DDD) values & domain-specific behavior. In our case we mostly stick to the value-only approach but exceptions confirm the rule and there's no problem with that. We do not usually use the data transfer object (DTO) pattern in the sense of untyped objects created on the fly to represent use-case specific data. There may be exceptions to this rule but you should try to always return fully populated DOs for better re-use of DAO methods.
  • DAOs usually instantiate DOs and not the other way round except when implementing lazy-load (see below). If you can choose then you should avoid dependencies from the DO to the DAO to improve encapsulation.

DAO design problems and trade-offs

DAO design is not as straightforward as it may seem. The main challenges are caused by the "impedance mismatch" between the relational and OO data models:

  • How to map inheritance hierarchies (polymorphism) to database tables via DAOs?
  • How to map relations between objects (composition/aggregation) to the database (one-to-many and many-to-many relations) via DAOs?
  • How to trade off performance (runtime, memory) against code re-use/maintainability and complexity?


The relational data model does not implement polymorphism out-of-the-box. This means that as soon as DOs are polymorphic we'll no longer have a clean one-to-one relationship between database entities (aka "tables") and domain objects. An object-relational (O/R) mapping will be required. The three standard O/R mappings for polymorphic objects are: Single Table Inheritance, Class Table Inheritance and Concrete Table Inheritance. I won't describe these standards here - they are well documented on the web.

The standard pattern for an inheritance mapper is like this:

Inheritance Mapper.png

In the given example, the SubmissionFileDAO will often be the only DAO required to handle access to mixed lists of files. Whenever the implementation doesn't matter you'll use the methods of the domain object DAO to access objects from the whole inheritance hierarchy. The DAO will make sure that monograph files will be instantiated as MonographFile and artwork files as ArtworkFile classes. It will also transparently convert between implementations when necessary.

To achieve this, the SubmissionFileDAO will delegate to one of the two concrete DAOs when one of its methods is being invoked. It identifies the right DAO either by inspecting the type of the object (on update/insert) or by querying the base table that contains some type indicator (on read access, e.g. genre in the case of the monograph/artwork file distinction).

To avoid unnecessary database queries the SubmissionFileDAO can also prepare data via outer joins over all class tables to be passed into the concrete DAOs. The methods of the concrete DAOs do not have to correspond 1:1 to the methods of the "orchestrating" DAO. That's the reason why they are on separate inheritance hierarchies and why we distinguish between the SubmissionFileDAO and the AbstractSubmissionFileDAO in our example.

The inheritance mapper design does not prescribe a specific database design (see the three O/R mapping strategies above). We can use all common mapping techniques.

You can use one of the concrete DAOs in your handler but only if you really want to deal with that specific domain object type. If you implement a special artwork file editing form for example it may make sense to use the ArtworkFileDAO directly. Most of the time access through the domain object DAO should be preferred, though, as it will deal with most cases and is easier to extend and maintain.

Please keep in mind that not all inheritance hierarchies are as simple as this one. There are usually multiple children to one parent class. There is often not such a clear-cut design, then. You'll have to decide which parts of the inheritance hierarchy need their own "orchestrating" DAO managing "delegate" DAOs. You can also nest DAOs into each other so that what is an "orchestrating" DAO on one level appears as a "delegate" DAO from the viewpoint of a parent class.

Mapping Relations Between Objects

The problem of mapping relations between objects to the database and back also relates to the "impedance mismatch" between objects and relational tables mentioned before. Depending on the type of relation and the way in which such relations are being used there are several options to map those relations in your DAOs:

The following questions can help you to identify the right implementation:

  • To which class belong(s) the object(s) you return?
  • Do you implement a one-to-many relation and the related objects are only being used in the parent object? Or do you implement a one-to-many relation with an object that is (also) being used independently of the parent object or do you implement a many-to-many relationship?
  • Does the relation table contain only foreign keys or also additional attributes?
  • How expensive are simpler design options in terms of execution performance? What is the cost of additional complexity of different design options and what is the maintenance cost?

Where to put your method?

The first question to answer is where to put your method. You can potentially query objects in both directions of the relation - e.g. you can return an Author object with all Chapters of that author as dependent objects or you can return a Chapter object with all authors of that chapter as dependent objects.

The basic rule is:

  • For methods that read and return data: You always put your method into the DAO that corresponds to the return type of your method. In other words: If you return chapters with authors as dependent objects then you put your method in the ChapterDAO, if you return authors with chapters as dependent objects then you put your method in the AuthorDAO.
  • For methods that change data (update, insert, delete): You always put your method into the DAO that corresponds to the main entity being updated.

How to implement different types of relations?

Relations in which the related object is used by only one parent object

You can implement a single DAO that deals with both, the parent and the related objects at once.

By default you can query both objects in a single SQL joining their tables together and instantiate the related objects in a single load operation. If loading all options at once is too expensive then you can think of using specialized DAO methods: one that loads the dependent objects and one that doesn't (see below, performance optimization).

One-to-many or many-to-many relations of independent objects

You'll need two DAOs: one for the object being queried and one for each of the associated object types. The DAO of the main queried object will keep a reference to the DAO(s) of related objects.

You have basically three options to implement such relations in DAOs. The options are ordered by increasing performance (and complexity/maintenance cost):

  1. If you only want to access a few related objects then you can access them one-by-one which will keep all access code (SQL and object instantiation) in the DAO of the related object. This is the most simple implementation and if it doesn't cause any performance problems then it clearly is the preferred implementation due to it's simplicity and good potential for re-use of existing code.
  2. If this implementation is too expensive then you can load all related objects at once by implementing a corresponding method in the DAO of the related object. This implementation still keeps all SQL and object instantiation code for the related object in it's own DAO thereby improving encapsulation and code re-use. The disadvantage is that you'll have to implement specialized accessors that cannot be re-used as frequently as the basic accessors of the DAO.
  3. If this implementation is still too expensive then you can create a single SQL query that draws all required data together in the parent object's DAO and only delegate to the related object's DAO for object instantiation (calling its fromRow() method). This is the most performant method as it minimized the required roundtrips to the database. But this method requires much more complex code, breaks encapsulation and reduces re-use. It therefore only makes sense when you really have a noticeable performance problem to solve.

Relations as independent domain objects

All the above cases assume that your relation table does not contain any additional attributes but only the foreign keys of the entities it relates to each other. It's a frequent error that pure relationship tables are accessed through their own DAO rather than through one of the above mapping patterns.

It does happen, though, that relation tables themselves have properties. A good example are review assignments. While they relate submissions, review rounds and users (reviewers), they also contain a lot of additional properties like several review-related dates, status information, etc. In this case the relation table itself represents a domain object which is represented by it's own class or class hierarchy and will warrant an independent DAO. The only difference to "basic" entities is then that this domain object has a combined key.

When to load dependent objects?

You may have objects that point to too many other objects for them to be loaded immediately. Sometimes linked objects will link to other objects further increasing the amount of data to be loaded if you want to load all dependencies at once. It may even happen that you encounter endless reference loops leading to stack overflows or script timeouts.

This means that most of the time we need to make a judgment where to cut off the dependency chains. This can only be done based on specific use cases. In other words: We'll have to couple the DAO implementation at least to some extend to the controller and UI, at least in PKP's case where a fully-fledged O/R mapper is not an option.

As a basic rule we should populate as many compositions/aggregates by default as can reasonably be done without unduly hurting performance (execution time, server memory/CPU load, etc.). This means that we "avoid premature optimization" for the sake of better decoupling of the data access layer from the controller and better code re-use. It is obvious that the more dependencies are populated by default the more use cases can be served by a DAO method and the less DAO methods we'll have to write. It is wrong, though, to include data for which we have no use case just because there "might" be one in the future.

If you decide that you cannot serve all use cases by the same set of dependent objects then you can use one of the following strategies to break the dependency chain in a well-defined way:

Use-case specific DAO methods

We can create two different DAO-methods, one that retrieves a fully populated object and another that only retrieves an object with basic values set. This is often the cheapest, simplest and easiest to maintain solution. Of course this means that we introduce a certain dependency between the controller and the DAO but if the methods have at least some potential for re-use then this implementation is often the best. Code duplication should be avoided by sharing re-usable SQL snippets and object population code internally between different accessors.

Example: Implement a DAO method that retrieves an Author object with it's user group populated and another one that leaves the Author's user group null. The former calls the latter and only adds the user group. We can add an assertion to the DO, i.e. in this case Author::getUserGroup() that makes sure that the accessor is never called when the field has not been populated.

Lazy Load

We can initialize (parts of) an object at runtime when needed by depending on another DAO to retrieve the linked object when needed, i.e.

class Chapter {
  function &getAuthors() {
    $chapterAuthorDao =& ...
    return $chapterAuthorDao->getAuthors(...);

This makes sense when loading dependencies is expensive and they are not very often used. The pattern introduces close coupling between the domain object and other DAOs which potentially increases maintenance cost.

DAO inheritance

The most complex solution is that we can introduce subclasses so that we can use the general concept of an entity in use cases where we require less data and only instantiate the the more complex, specialized (and expensive) concept of an entity when really needed. See the inheritance mapper design pattern above for that case.

Please make sure, though, that the entities really implement a semantic concept that makes sense in "natural language", i.e. a concept that is not only introduced for performance but immediately makes sense to those who read the code and those who use the application.

The current specializations of Submissions into "ReviewerSubmissions" or "AuthorSubmissions" with their corresponding DAOs are anti-patterns - you shouldn't reproduce those! They couple the DAO to specific controller transactions, break encapsulation and good separation of concerns, produce duplicate code, do not improve performance and not even reduce complexity. They could be implemented much more elegantly with the other design options outlined above.

Common DAO Design Errors

Data that is not normalized

None of the reasons that may justify a deviation from third normal form (3NF) or similar normalization rules (e.g. big OLAP databases, no control over the database design, etc.) apply in our case. In an application like ours the advantages of 3NF outweigh by far the disadvantages.

Still we have got several examples of tables that are not correctly normalized. One example are tables that contain both, the file id and the submission id. The file id alone already implies a submission id (via the main submission file table in Oxx). Adding a submission id and a file id to a table means introducing the potential for data corruption. Experience shows that where data corruption is possible it will sooner or later occur. We already had plenty of bugs due to insufficiently normalized data.

DOs that instantiate themselves via their corresponding DAO

We had cases where a DO instantiated itself via its own DAO and copying this code across the inheritance hierarchy. This is not the same as the above lazy-load pattern! While under some circumstances it may be ok that a domain object lazy-loads other objects via their DAOs, it's not ok that a domain object instantiates itself via its own DAO.

Compositions that are not part of the concept of an object

Sometimes you may be tempted to introduce a dependent class into an object just for a single use case. You shouldn't do that. Always pass your design by the "natural language test". Does it make sense to say "my object has-a xyz object" (composition) or "my object is-a xyz object" (inheritance). If not then you should think again and find out where the information should live from a domain object model perspective.

It is especially bad to partially copy properties of other objects into your object rather than copying the whole object as a dependency.

See this for an anti-example:

class Author {
  function getLocalizedUserGroupName() {
    $userGroup =& $userGroupDao->...
    return $userGroup->get...();

This is wrong for two reasons:

  1. The user group name is a property of the user group and not of the author. It is wrong to say that "the author has-a user group name". It would only be correct to say "the user group has-a name".
  2. An author can be part of several (author) user groups. So copying over a single property won't work anyway.

If we allowed that then we'd also have to implement Author::getUserGroupRoleId(), Author::getUserGroupPath(), etc. all by re-instantiating the same DAO over and over again.

(This is real code by the way that was part of our code base...)

What's really the case is that the author belongs to (aka "has") user groups and not a user group name. So the correct "lazy-load" implementation would be:

class Author {
  function &getUserGroups() {
    $userGroups =& $userGroupDao->...
    return $userGroups;

See bug report 5231 for further examples and discussion of this error.

Coupling DAOs to the controller

A very frequent error is that DAOs are coupled too closely to controllers without any necessity. This happens when DAO methods are implemented for single use cases (transactions) only needed in a single controller and retrieving very specific data that could easily be retrieved by using more re-usable methods, too.

This also includes creating role-based DAOs (like AuthorSubmissionDAO) or DAOs that are fraught with concerns like transaction authorization that should be dealt with on a controller level.

Creating DAOs for relation tables

Relation tables are not entities in their own right as long as they do not carry their own properties. When a relation table contains nothing but foreign keys then it is not a candidate for a DAO. See the paragraph on many-to-many relationship mapping above for the right DAO-design in this situation.

Implementing methods in the wrong DAO

The rule that DAOs should only return objects of the entity type they are named after is very often being disregarded. Plenty of examples can be found in DAOs like the Reviewer/Author/SectionEditorSubmission DAOs that do much more than a DAO should.