Monday, October 24, 2011

Storing documents in a database using Entity Framework Code First

Our team has been developing a workflow application that allows our Board of Regents to submit requests for information to one or more Regents institutions and allows those institutions to respond in turn. There are a number of other requirements, but the one I'd like to focus on is the need to let the Board upload a document that one or more institutions must download, fill in, then upload as their response. We expose the application's data via an API that each institution can use with its own workflow applications to process requests for information internally. We've also developed a companion application that our institution will use internally and that we'll make available to the other institutions as desired. The applications serve a limited number of users and have relatively light data requirements, so we chose a database-centric approach for simplicity.
We're using Entity Framework Code First (actually database first, code first, but that's a different post). We've been through a couple of iterations on the approach, and I thought I'd share our lessons learned.

First Approach
Our first approach was to use a single table per type, holding both the metadata and the content for each type of attachment. In our case we have three different types of documents, referred to internally as attachments since that is how they are delivered via our web site: request attachments, response attachments, and a final response or "letter" attachment. The single-table-per-type approach lets us easily model the relationships between the main entities (Request and Response) and their related attachments (RequestAttachment and RequestLetter for requests, ResponseAttachment for responses). Even though the table schemas are identical, we segregate them into separate tables to keep the relationships and the models simple.
This approach is pretty clean. On the model side we can refactor the code into a shared base class that all of the attachment entity models derive from. It also fits our standard pattern, in which each entity has an artificial primary key (identity) ID column and a Version (timestamp) column for concurrency control. To be honest, we didn't give this a lot of thought because it seemed a relatively straightforward mapping.
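
Here's a minimal sketch of what that first approach might look like in Code First. The names and properties are illustrative, not our exact production types. The one non-obvious bit: the base class has to be kept out of the model, because otherwise EF would treat the attachment types as one inheritance hierarchy and map them into a single table with a discriminator column, rather than one table per type.

```csharp
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;
using System.Data.Entity;

// Shared base for the attachment entities. Metadata and content
// live together in the same row.
public abstract class AttachmentBase
{
    public int Id { get; set; }            // artificial identity primary key

    [Timestamp]
    public byte[] Version { get; set; }    // rowversion column for concurrency control

    public string FileName { get; set; }
    public string ContentType { get; set; }
    public byte[] Content { get; set; }    // the document bytes themselves
}

// ResponseAttachment and RequestLetter follow the same pattern.
public class RequestAttachment : AttachmentBase
{
    public int RequestId { get; set; }
    public virtual Request Request { get; set; }
}

public class Request
{
    public int Id { get; set; }

    [Timestamp]
    public byte[] Version { get; set; }

    public virtual ICollection<RequestAttachment> Attachments { get; set; }
}

public class AppContext : DbContext
{
    public DbSet<Request> Requests { get; set; }
    public DbSet<RequestAttachment> RequestAttachments { get; set; }

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        // Without this, EF discovers AttachmentBase and maps the derived
        // types into a single table (TPH). Ignoring the base keeps one
        // identically shaped table per concrete attachment type.
        modelBuilder.Ignore<AttachmentBase>();
    }
}
```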
In developing the first application, for the Regents, our model seemed to work very well. There were hints, however, that our choice might not be optimal. First, we only need the attachment metadata to render links to the attachments in our views, but because of the model we're forced either to maintain two models per table, one for the metadata alone and another for the entire row, or to load all of the data for each request.
Second, our "list" screens don't even reference the attachments, but our API needs a mechanism to "list" all requests to support synchronization between the main and client apps. Again, we had a choice: make the repository more complex, supporting different queries for different types of lists, or use the same queries everywhere and in some cases load extraneous data. We chose the simpler repository and eagerly loaded attachments along with requests for information. During development of the initial application this seemed like a reasonable choice because the attachments tend to be small, as is the number of requests per "list" screen. Better, we thought, to make one request that returns more than we need than to fire off many requests for small amounts of data.
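
In repository terms, the simple option looks something like this (again a sketch; the context and property names are the assumed ones from above):

```csharp
using System.Collections.Generic;
using System.Data.Entity;   // the lambda Include() overload lives here
using System.Linq;

public class RequestRepository
{
    private readonly AppContext context = new AppContext();

    // One round trip serves every "list" screen; the attachments ride
    // along with each request, content included, since the content is
    // just another column on the attachment row.
    public IList<Request> ListRequests()
    {
        return context.Requests
                      .Include(r => r.Attachments)
                      .ToList();
    }
}
```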

Cracks Appear
At the point where we started on the synchronization support in our API, it became apparent that this approach was not going to work. To reduce the chance of inconsistent data between the client and master applications, we minimize the amount of data stored in the client application, pulling details from the master application as necessary. To ensure consistency, client applications, at least ours, periodically synchronize the glue that connects entities in the client with their corresponding entities in the master application.
Unfortunately, our approach to attachments meant that whenever this synchronization occurred, all of the attachments for all requests were being eagerly loaded. The obvious fix was to back off to lazy loading, but that is less optimal for the master application, forcing it to issue more queries whenever a "list" view is generated. What we'd really like is to eager load the attachment metadata but lazy load the attachment content, since the content is only needed at the point where an actual download of a particular attachment occurs.
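
To make that trade-off concrete, here's a sketch of what backing off to lazy loading costs the master application, using the hypothetical types from earlier (RenderDownloadLink is a stand-in for the view code):

```csharp
using System.Linq;

public class RequestListDemo
{
    // Hypothetical helper standing in for the view rendering.
    private static void RenderDownloadLink(string fileName) { /* ... */ }

    public static void ListWithLazyLoading()
    {
        using (var context = new AppContext())
        {
            // No Include(): Attachments is virtual, so EF's proxies
            // load it on first touch. Fine for the sync endpoint,
            // which never touches Attachments at all.
            var requests = context.Requests.ToList();      // 1 query

            foreach (var request in requests)
            {
                // But a "list" view that renders attachment links now
                // fires one extra query per request row (classic N+1),
                // and each of those queries still drags the content
                // bytes along, because metadata and content are
                // columns of the same entity.
                foreach (var attachment in request.Attachments)
                {
                    RenderDownloadLink(attachment.FileName);
                }
            }
        }
    }
}
```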

Current Solution
The approach that we settled on (for now) is to use one table per type for the metadata and one shared table for all content. The per-type table holds the attachment metadata, including foreign keys to the content and to the related request or response (it acts as a join table with additional metadata). The shared table contains only the content value. The attachment metadata (join table) is eagerly loaded for most queries, but the attachment content is not; it is loaded only when handling a download request for a particular attachment.
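
A sketch of the revised shape, reusing the illustrative names from earlier (so not compilable alongside the first sketch): the metadata entity keys into both the owning request and the shared content table, and only the download path ever materializes the content.

```csharp
using System.Data.Entity;
using System.Linq;
using System.Web.Mvc;

// The single shared content table: nothing but the bytes.
public class AttachmentContent
{
    public int Id { get; set; }
    public byte[] Value { get; set; }
}

// The revised metadata entity: a join table with extra columns.
// The content hangs off a lazy (virtual) navigation property.
public class RequestAttachment
{
    public int Id { get; set; }
    public string FileName { get; set; }
    public string ContentType { get; set; }

    public int RequestId { get; set; }     // FK to the owning request
    public int ContentId { get; set; }     // FK into the shared content table
    public virtual AttachmentContent Content { get; set; }
}

public class AttachmentsController : Controller
{
    private readonly AppContext context = new AppContext();

    // List/sync queries eager load only the metadata:
    //   context.Requests.Include(r => r.Attachments)
    // The content is fetched only here, when a download is requested;
    // the Include() collapses it into a single query.
    public FileResult Download(int id)
    {
        var attachment = context.RequestAttachments
                                .Include(a => a.Content)
                                .Single(a => a.Id == id);

        return File(attachment.Content.Value,
                    attachment.ContentType,
                    attachment.FileName);
    }
}
```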

Some Notes

There are a couple of things you should be aware of before adopting this approach. First, if your data is larger than ours, it might be better to use either a hybrid database/file system approach or to investigate FILESTREAM storage in SQL Server. You can find a nice introductory article by Jacob Sebastian at SimpleTalk, and Paul S. Randal has a nice white paper on MSDN about using FILESTREAM data. Our solution fits into the sweet spot for table-based attachment storage; yours might not, and the extra complexity of a hybrid or FILESTREAM solution might be worth the effort. Second, all of this discussion is predicated on an ASP.NET MVC architecture, and particularly one supporting an API used to synchronize with external applications. I think the approach we are using is better than our original even for a single application, but we didn't really discover the original's drawbacks until we implemented the API.

Please let me know if you have any suggestions for how we could improve on this or if you've found it helpful.