Wednesday, December 31, 2008

Essential Software Concepts: Fail fast

Software development is hard. Not only is development itself hard, but the environment and hardware are not exactly fault proof. Let's not forget the users either: you can be sure they will do something you didn't think about. And then there's the human factor in deployment, database updates and the like. So one of the few truths about software is that things will fail; the question is what you do about the failures.


You generally have two (three) options for dealing with errors: you can try to make the system recover, or you can fail at once (or you can try to swallow the error and cross your fingers that it will work out in the end).


Unless it is essential that the system never goes down, I like to fail things fast. By that I mean: if something doesn't work properly, then fail the operation as soon as possible. Failing doesn't mean crashing the application; you should provide an informative message to the user. Failing like this has, as usual, both advantages and disadvantages. The main advantage is related to tracking and fixing bugs. One of the major efforts in fixing a bug is understanding where and why it happened. If you fail early rather than late, you'll most likely have an easier time tracking the bug down. You don't have to worry about the system having kept working in an inconsistent state, and there's a larger chance that the stack trace actually shows where the problem is located.
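
To make it concrete, here's a minimal sketch of failing fast with guard clauses (the Account/Transfer names are just made up for illustration):

using System;

public class Account
{
    public decimal Balance { get; set; }
}

public class AccountService
{
    public void Transfer(Account from, Account to, decimal amount)
    {
        // Fail fast: reject bad input immediately, close to the cause,
        // instead of letting it surface as a strange error much later.
        if (from == null) throw new ArgumentNullException("from");
        if (to == null) throw new ArgumentNullException("to");
        if (amount <= 0)
            throw new ArgumentOutOfRangeException("amount", "Amount must be positive.");
        if (from.Balance < amount)
            throw new InvalidOperationException("Insufficient funds for transfer.");

        from.Balance -= amount;
        to.Balance += amount;
    }
}

A caller can then catch the exception at an appropriate level and show the user an informative message, rather than discovering a corrupted balance much later.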


The main disadvantage, or rather the question to ask yourself, is: "Is failing early acceptable in production?" If you can't say yes to this, then know that you'll have a much more complex bug-tracking time ahead of you. It's not an easy question though. A strict fail-early policy can sometimes mean that users are unable to use important functionality. Furthermore, a system that keeps failing is not good for building customer trust. Perhaps in some cases it is better to fix the problem in the background. Whatever you choose, it is certainly not a black-or-white question.

Tuesday, December 30, 2008

Essential Software Concepts: Broken windows

The concept of broken windows is one of the concepts I remember best from the Pragmatic Programmer book. It is one of those simple concepts you need to be aware of as a developer. But even though it is known to most developers (at least in principle), it is still violated all over the place, just to crank out a little more functionality. The technical debt you build up that way is something you will pay for over a long time.


I'll explain the concept, but if you don't know it, you really should go ahead and read the Pragmatic Programmer book. A quick and good read.


The concept can be explained like this: Say you have a house. As long as the house stands firm, there's a good chance it will stay that way. But as soon as a window is broken, and nothing is done about it, the downward spiral begins. Another window is broken, vandals run free, and eventually you get to the point where restoring the house to its former self is more costly than just replacing it.


This certainly holds true for software projects as well. If you keep the architecture sound, continuously work to keep the implementation maintainable - by separating concerns, avoiding duplication, and so on - and fix problems as soon as you find them, you'll have a much larger chance of keeping things maintainable than if you don't.


Taking the time to implement something properly will benefit you in the long run (and I'm talking weeks and months here, not years). The additional time it takes to implement something properly is nothing compared to the cost of being bogged down in a poorly architected codebase. Making changes (again, we're talking weeks and months here) will otherwise be far more costly, and you don't need to pay that price.

Monday, December 29, 2008

Essential Database Practices: Parameterized queries

When you access a relational database the old-fashioned way (not ORMs or LINQ), you generally have three options: you can call stored procedures, you can send queries manually as a concatenated text block, or you can send queries manually as a parameterized query.

Options one and three are both valid ways of doing it, but if you ever create queries as a concatenated text block WITHOUT parameters, you need to stop. I'll explain why by showing what parameterized queries do for you.

You should use parameterized queries for three reasons:

  • SQL injection
  • Query plan caching
  • Query plan cache size

 

SQL injection

This is the most obvious one. It is hopefully known to all, as much focus has been put on it in recent years. If not, have a look at Wikipedia.

 

Query plan caching

Every time a query is executed in SQL Server, it goes through two main steps: parsing/compilation and execution. During the parsing/compilation step, SQL Server creates a plan for how it will run the query. Once this is completed, the plan is put in the query plan cache and the query is executed.


In the parsing/compilation step, before doing anything, SQL Server will look in the cache to see if a plan for the query has already been created. If it has, there is no need to create the plan again. Depending on the query and how many times it is run, you can save a significant amount of time when the plan is cached.


The cache check is done by a lookup in a hash table where the key is the entire SQL query text, formatting included. The reason I listed query plan caching as a benefit of parameterized queries is that the contents of the parameters do not matter, so running the same query with any number of different parameter values results in the same plan being reused each time. If you concatenate your queries manually, a new plan is compiled for each distinct call. Note that this is important to remember for the queries you create inside your stored procedures as well: use sp_executesql instead of plain exec to execute dynamic SQL, since it lets you add parameters to the dynamic query.
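
To illustrate from the .NET side (a sketch; the Customer table and the GetEmail helper are made up for the example):

using System.Data.SqlClient;

public static class CustomerData
{
    public static string GetEmail(SqlConnection connection, string name)
    {
        // Bad: a new query text (and thus a new query plan) per value,
        // and an open door for SQL injection:
        // string sql = "SELECT Email FROM Customer WHERE Name = '" + name + "'";

        // Good: one query text regardless of the parameter value, so one
        // cached plan is reused. ADO.NET sends parameterized text commands
        // to SQL Server via sp_executesql.
        using (SqlCommand command = new SqlCommand(
            "SELECT Email FROM Customer WHERE Name = @name", connection))
        {
            command.Parameters.AddWithValue("@name", name);
            return (string)command.ExecuteScalar();
        }
    }
}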

 

Query plan cache size

For each query plan created, the cache increases in size. Queries fall out of the cache when they are not reused, but this can take some time on a busy server. I don't think this usually has a big impact, but creating a large number of query plans does fill up the cache. That means they use an unnecessary amount of your server's RAM. Shame on you.

 

If you want to have a look at how much time the parsing/compilation step takes for a given query, run SET STATISTICS TIME ON before running the query. The actual cache can be accessed via the sys.dm_exec_cached_plans view.

New blog series: Essential this, Interesting that

I recently got the idea for a new form of blog post I think will be interesting. Basically it involves covering topics I find to be either essential or interesting within a particular area, and labeling them visibly as such.

 

I hope you'll find value in it.

 

Happy holidays :)

Monday, December 15, 2008

Good stuff on Botnets, DDos and Scripting

Rob Conery has a nice blog post titled "The Perfect Storm Botnet".

 

A good, fairly short read - a bit like kids' horror stories for computer geeks, except for the fact that these are true, of course.

 

So just watch out for all those *-injection attacks, kids; it's seldom pretty.

Sunday, November 30, 2008

Another "fagdag" done

We once again had another "fagdag" (a full day dedicated to learning) yesterday, and I came away really pleased. The content was good, but more than that, the discussions were really fruitful. I think we have some good things going there.

It was on a Saturday, so it didn't last into eternity, but here's what was covered if anyone cares:

  • Code Quality

  • Claims-based identity

  • Project demo: xxx Silverlight

  • BDD/TDD

  • Testing legacy code

  • Microsoft ORM Futures: Entity Framework V1/2, LINQ To SQL (lightning talk)

  • Windows Powershell (lightning talk)

  • Parallel computing in .NET 4.0/VS2010 (lightning talk)

  • Ninject (lightning talk)

  • XAML Power Toys (lightning talk)

  • Onion Architecture (lightning talk)

Saturday, November 29, 2008

@Twitter

Need to find out what all the buzz is about... You can find me at twitter.com/RuneSundling

Thursday, November 13, 2008

SCRUM in 5 minutes

Want to get the gist of SCRUM, but have only got 5 minutes?

Check out this pdf.

.NET 4.0 poster

Want a fancy poster showing you the contents of .NET 4.0?

If you haven't seen it yet, you can get it here.

Wednesday, October 29, 2008

.NET and Parallelism

I attended a really interesting session today about the current work being done with parallelism for the .NET framework 4.0.

And the future is bright! :)

You should see the session yourself, but I'll try to summarize what I got out of it anyway.

Threading and parallelism are important now, but will become ever more important in the future. Our processors have stopped getting noticeably faster, but they continue to become smaller and more efficient, which enables us to have more and more of them. Currently most of us are running with two cores, and quite a few with four. Hardware producers are saying that 80-100 cores are within reach for regular machines in the not too distant future. We absolutely need to design for this.

So we need to start taking threads ever more into account. But it is such a complicated field, with locking, contention etc.

The current implementation of threads isn't helping either. Create a thread and it takes 1 MB of committed memory. Try managing threads and following them with the current debugger support in VS - it's not easy.

First of all, the annoying Thread will be replaced by the much more enjoyable Task in 4.0 (Thread will live on, but Task should take over in terms of everyday use).

The much improved (compared to threads) Task API, the internal handling of tasks, and the new tools to support them and parallelism (in VS2010) look really cool.

I can't tell you exactly how tasks differ from threads yet, except that they are similar, but tasks have a more advanced API accompanying them and a different internal handling.

Consider the scenario of running a program on a two-core or four-core (or something else) computer. What's the best number of threads you could possibly have then? The answer is of course two on the two-core machine, four on the four-core, and so on. (Don't design your app to run with exactly two parallel execution paths just because you have a two-core computer; you want these things to scale when you run them on much more powerful servers.)

When you create a myriad of threads, you get a constant fight for resources and time. With tasks this is handled in a different way. Consider this simple scheme:

  • One global queue where all your tasks are created

  • One local queue for each of your processors


Once a task is created it is put into the global queue, and then moved into one of the processors' local queues.

Each individual processor then processes its queued tasks in LIFO fashion, as the last one added is the one most likely to have its data still cached. Once it is done with its own tasks, it steals (stealing is good here) in FIFO fashion from the other queues, achieving a few things:

  • None of your processors are standing lazily around

  • You will run into far fewer contention issues, since tasks are grabbed from opposite ends of the queues

  • Everything is executed in an orderly, efficient manner


So how do we use these tasks?

You create tasks in much the same fashion as you created threads. The difference is that you have quite a few more options in how they are handled: whether you want them to work together with the parent task, whether you want to crash the main thread if there is an unhandled exception on a child task, generic versions that return a particular type of data, and more. Check it out yourself.

You also get more structured assistance, which looks really nice. Instead of running, for instance, a standard foreach which handles each item sequentially, you can use the new Parallel.ForEach, which runs the iterations in parallel. This works great if the steps can run in isolation and don't depend on each other, but even if they do, you can use the API to signal a stop and have each iteration check whether stop has been set, in practice stopping much earlier. With LINQ you can just add AsParallel() to a query to create a parallel version of it.
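
A minimal sketch of the APIs mentioned (based on the Task Parallel Library and PLINQ as they later shipped in .NET 4.0; the exact API in the PDC bits may have differed):

using System;
using System.Linq;
using System.Threading.Tasks;

class ParallelDemo
{
    static void Main()
    {
        int[] numbers = Enumerable.Range(1, 100).ToArray();

        // A task instead of a raw thread; the scheduler decides where it runs.
        Task<long> sumTask = Task.Factory.StartNew(() => numbers.Sum(n => (long)n));

        // Run the loop body for the items in parallel.
        Parallel.ForEach(numbers, n => Console.WriteLine(n * n));

        // PLINQ: parallelize a LINQ query by adding AsParallel().
        long evenSum = numbers.AsParallel().Where(n => n % 2 == 0).Sum(n => (long)n);

        Console.WriteLine("{0} {1}", sumTask.Result, evenSum);
    }
}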

I really should go on and on.

And I have forgotten to mention the new Visual Studio tools, haven't I? There are a couple of new debugger windows which seem extremely helpful in visualising which tasks are currently running, which tasks have been created, what values the tasks have, what the stack trace of each task looks like, and which methods each and every task has hit. Hard to explain, but you'll like it!

This is just emerging, but this will make parallel life much easier in the future. I'll surely be looking much into this!

Windows Azure

As you're bound to have heard by now, Microsoft recently announced Windows Azure, the new cloud operating system, at PDC08.

I must say I'm really intrigued by this.

The possibilities are big. Think of the extreme costs you have for running and maintaining an application. For instance:

  • You need to be able to scale according to user interest

  • You need to be highly responsive - which means a server in Norway might work pretty badly when accessed from the US or Asia

  • You need to have a server park in at the very least one place just to host the applications (if you can afford the downtime an electricity failure would cause, or possibly the loss of data and hardware a fire/earthquake could cause)

  • You need failover database solutions (If manually recreating a backup won't do)

  • Some data might not be allowed to be stored in certain countries; you need to be sure that whatever solution you choose handles this properly


Instead of handling all this yourself, you could get into some sort of external hosting solution. And this is in many regards what MS is offering with Azure, but I bet a fair number of hosting providers are shaking now that MS's new plans have been introduced.



We are talking big, big scale investments in this. The tight integration with current MS software and ways of doing things (Visual Studio for instance) doesn't hurt in making this popular. And it's not like they don't have quite a bit of internal expertise in this already, for instance through running the Live services.

One of the most important drivers is the continuous expectation of ever more interactive and graphics-heavy software delivered at the blink of an eye over the internet. You can host this yourself, but I doubt the cost/value calculation will be on your side.

Another thing is scaling. With Azure, or cloud service offerings in general, you can scale up or down according to user interest. An application with user peaks that vary over time can be tough financially, since you can end up with much more hardware set up than you usually need.

One big issue is privacy. If you host things in the cloud, you put your data in Microsoft's hands. Are you cool with that?

As you have noticed, my detailed understanding of Azure is lacking; to be honest I haven't bothered focusing too much on it, there is so much else which is interesting here at PDC! I do believe though that this can become a big thing, and I'm expecting to have to consider it much more in the future - but for now I'll leave it to someone else.

Check out Azure here.

Wednesday, October 8, 2008

Does Microsoft completely miss the point of Agile Development?

I was slightly annoyed by a post in CIO about Visual Studio 2010. The content of the article is not that interesting, but a user comment annoyed me.

"Microsoft, unfortunately, continues to show that they in fact completely miss the point of agile software development. Agility is about simplicity of design, of process, of feedback mechanisms. It is also about open, community-based tools, frameworks, and standards. MS keeps offering hilariously bloated, complex, monolithic, closed, and expensive IDE "solutions" that worsen every problem they attempt to solve. Visual Studio is now, at more than 43 million lines of code (and counting), so counter to agile development practices that I must question its architects' sanity or motives. Is all of this bureaucratic bloat forced upon the VS team by clueless marketing drones? That might explain the continuing madness."

This guy has just completely missed the point. A few things:

Agile = open, community-based tools, frameworks, and standards.
Why?? Agile is (a lot more than this, but also) about using agile methodologies and practices to drive a project to success. Any software that can help in this endeavour is great, but all I care about is using the best software. Whether it is Microsoft or Thoughtworks that delivers my CI system is unimportant to me, as long as I get to make the choice. I can use an open source IDE, CI server, source control, build tool, test tool, etc. if I want to (and I often do), and that's all that matters.

MS keeps offering hilariously bloated, complex, monolithic, closed, and expensive IDE "solutions" that worsen every problem they attempt to solve.
Dude, Visual Studio is a great tool! Together with Resharper it really is a tool that rocks :)

That might explain the continuing madness
Keep up the madness! I, for one, can't wait to see what comes next.

Admittedly Microsoft wants as big a part of the pie as possible, and that can certainly lead to situations that are less than optimal, but just because not everything about Microsoft is good certainly doesn't make everything wrong.

Friday, September 26, 2008

NNUG Oslo




Just wanted to note that I've just been elected into the board of the Norwegian .NET User Group (NNUG) - Oslo.

I'm looking forward to the experience! Hopefully something good will come out of it for you too :)

Tuesday, September 23, 2008

Beware the enum-compilation!

I was recently made aware of a real eye-opener when it comes to how enums are compiled by the C# compiler.

When you compile a C# project, each enum reference is internally replaced with the corresponding value of the enum member. What that means is that in the IL code, the reference to the enum is replaced by its underlying integer value.

This can give you a problem if you redefine an enum that is used across multiple DLLs. Say you have an enum with two values, and you decide to swap their places in the enum definition. If you only recompile the DLL with the enum definition, the other DLL will behave the opposite of what is fairly natural to expect!

The enum tips of the day are thus (a quick sketch follows the list):

  • Always specify the value of each enum member in the definition rather than relying on the default order.

  • Never replace one value with another; always add new values with a number higher than the current top.

  • Always follow the points above. Would you like to try to debug a patch that created that problem?
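
A quick sketch of what that means in practice (the enum itself is hypothetical):

public enum OrderStatus
{
    Pending = 1,    // values fixed explicitly, never relying on declaration order
    Shipped = 2,
    Cancelled = 3   // new members get the next free number; existing numbers
                    // are never reused or reordered
}

Since the compiler burns these integer values into every referencing assembly, keeping the numbers stable means an old DLL and a freshly recompiled one will still agree on what each value means.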

Thursday, September 18, 2008

Definition of good quality solutions

Recently I heard a new definition for what a good quality solution should be like:

  • robust and stable

  • easily maintainable

  • good performance

  • good user experience

If asked I would have listed this somewhat differently, but my point is rather:

This must be the most obvious thing. Could this possibly be necessary to say?
(Note: Was not directed at me :))

Could someone possibly try to create ..
  • nonrobust

  • unstable

  • hard to maintain

  • lousy performance

  • poor user experience
.. code deliberately?


I doubt it.


You won't always be able to do it - there are always compromises (maintainability vs. performance, time constraints, etc.), but still.


There's not really much point to this post. I just wanted to make a point. Or something.

Wednesday, September 17, 2008

Boo... Worth the time?

One of my colleagues, Tore Vestues, has recently sparked an interest in Boo internally.

Boo is, for those who don't know it, just another .NET language. I like Ayendes definition:

Boo is an object oriented, statically typed language for the common language runtime with a Python inspired syntax and focused on language and compiler extensibility.

Today we had a voluntary Boo-night, which meant about 15 developers having a go at Boo for a while. Good stuff :)

First of all, I really don't know Boo (at least not yet), but there's one thing I totally dig about it - the way anything and everything can be put together at compile time. You get code generation at compile time, without all the runtime hazards of standard code generation. It really just rocks compared to everyday C# code.

For instance, you can define what a singleton is in one place, then just add a [singleton] attribute to your class to tell the compiler to compile it as a singleton. Or a better example Tore came up with: INotifyPropertyChanged. Anyone like writing that code over and over again? How sweet is it to get to write it once, and use it over and over again? I can think of lots of examples where I keep rewriting annoying, boring code, which isn't just boring to write; it provokes errors, and it makes the code much harder to read than a declarative style would.
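
For reference, this is the kind of C# boilerplate the INotifyPropertyChanged example refers to, repeated for every property on every bindable class (a standard implementation, nothing Boo-specific):

using System.ComponentModel;

public class Person : INotifyPropertyChanged
{
    public event PropertyChangedEventHandler PropertyChanged;

    private string _name;
    public string Name
    {
        get { return _name; }
        set
        {
            if (_name == value) return;
            _name = value;
            OnPropertyChanged("Name");
        }
    }

    protected void OnPropertyChanged(string propertyName)
    {
        PropertyChangedEventHandler handler = PropertyChanged;
        if (handler != null)
            handler(this, new PropertyChangedEventArgs(propertyName));
    }
}

With a compile-time macro, all of that could shrink to a single attribute on the class.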

So what are the negative things then?

First of all, Boo is a separate .NET language. It will compile to IL code, but you need to use it fully in its own assembly; you can of course not mix and match within one. In other words, you won't get C# code with Boo compile-time code generation. Unfortunately :(

Number 1 once again. This is the biggest problem the way I see it. The language in itself seems really cool. But what are the chances for mainstream adoption? Will you implement part of your application in Boo, or everything? Can you justify the cost of training all the developers on the project in the Boo way of doing things? (I don't know, so I'm really just asking questions.) Is it mature enough, and does it have proper tool support? (not yet)

I guess most of the initial issues with Boo come from the fact that it's just new (of sorts) - with the standard problems you have when something is new. There is not enough documentation or tool support, and there are not enough followers.


But aren't many of the concepts I just mentioned possible in C#, for instance? To some extent. You have AOP support through, for instance, PostSharp, which does its magic after your code has compiled. Now I don't really know just how big a difference the two approaches make (can someone please educate me?), but since it gets in after you compile, all your code needs to compile already - which removes the possibility of defining your own keywords, like "print" or something.

In conclusion, Boo seems exciting, but I need more time to conclude properly. Is it worth the time? Might be. I know I definitely need to have a closer look.

Sunday, August 31, 2008

Handling the object-database-problem (Presentation)

A few days ago we once again had a "fagdag" - a full day dedicated to learning (and a splendid evening building team spirit afterwards of course).

I held a presentation about the different ways to handle the object-database issue, focusing mainly on the positives and negatives of the various approaches, thus, I believe, providing more value than giving any sort of tutorial on tools or approaches which can easily be googled.

The goal of the talk was to make sure that all my colleagues were up to speed on what is available in this area today, and what should be used when.


Here's a brief overview of what I talked about.

The issue is the difference between objects and relational databases. Databases focus mainly on storage and (more or less) normalized relationships through keys, while objects focus mostly on the interaction between objects in a business domain. Objects typically have the concepts of inheritance and polymorphism close at hand, while databases only handle these if they must, and definitely not naturally. The approaches from the object and database sides certainly don't go hand in hand, neither technically nor conceptually.

So which options do we have for handling the object-relational impedance mismatch, and what are the pros and cons of each approach?

I covered the following approaches:

  • Manual SQL

  • Stored procedures/views

  • Dataset

  • Active Record

  • OR-mapper (self-made)

  • OR-mapper (open source/commercial)

  • Code generation


As well as a few concrete implementations of active record and OR-mapper:
  • NHibernate

  • Linq To SQL

  • Entity Framework

  • Castle Active Record


One concept that is important to have in mind when comparing many of these options is Persistence Ignorance. This means the degree to which your domain model is freed from any reference to, or concern about, how the entities should be persisted. A large degree of Persistence Ignorance means you can model your domain without any restrictions or interruptive information. Note that this does not correlate with how easy the solution is to use in general.

Manual SQL

Often used with datasets or "manual" OR-mapping.

Pros
- Easy to use

Cons
- Can be hard to handle / easy to lose the overview
- You need to write lots of code
- No indication to the DBA what accesses the database
- SQL-injection
- Query plan caching - if you run a query against SQL Server without using parameters properly, the entire query text is cached as one plan, whereas if you use proper parameters, the query excluding the parameter values is cached - meaning the cached plan is reused for any number of calls with different parameter values.

Stored procedures / Views

Often used with datasets or "manual" OR-mapping.

Pros
- Easier for the DBA to know what accesses the database
- Well known
- Great for batches and reports
- Could be used well as a layer of abstraction
- Security, if you can't handle it on the application layer, or when several applications must access the database and you need central handling of it. You should have a good reason for doing this though.
- Tuning of queries

Cons
- Can be time-consuming to specify everything as stored procedures
- If business logic starts leaking into your stored procedures you're in trouble
- Switching database type. Not usually an issue, but if you are planning on changing databases anytime soon, don't model everything in stored procedures.
- I believe it's more hostile to change than many of the other options.
- Error messages and debugging are less clear
- (Testing. This is used by many as a reason against stored procedures. I don't agree, because stored procedures are really quite easy to unit test, as long as you keep them clear and concise and don't go super-dynamic on them)

Dataset

Pros
- Rapid Application Development(RAD) support
- Very quick to get up
- Support throughout the .NET framework
- Microsoft product. Easy to get a customer into
- Quick to get going, and you can be sure that everyone involved in a project will know of it.
Cons
- Scales very poorly with complexity
- Often need to live with utility-classes for functionality
- The database schema often leaks down into the code, frequently even into the GUI-binding. Try changing anything there..
- Not exactly an object oriented approach
- Everything must be cast (stored as object)
- String-based access to table and column-names (Unless you're using typed datasets)
- Since they're so easy to use and can be filled anywhere, some solutions even fill them in the codebehind. Oh, beware

Active Record

Row object (entity) with domain logic and knowledge of how to persist itself. 1-1 mapping towards a database table.

Pros
- Easy to get started
- Not duplicated mapping of property names as in OR-mapping with mapping files
- Mapping against database in object. Nice to have everything in one spot
- Centralised connection to the database in one layer
- Always works with entities. Automagic queries to the database

Cons
- Works well with an uncomplicated schema (1:1). As soon as your domain model won't map well to the database schema, Active Record won't do for you.
- Data centric. Active Record says your objects need to be the same as your database tables. In simple approaches this works well. But as I mentioned briefly in the beginning, what you care about when you model the database is often very different from what you want and need when you model the domain model. Thus an active record approach can produce a domain model which is unclear and bothersome to work with.
- Hard for the DBA to tune queries
- Not big on Persistence Ignorance (PI)
- Testing can be an issue because of the point above

OR-mapper

Automatic mapping between database and objects through mapping files or attributes

General
-------
Pros
- Domain centric approach. Can model the domain separately from the database (to some extent)
- Dynamic SQL-generation
- Support multiple databases (At the same time or simplify switching at a point in time)
- Good solutions have support for stored procedures and/or manual SQL in the cases where you need to do things on your own
- Fast development-time once up and running

Cons
- Complexity (will get back to it)
- Performance (will get back to it)


With mapping-files
------------------
Pros
- Separated object and mapping

Cons
- Must duplicate class name, property names etc.
- XML and pure text (helped by GUI support and advanced plugins)


OR-mapper developed in-house
----------------------------
Pros
- Own functionality
- Debugging can be easier / possible
- Know that everything that affects a project is internal
- Cool for the one doing it

Cons
- Time-consuming
- Need to redevelop something already developed a myriad of times
- No one will continue development and testing of your solution outside of the project/organisation
- Anyone new to the project won't know the solution

OR-mapper - commercial/open source
----------------------------------
Pros
- Relatively short startup time (compared to creating it yourself, not compared with most other approaches)
- Advanced functionality available
- Well tested in projects and scenarios
- Free bugfixing and new releases
- New project members might know the solution
- User group with knowledge of its use

Cons
- Does take time
- Can't debug on errors (Might be possible)
- (Generally) limited by the product's limits
- Learning curve
- If it's not Microsoft, it might be an issue in your organisation (hope not)
- What if the product goes out of production or development stops?

Code generation

Generate data access and domain objects from database

I don't know this area too well myself, so I'll be short here

Pros
- Advanced generation scripts
- With so many solutions available, there must be something good?
Cons
- Data-centric
- Many of the same cons as with OR-mapper, in terms of complexity etc.



And then it's the concrete implementations:

NHibernate

- Most known and mature
- Port of Hibernate, which is well known and has existed for a long time
- Large crowd of followers
- Domain centric
- Functionality
-- Mapping files
-- Inheritance (all three approaches)
-- Stored procedure support
-- All types of relationships (1-1, 1-*, *-*)
-- Caching
-- Support multiple databases
-- LINQ to NHibernate
- Persistence Ignorance (PI). You only need to make everything virtual and have a default constructor (of any visibility); see the sketch below

I believe it's the de-facto OR-mapper in .NET at the moment.
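
As a sketch of what that PI requirement means in practice (a hypothetical entity; the mapping file is not shown):

public class Customer
{
    protected Customer() { }            // for NHibernate; can be any visibility

    public Customer(string name)
    {
        Name = name;
    }

    public virtual int Id { get; protected set; }        // virtual so NHibernate
    public virtual string Name { get; protected set; }   // can create lazy proxies
}

No base class, no interface, no attributes - the persistence details live in the mapping file.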

Entity Framework

MSDN: "Designed to provide strongly typed LINQ access for applications requring a more flexible object-relational mapping, across Microsoft SQL Server and third-party databases"

- Enterprise solution
- Several layers of mapping (Conceptual, Storage, Mapping)
- Is supposed to be more than an OR-mapper (doesn't deliver on this in V1)
- Poor in terms of Persistence Ignorance
- Not very mature

But I'm positive about Microsoft's entry into this arena. The fact that they no longer believe datasets can solve every problem in the world is a big step forward. Also, they have taken all the criticism to heart and have started fixing many of the shortcomings for v2. They also have a blog describing the work and asking for feedback, as well as an expert group following the progress. That is good stuff. I still hope I don't have to work with it yet because of some fancy marketing slide, though. But hopefully it does become a good contender for the OR-mapping crown; what could be better than better tools?

LINQ to SQL

MSDN: "Designed to provide strongly typed LINQ access for raplidly developed applications across the Microsoft SQL Server family of databases."
- Good for RAD-development
- Support only direct 1:1 mapping towards the database
- Only table-per-hierarchy inheritance
- Only attribute-mapping when you use the 08 designer
- Only SQL Server (though this apparently wasn't because of any technical challenge)

Castle Active Record

Pros
- Simple to get started
- Created on top of NHibernate, taking advantage of all it has to offer, as well as making it much simpler to get started
- No need to specify the column name in the attribute if it is the same as the property name

Cons
- Standard problems with PI for Active Record
- Common to inherit from base class (although you don't have to)
- Data-centric
- 1:1

I created a couple of diagrams to help compare the four approaches in terms of Persistence Ignorance and OR-mapping in general:



Friday, August 22, 2008

When is refactoring inappropriate?

In a recent meeting, our own self-proclaimed agile guru (not saying I disagree!), Odd Helge Gravalid, made a good point about when refactoring can be really troublesome.

If you are working on a "legacy project" - a system in production, lots of ugly code, no structure; in other words, refactoring heaven - refactoring can actually be dangerous. Say you refactor lots and lots of the code, and at the same time someone else needs to do a fix and introduces a bug in the code. If this is patched into the live system and you try to find out what went wrong, you have a problem.

Why? Often you can simply start out by doing a code comparison in these cases, to see what has been changed. Try doing that after a big refactoring. (Refactoring here normally includes removing unused code, etc. Not really refactoring, but for lack of a more complete word...)

I'm not saying that you shouldn't refactor in these cases. Just be aware of the risks involved. And hopefully you write a complete test suite for the code before you refactor it, making it a lot easier to track down any problems with both the refactoring and any new functionality. (Remember: Typemock + legacy code testing = true)

Thursday, August 21, 2008

MSTest ExpectedException doesn't support System.Exception

This isn't an informative post, as I like to keep them, but MS just got me really annoyed.

The MSTest ExpectedException attribute, where you specify the exception you expect to be thrown, does not support System.Exception, only derived exceptions!

How crazy is that?

Obviously you should throw as specific an exception as possible (and I follow that; it must be why I haven't seen this before), but it's perfectly valid to throw a System.Exception if a more specific one is not possible. Hey, it's even the one MS recommends in every .NET 2.0+ certification... (ApplicationException before that)
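
A sketch of the behavior described (a hypothetical test class using standard MSTest attributes):

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class ExpectedExceptionTests
{
    [TestMethod]
    [ExpectedException(typeof(InvalidOperationException))]
    public void Passes_with_a_derived_exception()
    {
        throw new InvalidOperationException();
    }

    [TestMethod]
    [ExpectedException(typeof(Exception))] // fails: MSTest rejects System.Exception itself
    public void Fails_with_System_Exception()
    {
        throw new Exception();
    }
}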

Tuesday, August 19, 2008

Refactoring of the persistence solution in agile projects

I invited my fellow colleagues to a discussion today about a topic I felt could make for some interesting learning: refactoring of a persistence solution in an agile project.

By that I mean: when/how/if do you refactor away the solution you have for communicating with the database? The general possibilities we arrived at were:

  • You can drop the whole solution and create a new one after a few years. Evidently this happens a lot more often than you’d expect.

  • You hide the refactoring in a new big task, often accompanied by some cool buzzword

  • You try to isolate the old solution away, and start fresh with new functionality

  • A few more I’ll cover below


I was surprised that few had been in a situation where replacing the persistence solution was deemed necessary. I don't have any illusions that a non-developer(-strong) organization could see the real value in doing something like that, but I thought I'd get more examples of projects where it was the only solution. Say you run the MSDN-friendly, dataset-heavy application which works great in the beginning, gets the GUI up quickly, and has a few initial iterations delivering great functionality and a GUI which rapidly fills with functions. Great stuff! Until you reach a level of complexity which just breaks the design completely down. What do you do then? A few possible solutions:

  1. Explain to the customer that you need an iteration or two to redesign the application to accommodate the newfound complexity.

  2. Quietly include the refactoring within the subsequent iterations' tasks, greatly lowering the amount of work delivered to the customer

  3. Postpone everything, plan instead to get around to it once everything quiets down

  4. Have separated things properly, making it easier to change from one solution to another

  5. From the initial phase of the project, settling where you are going and the time it will take, use your experience to choose a better solution, perhaps NHibernate or Castle Active Record, and hopefully your domain model will never get into this extreme a situation (thanks to constant refactoring)

  6. Hah, the object-relational impedance mismatch is obsolete; object databases will save the world!


First of all, there is no single answer to any of this; as with everything in our business, it depends. It depends very much on what kind of solution you are working on and what kind of environment you are working in.

Obviously number 1 isn’t going to go down well with the customer – “you’ve had such a good flow so far, just keep it going” is not a surprising answer. And that’s quite understandable, and really correct in many cases. You should have applied constant refactoring to your solution, enabling you to steer clear of that situation. Of course, if you make a msdn-friendly dataset-app in the first place, you probably know that it isn’t going to handle the most complex of tasks, so you’ll never get into this situation. I bet it does happen though, and has happened more than I can imagine (I hope not).

Number 2 is the one you should have been doing all along, except now it will noticeably halt the current work. Perhaps that is okay. You could get to an 80% finished solution quickly, and ask if that is enough. If it is, then the customer will have gotten his money's worth of application quickly. If not, you can be clear to the customer that getting this and that will cost exponentially more than it did before. Because we all (should) know the Pareto principle, or 80-20 rule: 80 percent of the job takes 20 percent of the time (or cost), and vice versa. Again, the customer might not be too happy about that either.

Number 3: postpone doing anything until later. You’re already in a pretty dire situation, so postponing is not likely to do much to improve on that. If you don’t know what to do, then I guess just keep doing the same is probably the best – but if you get to that point, I don’t want to have anything to do with you anyway :)

Having separated things properly (4) will give you a good basis when you get into this situation. Hopefully it won't be too costly to make the change when tougher concurrency issues, more advanced business logic, etc. arrive.

As long as you didn't go with a big design up front, but leveraged experience to choose a clever framework to start on, good for you. I currently fall very easily into the Domain-Driven Design with NHibernate/Active Record camp (5), using a proper domain model and everything. Not nearly as fast as the dataset approach, but more applicable in the situations I'm usually in. (The dataset approach can be great though; don't overdesign a simple application - there's nothing with more framework support from top to bottom in .NET than datasets!)

I'm slightly sarcastic with the object database title here; I just have a feeling that someone who falls for one without having much experience with the other side of the table (relational) could proclaim it as the new(ish) silver bullet. Now, I'd love to try using an object database, because it seems you do get around many of the issues you need to consider and constantly work on with the traditional relational database approach. However, being in the real world, you mostly have to live in an existing business environment, where relational databases are the big G. Converting from the object database to a relational one at the end of the development cycle is a possibility, but beware of pitfalls; some intelligent colleagues found quite a few (perhaps they could write about it soon?).

You should always be aware of the total cost of ownership (TCO) of course, knowing that the initial development of an application is only a small part of the entire cost of it, maintenance being the biggest thief. But you know, in the real world, the one funding or driving the application development isn’t necessarily the one paying for the maintenance, so… Hopefully you’ll have someone who has a bit more professional integrity than that, but it’s no wonder shortcuts on that side are taken.

One of my colleagues, Sverre Hundeide, came up with a nice summary. The contents should be clear to all, but the conclusion has value. An OR-solution can interfere with your system to varying degrees:

  • Persistence ignorance - nothing in your domain-model assembly references anything concerning the persistence solution used

  • POCO - for me, this is the same as the above, but he pointed out it has been defined as almost the same, except there can be some references in the assembly, extra metadata, etc.

  • IPOCO – The domain classes can inherit from a base class used for persistance

  • Code generation - the domain and persistence code is generated from the database, often highly coupled.


Basically, the higher you are on the list, the easier it is to change to another solution. Simple, but an important thing to have in mind.
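
A hypothetical illustration of the difference between the top and the IPOCO end of that list (the base class is a stand-in, not any particular framework):

using System.Collections.Generic;

// Persistence-ignorant POCO: no reference to the persistence solution at all.
public class Customer
{
    public string Name { get; set; }
}

// IPOCO-style: the domain class is tied to a persistence base class.
public class PersistentCustomer : PersistentEntityBase
{
    public string Name
    {
        get { return (string)GetValue("Name"); }  // plumbing imposed by
        set { SetValue("Name", value); }          // the persistence framework
    }
}

public abstract class PersistentEntityBase
{
    private readonly Dictionary<string, object> _values = new Dictionary<string, object>();

    protected object GetValue(string name)
    {
        object value;
        return _values.TryGetValue(name, out value) ? value : null;
    }

    protected void SetValue(string name, object value) { _values[name] = value; }
}

Swapping persistence solutions is trivially cheap for the first class and a rewrite for the second.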

I guess it's time for the conclusion now. Or it should have been, except I'm not quite sure what I covered. Just another blog post with a lack of well-arranged contents. You'll have to make do for now :)

Friday, August 15, 2008

SQL: Convert list of values to comma-separated list

Ever had a list of values in SQL, but wanted them as a comma(or anything)-separated list?

Just do something like this:

SELECT c.co_companyName,
substring(List, 1, datalength(List)/2 - 1) AS 'Employees' -- datalength/2 because of nvarchar; -1 strips the trailing comma
FROM co_Company c
CROSS APPLY (
SELECT e.em_employeeName + ',' AS [text()]
FROM em_Employee e
WHERE e.em_co_Id = c.co_Id -- correlate to the outer company
ORDER BY e.em_employeeName
FOR XML PATH('')
) AS Dummy(List)


If you had a company table and an employee table, you'd end up with a list of the companies, each with a column containing a comma-separated list of that company's employees.

By the way, apologies for how absurdly ugly this looks as plain text.

Wednesday, July 23, 2008

VS Add-in: Extending Sealed Classes (without Extension methods)

If you are ever in the situation when you have a .NET Framework class with most of the functionality you want, but not really doing everything to the extent you need, you have a few options:

  • You can inherit from the class and extend as you see fit

  • You can create a utility class to perform the extra functionality

  • You can use extension methods to add functionality to the class

  • You can create a new wrapper class which holds the class as an internal variable and create wrapper-methods for whatever methods you need from the class you are extending, and add any extra functionality you need.

Inheriting from the class is likely the best approach. Unfortunately, about 40% of the .NET 2.0 Framework classes are sealed, which of course means you can't inherit from them.

The second approach is becoming somewhat obsolete, as it can lead to a sort of functional programming which takes you away from proper object-oriented programming. It makes things just a bit harder to find and understand.

Number three, extension methods, is the new cool kid on the block. Add new methods just like magic to any class. The possibilities are enormous! The problem is that you can't do anything with the internals of the class, you can only add new isolated methods.

The last approach can be the way to go if extension methods won't do. This way you have complete control over what you want to be public, and you can extend and improve on existing and new functionality. There are two problems with this approach: number one is clarity and number two is identity. If you create an extension of the DataSet, for instance, it never is a DataSet. You can, in a way, get around the issue by inheriting from one of its base classes or interfaces though (using that as a common base).

Be sure to use it only when no other approach is better; the clarity/identity issues can be confusing for someone seeing the class for the first time. But I do think the wrapper/composition approach has some merit, especially considering the extreme number of sealed classes that exist. I made the add-in because I was annoyed to find that there was no support for doing something like this besides typing it all in manually. And if you have a lot of methods to wrap, that means a whole lot of time typing trivial methods, not to forget the comments you should add if you want it understandable. No way I'm doing that unless I'm really keen on wasting a lot of time.

That's why I've created a code-generation add-in for the last approach: a Visual Studio add-in to help extend sealed classes.



I'll begin by showing some results, and then I'll get to how and why it works/makes sense/you should use it. In this example, I'd like to extend the StringBuilder class. To do that I need to write something like:



Once I press SPACE or TAB at the end of this line, the add-in starts its work, using this StringBuilderExtender class to extend System.Text.StringBuilder. The result is this:




More examples here and here.


What happened is this (an abbreviated hand-written sketch follows the list):
  • Added an internal StringBuilder instance

  • Recreated all public constructors, methods, properties and fields, and made sure all of them use the internal StringBuilder instance.

  • Added any available comment from the StringBuilder class.

  • Changed return types from StringBuilder to StringBuilderExtender where necessary.

  • Added the Serializable-attribute, since StringBuilder is serializable.

  • Listed the interfaces StringBuilder implements in a comment next to the class, in case you'd like to implement the same.

  • Added the System.Text namespace to the using statements if it did not exist.
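
An abbreviated, hand-written sketch of the pattern (the real generated output wraps every public member of StringBuilder):

using System;
using System.Text;

[Serializable]
public class StringBuilderExtender // StringBuilder implements ISerializable
{
    private StringBuilder _inner = new StringBuilder();

    // Wrapped methods delegate to the internal instance...
    public StringBuilderExtender Append(string value)
    {
        _inner.Append(value);
        return this; // ...returning the extender where StringBuilder returned itself
    }

    public override string ToString()
    {
        return _inner.ToString();
    }

    // ...plus whatever genuinely new functionality you wanted to add.
    public void AppendLineTwice(string value)
    {
        _inner.AppendLine(value).AppendLine(value);
    }
}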


Why would you want to do this?

  • You really want to extend one of the sealed classes.

  • You need new functionality in a sealed class, and you either only have .NET 2.0, or extension methods just won’t do it.

  • You reuse existing functionality, making a quick browse of the code enough to understand the majority of what the extended class does. Compare that to reimplementing a larger part of the functionality of a sealed class.

Why would you not want to do this?

  • You are somewhat pretending to extend a class you cannot. This can get you into problems with equality and comparison. The commented interfaces do make it simple to extend any common interfaces though.

  • Potential performance hit. You do add some overhead, and internal performance tweaks through, for instance, Win32 code could have less effect.

More on the internals

In terms of what you can visually do
  • You need to specify the keywords for the add-in on the class line. The input must always be in the form specified above:

    • public class StringBuilderExtender SealedClassExtender(System.Text.StringBuilder)

    • or in other words

    • [visibility] class ClassName SealedClassExtenderKeyword(TypeFullName)

  • You can use both SealedClassExtender and the short form scx/SCX as keywords.

  • The TypeFullName is, obviously, the full name of the type.

  • You can also extend a class by specifying the path to the assembly. The form is then

    • public class StringBuilderExtender scx(System.Text.StringBuilder, C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\mscorlib.dll)

    • or

    • [visibility] class ClassName SealedClassExtenderKeyword(TypeFullName,FullAssemblyPath)


Apart from the list above it does a few more things in the background
  • Adds a default constructor if one isn’t specified in the base class.

  • With any method, property or field that returns the base type (StringBuilder), it returns the extended type instead (StringBuilderExtender). A new private constructor with the base type is added if necessary to fulfill this.

  • Static classes do not get an internal instance but just point directly at the class.

  • Lots more fun things to fix every special case.


If you do not specify a path, it will try to resolve where to load the type from by itself. It currently does this by first trying to load the type from memory, then specifically checking all assemblies in the current application domain, and then trying to resolve the name of the assembly from the type's full name. It will first look in the current runtime directory, then in the executing assembly's path.
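
A simplified sketch of that resolution order (hypothetical; not the add-in's actual code):

using System;
using System.Reflection;

static class TypeResolver
{
    public static Type ResolveType(string fullName)
    {
        // 1. Try types already loaded / resolvable from memory.
        Type type = Type.GetType(fullName);
        if (type != null) return type;

        // 2. Check every assembly in the current application domain.
        foreach (Assembly assembly in AppDomain.CurrentDomain.GetAssemblies())
        {
            type = assembly.GetType(fullName);
            if (type != null) return type;
        }

        // 3. Guess the assembly name from the type's full name and probe
        //    the runtime directory, then the executing assembly's path.
        return null; // give up
    }
}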

The comments are loaded from the XML documentation files that accompany most framework DLLs, as comments are not part of the information retrievable through reflection.

Be aware though: the add-in will rewrite the entire document, so only add using-, class- and potentially namespace-declarations; anything else will be overwritten.

It doesn’t handle everything though. Specifically it doesn’t do:
  • Generics

  • Events

  • Identity-handling (Trying to guess how you want the identity-part handled will probably just end up with a bad guess, you’ll have to do this however you want it. Methods for equals, GetHashCode etc. are thus not created.)

  • Limited attribute support.

  • It also currently loads all assemblies into the current app domain while running. This is just because of a time-constraint on my part, and I’ll fix that.


If you have gotten this far, send me a note if you see a use for it, just don't get it, or have any comments.

Usage

To use this, you first need to install it like any other add-in. Copy the files in the zip to the correct Addins folder, for instance \My Documents\Visual Studio XXXX\Addins. (If the Addins folder does not exist, just create it yourself) Once this is done, restart visual studio, and you can enable the add-in in the "Tools->Add-in Manager"-dialog.

Downloads

Download the 2005 version from here.
Download the 2008 version from here.

Wednesday, June 25, 2008

Entity Framework V2.0

Microsoft has started the work on Entity Framework V2.0 now. Apparently, this version will have a great deal more transparency during the process, making sure it ends up as something the users want.

Good stuff.

NHibernate, beware?

Update:

A petition against the current quality of Entity Framework has begun. Read it to get a brief overview of the issues, and add your signature if you agree.

Tuesday, June 10, 2008

PropertyNamesGenerator (source)

I recently wrote the post Replace strings for property names with type-safe version

This is the source code for the project it describes. I'm posting it because of a request, so hopefully you'll have use for it.

PropertyNamesGenerator

  • Generates type-safe versions of property/field names.

  • Generates C# or VB files

  • The only parser implemented is for parsing NHibernate mapping-files


The NHibernate parser handles regular classes as well as subclasses.
Another parser which will probably get there is for regular domain classes.

Warning: This is not production-grade software, but a simple project I made for myself to handle a current need! It should not be used as a guideline for good design, nor expected to work flawlessly. No such thing has been attempted.

You can download it here.

The zip includes the source, the current binary, and a sample bat-file to use it.

How to use it:
To run the generator, start it with something like:
  • "C:\path\PropertyNamesGenerator.exe" /lang:cs /files:"../../source/Domain/*.hbm.xml" /out:"Output"


An explanation of the various params can be found by running it with a -? switch. It prints:
Run like: PropertyNamesGenerator.exe /files:"path/*.hbm.xml" /out:"path/PropertyNames/" /lang:cs
Parameters:
/Files [Mandatory] - Must have both a directory and a search string.
/Out [Mandatory] - Must specify an output directory.
/lang [Optional] - Output-language. C# or VB. (C# is default)
-V - Verbose output.
Note: All directories can be relative.

It currently only supports searching for files in the directories explicitly specified, i.e. not subdirectories. This is a conscious choice.

To integrate it with your build, typically use a bat-file in the same manner as the Generate.bat example file included in the zip.

To integrate it with your build, you can add something like the line below to your Domain Pre-build event: (Project/Build Events)
call "$(SolutionDir)\..\Tools\PropertyNamesGenerator\Generate.bat"
Note that you will need a full path to the PropertyNamesGenerator.exe-file in the bat-file to use it like this.

It will generate files of the form [ClassName].[lang] in the output directory specified. Using a PropertyNames folder to put all of these in is recommended.
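
To give an idea, the generated file for a hypothetical Customer.hbm.xml might look something like this (the exact layout of the generated code will differ):

public static class CustomerNames
{
    public const string Name = "Name";
    public const string Address = "Address";
}

// usage: criteria.Add(Expression.Eq(CustomerNames.Name, "Rune"))
// instead of the string literal "Name"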

You will of course have to include these files in your solution to integrate them. Do this every time you have added and parsed a new hbm file.

Thursday, June 5, 2008

Persistence solutions and characteristics

The other day I invited my colleagues to a discussion about persistence solutions. The goal was to cover the various options for persistence in a software project, the pros and cons of each approach, and to increase everyone's knowledge about it.

The approach planned was to identify the possible solutions, identify important characteristics to help compare each solution, and to go through each possibility.

The different types we identified were:

  • Code generation

  • Object relational mapper (custom made)

  • Object relational mapper (commercial/open source)

  • Row data gateway (Which would be active record with domain logic)

  • Table data gateway (Dataset)

  • Stored procedures/views

  • Manual SQL

  • Object database


And the characteristics in no particular order:
  • Development time

  • Flexibility

  • Performance

  • Stability

  • Complexity

  • Refactoring

  • Expertise

  • Effect/limitation/intrusion on system

  • Rapid Application Development (RAD)-support

  • Attachment to data type

  • Amount of developers

  • Company demands/guidelines

  • What you are creating


I think it is best to start off by defining the characteristics; what they are and why they are important.
  • Development time

    • The time it takes to get your persistence solution ready to use and understood. This can either be the time it takes to configure and use an existing solution or the time it takes to develop your own.

  • Flexibility

    • The flexibility of the system in terms of handling the various storage challenges in a project. This also includes handling a system which inevitably grows and changes.

  • Performance

    • Obviously performance is important, right? This includes how the system performs in general, how it scales, how it can help handle special cases. A word of caution though: Even though performance is essential, don’t optimize prematurely. In any system of reasonable size, you will have special cases that will become bottlenecks. However, with a profiler and the possibility of fine-tuning those cases, you should be quite alright.

  • Stability

    • You need to know that the system won’t break down under normal usage or high peaks. Testing is essential for this.

  • Complexity

    • How complex is it to use and understand? How much help do you get from error messages? What do you do when you need to debug? Is it too complex for the average developer on the project, making people take (eventually unmanageable) shortcuts?

  • Refactoring

    • What sort of help do you get when the need for refactoring appears (often)? How hard is it to do changes?

  • Expertise

    • You always need to take into account the expertise and skill level of the people on the project. This is more important than selecting "the best" solution.

  • Effect/limitation/intrusion on system

    • What does it take to integrate the persistence solution with your overall system? Will your domain model be POCO? Do you need to inherit from a base class? Implement an interface?

  • Rapid Application Development (RAD)-support

    • What kind of GUI-support does the solution have? If you need something quick and dirty, what do you do? The winner here should be pretty clear (If not, you’ll soon find out)

  • Attachment to database data type

    • How close is it tied to database data types? Can it for instance handle the types of several databases?

  • Number of developers

    • How many developers the project has. This could influence the complexity of the solution, or the time available to get it up and running.

  • Company demands/guidelines/policy

    • Company policies/limitations are a part of normal life for a consultant, and often there is little you can do about it. The company might have policies against using anything open source, demand that all database access should go through stored procedures, etc.

    • What can you do about this? Unless you have a small project or are able to influence the architecture group, chances are you'll have to manage. I guess you can pray that the architecture group doesn't consist of a bunch of non-coding architects who select technologies from marketing slides.

    • Another point I need to raise here: If your company still lives in the "don't use open source" world, chances are that most decision-makers are a bit out of sync with what is happening in the .NET world today. Agreed, just a few years ago things were pretty thin in terms of open source projects on .NET, but today things have most definitely turned for the better.

  • What you are creating

    • The most important point! Make sure whatever you choose will work for your project. Unneeded complexity is expensive! But remember that going from one solution to another is often hard, so beware the danger of a project growing on a bad solution; you'll have a tough time getting out of it.


That should cover the basics of what is important to think about before choosing a persistence solution. Let's get to the actual solutions!

Code generation

This involves generating the data access and potentially partly the domain model from your database schema. You specify rules for how you want things generated. There are a couple of potential problems with this approach.
  • If the software stores the mapping in a binary format, you'll have no way to merge two differing changes, which in effect means you can only allow exclusive checkouts. In a multi-person project this is highly unproductive, and don't forget that you'll be unable to use branches, as you can't merge them later.

  • The other is that once you want to add a field or property to a domain object, you'll have to update the database structure and regenerate the files before you can use it. This does take some time. I'm not saying that most other approaches are much faster; the annoying thing is that you can't simply add properties during testing, for instance.

I have to admit that I have limited knowledge of this type of persistence solution, but I believe this approach could be successful. There is no reason why the general mapping between the database and domain objects should be any poorer than with a general OR-Mapper. It depends on what extra functionality is available and how everything is implemented. By this I mean the SQL generated, how the session is handled, lazy loading, how it affects the domain model and system, whether it supports queries or SQL/stored procedures for special cases, type-safe support for anything and everything, maintainable/understandable script files, and a few more I'm probably forgetting at the moment.


Object relational mapper (custom made)

An OR-Mapper is a piece of software that handles the problem of transforming your domain objects' data into its equivalent form in the database. Your domain model is made of objects and pointers; the database model is made of rows, columns, keys and relationships. These are very different ways of handling data.

By creating your own OR-Mapper you will have to handle the problem of mapping from the domain model to the database, for starters. This is just a small part of what you need to handle though, and to make it clear: there are slim chances that creating your own OR-Mapper will be beneficial for your project. Doing it will be time-consuming and error-prone, you will reinvent the wheel without need (see Object relational mapper (commercial/open source)), you will tie up your best resources for a substantial amount of time, and you will have to implement a lot of added functionality to make it usable. Oh, and I doubt any project will wait for the mapper to be finished before starting the rest of the development - which means some other form of temporary solution must be created. (An object database might be the best choice in such a scenario.)

However, as a colleague commented, it is probably a dream for most developers to do it. Why? First, it’s complex, so you’ll learn heaps by doing it. Second, you don’t have to relate to the business side, so you really just define most of the tasks yourself. Perfect or what?


Object relational mapper (commercial/open source)

This is the same as above, except you use an already existing solution from a vendor or open source project. I recommend this approach.

Why create your own OR-Mapper when great solutions already exist? If you choose a premade mapper you will (potentially) get:
  • Shorter development time – You will need time to configure, use and understand the solution. This is time-consuming as well, but far easier than creating your own. Getting started should also be fairly easy, even though the tougher concepts need more time.

  • There’s a chance that present/new developers have used the solution before

  • A solution tested in many projects, which means fewer bugs and more working features.

  • New versions at no development cost.

You need to live with a few risks though:
  • Harder to customize and debug. You'll have to spend time on strange error messages that can make little sense.

  • The company might stop development and support, or the developer base of an open source project could die out. If you create your own OR-Mapper, the main developer(s) could quit as well.

In general, getting something off the shelf is far cheaper than building it yourself, as long as your requirements are met. I believe that most of the time this far surpasses any positive effects of building it yourself.

Picking a mapper at random just by browsing some marketing slides is not the way to go. Make sure you thoroughly read specs and user feedback, or best of all, if possible: talk to people who have experience with what you are considering; if it is a popular solution they shouldn't be too hard to find.

There is a range of solutions in this field, but personally I've had the pleasure of working with perhaps the best known of them: NHibernate. NHibernate is an open source OR-Mapper, a port of the well-known OR-Mapper Hibernate from the Java world. NHibernate is a great piece of software; it has been used on numerous projects, and quite a lot of information is available through blogs and forums.

I'm not going to list all the reasons to use NHibernate; a quick search on the web should give you that. But a few things:
  • NHibernate lets you have (almost) POCO objects (you need to mark all persistable fields/properties as virtual; NHibernate subclasses your objects with a virtual proxy to give you lazy loading etc. - see the sketch after this list)

  • NHibernate has mapping files to map between your domain object and database. These enable easy modeling of inheritance, collections, etc.

  • It is mostly type-safe (with NHibernate Query Generator at least), except when you need to write advanced queries. (You could use something like my PropertyNamesGenerator though)

  • You can use Hibernate Query Language to create special queries in the cases where performance isn’t good enough.

  • Automatic lazy loading
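
A couple of these points are easier to see in code. Here is a minimal sketch of an NHibernate-friendly entity and an HQL query for a special case; the Resource class, its Text property and the session factory setup are my own examples, not from any particular project:

using System.Collections.Generic;
using NHibernate;

// All persistable members are virtual, so NHibernate can subclass the
// type with a proxy and give you lazy loading.
public class Resource
{
    public virtual int Id { get; set; }
    public virtual string Text { get; set; }
}

public class ResourceQueries
{
    // An HQL query for a special case; the factory is assumed to be
    // configured elsewhere (mapping files, connection string, etc.).
    public IList<Resource> FindByText(ISessionFactory factory, string text)
    {
        using (ISession session = factory.OpenSession())
        {
            return session
                .CreateQuery("from Resource r where r.Text = :text")
                .SetParameter("text", text)
                .List<Resource>();
        }
    }
}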

A few bad things as well:
  • There is some overhead involved, and performance has been noted as an issue on several occasions. I think this is more of a design issue with your average developer though. If you try to retrieve gazillions of rows from lots of tables, you can't expect it to be lightning fast. Proper table and index design, as well as good use of lazy loading, should get you well under way. For the special cases where you do get a performance issue - use a profiler to see what the issue is, use the query analyzer to look at the SQL, use HQL or manual SQL behind a well-designed layer to access the data, use DTOs to limit the data loaded and sent... There are plenty of possibilities.

  • NHibernate has some strange error messages. Before you get to know it well enough, you're bound to spend a few hours trying to figure out error messages that don't make much sense. I base this on experience with the 1.2 version, not the new 2.0 release.

If you don't want to go down the open source road, you've probably been (or will be) introduced to Microsoft's new OR-Mapper, Entity Framework. It's Microsoft, so it's bound to be good, right? (…)
Note: I have only read about Entity Framework, and thus my impressions (both good and bad) are of questionable quality, so make sure you do your own research before drawing any conclusions.
The pros:
  • Integrates nicely with LINQ (but Ayende has a project going to bring this to NHibernate as well)

  • Microsoft helps bring the concept of an OR-Mapper to public knowledge, which really is good

  • With Microsoft's size, there's a good chance that the project will continue. This depends upon how many use it, of course; it might get shut down like Microsoft's music service did.

  • Developers versed exclusively in the Microsoft world have something besides datasets and manual ADO.NET to use.

The cons:
  • They've built a completely new product - which means you would think they'd be able to leverage the experience of the already existing OR-Mappers. You need a good amount of resources to do that though, and for some reason it doesn't look like Microsoft has quite lived up to it.

  • Having read quite a few blogs about Entity Framework and spoken to Microsoft employees about it, I must say I'm initially skeptical, at least for enterprise development. According to one Microsoft employee, only about a third of Entity Framework's expected user group is in the enterprise software world; the rest are your simpler application developers. The needs are quite different.

  • One of the aspects that alarms me quite a bit relating to enterprise development and Entity Framework (or really any modern type of development - read: with source control) is that until recently the mapping files were unmergeable. (The XML created was put in a "random" order, so a small change could lead to big changes in a document.) I can't for the life of me understand how Entity Framework could be designed as anything but a play version if that wasn't an important design point from the beginning. Ayende had a post about a meeting he had with Microsoft about this point, and based on his reactions it seems likely that the team defended this decision. It seems Microsoft has improved this after the range of reactions to it, as mentioned here

  • Explicit lazy loading. You have to explicitly say that you want to load a lazy-loadable collection. I think this sounds mostly annoying, as you need to fill your code with logic for testing whether a collection has been loaded, and then explicitly loading it, compared to the NHibernate way of automatic lazy loading by simply using it. There is a good thing about it - you won't get unexpected database calls from the GUI layer because you forgot to load everything you needed, which could lead to a performance hit and other problems. I still don't believe that merits this solution.

Even though I’m skeptical about the current quality of Entity Framework, and would recommend using NHibernate instead, I’m positive to Microsoft’s general move into this realm. With their funds, future versions have the potential of becoming really useful, with hopefully seamless integration with the rest of the framework.

For now I'm mostly scared that I'll be put on a project where its use will be mandatory. Unfortunately, we're still in a world where non-Microsoft software is looked at with skepticism by many of the decision makers in companies.


Row data gateway (Which is Active Record if you have domain logic)

This is the same as having a gateway which gives you one object per row in the database. If you add domain logic to these objects you have what is called Active Record. I'm going to concentrate on the Active Record approach, as I can't see a good reason why you'd want a plain row data gateway in .NET. An Active Record is a domain object that handles persisting itself.

Active Record has the advantage of being simple and quick to implement. It is not hard to understand, and is a good way to make a quick prototype while retaining a domain model. It breaks down once your database gets complicated, and once you no longer have a one-to-one mapping between an Active Record object and a database table.

You can use Castle Active Record to do this. It is built on top of NHibernate. In terms of refactoring away from Active Record if the complexity increases, there is apparently a way to go from Active Record to a full NHibernate OR-Mapper solution automatically. I haven't tested this though.
(You'll probably end up with an NHibernate solution through Castle then as well. This is not a bad thing! Castle integrates very well with NHibernate through its NHibernateFacility, and easily allows for instance a Dependency Injection approach as well. I've written about it in a previous post.)
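
As a rough sketch of what the Active Record style looks like with Castle (the Post class and its members are made up, and the ActiveRecordStarter.Initialize call you need at startup is omitted):

using Castle.ActiveRecord;

// The domain object persists itself; the mapping lives in attributes.
[ActiveRecord("Posts")]
public class Post : ActiveRecordBase<Post>
{
    [PrimaryKey]
    public virtual int Id { get; set; }

    [Property]
    public virtual string Title { get; set; }
}

public class PostExample
{
    public static void Run()
    {
        var post = new Post { Title = "Broken windows" };
        post.Save();                  // inserts or updates the row

        Post[] all = Post.FindAll(); // one object per row in Posts
    }
}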


Table data gateway (Dataset)

In the .NET world, table data gateway is the same as the dataset-approach.

The major benefit of using datasets is the unmatched framework support, where creating a DataGridView and data source, using databinding, adding a navigator to handle paging, etc., is extremely simple and powerful. For quick demos, or Rapid Application Development (RAD), nothing can match it.
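
A minimal sketch of why it is so quick; the connection string and table name are hypothetical:

using System.Data;
using System.Data.SqlClient;
using System.Windows.Forms;

public class CustomerForm : Form
{
    public CustomerForm()
    {
        // Hypothetical connection string and table.
        var adapter = new SqlDataAdapter(
            "SELECT * FROM Customers",
            @"Data Source=.\SQLEXPRESS;Initial Catalog=Shop;Integrated Security=True");

        var customers = new DataSet();
        adapter.Fill(customers, "Customers");

        // A couple of lines of databinding gives you a working grid.
        var grid = new DataGridView { Dock = DockStyle.Fill };
        grid.DataSource = customers.Tables["Customers"];
        Controls.Add(grid);
    }
}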

There are two main problems with this approach
  • It really doesn't scale. Once you start adding business logic, you'll have problems with the lack of object orientation, the lack of type safety, duplication of logic, and lots more.

  • Expectations, if you use it for prototypes. If you give the business side a quickly built, running demo with this approach, you'll get into problems when you try to explain how long the real thing will take to build.

If you know you are creating a very isolated, not to be extended, solution, by all means use the dataset approach – nothing can match it in speed or simplicity. If there’s a chance you need to add more to it later, opt for another alternative. You’ll have a hard time refactoring it later on.


Stored procedures/views

Putting everything in stored procedures/views is another approach that has been used. I'm not going to bother saying much about it, as it's not really a viable alternative. You'll be better off using this approach than manually concatenating information into SQL queries though (like avoiding SQL injection attacks).

It is a possible approach if you have special cases where you just can't meet the performance demands without using stored procedures.


Manual SQL

Don't bother. But if you have to, at least limit its use to a database layer. And make sure you use parameterized queries instead of concatenating input, so you avoid SQL injection.
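
A minimal sketch of such a gateway method; the query and names are made up, and the point is that the user input travels as a parameter instead of being concatenated into the SQL string:

using System.Data.SqlClient;

public class UserGateway
{
    public static bool UserExists(string connectionString, string name)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT COUNT(*) FROM Users WHERE Name = @name", connection))
        {
            // The value is bound as a parameter; no manual escaping.
            command.Parameters.AddWithValue("@name", name);
            connection.Open();
            return (int)command.ExecuteScalar() > 0;
        }
    }
}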


Object database

An alternative quite unlike the rest. Whereas the previous sections concentrated on ways of working with a relational database, you also have the option of using an object database. If you choose this approach you can just pass your objects to and from the database.
Again, I haven't tested this, but I have very experienced colleagues who have little but positive things to say about the approach (there's a small sketch at the end of this section).
The pros of this approach includes
  • No need to map between a relational design and a domain model

  • Don't need to update a schema in several places

  • According to some benchmarks, they can be superior for certain kinds of tasks. It has been said that they are very efficient at specific queries, while they are slower at more general queries.

  • Most of the object databases also support some sort of query language when the need arises

  • Some even fully support SQL, but I have no idea how this works in practice.

The cons
  • Practical knowledge of these is still fairly limited

  • Hard to access from other parts of the company network, for reporting purposes for instance

One approach some of my colleagues took was to go through the whole development period with an object database, then convert to an NHibernate solution before going live. The conversion to a relational database was only done because of company demands. This was still a success, but if you do this you run the risk of not quite knowing how long the final solution will take to set up, or exactly how it will perform.

I look forward to testing this in a real world project, you should too.
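
I haven't used one myself, so take this as a sketch of the programming model rather than tested code, but with db4o the round trip looks roughly like this (API details as I understand them from its documentation):

using Db4objects.Db4o;

public class Pilot
{
    public string Name;
    public Pilot(string name) { Name = name; }
}

public class ObjectDatabaseExample
{
    public static void Run()
    {
        // The "schema" is just your classes; no mapping files needed.
        using (IObjectContainer db = Db4oFactory.OpenFile("data.db4o"))
        {
            db.Store(new Pilot("Michael"));

            // A native query: an ordinary predicate over your objects.
            var pilots = db.Query<Pilot>(p => p.Name == "Michael");
        }
    }
}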


Concluding thoughts

In summary I’d say that you should
  • Use the table data gateway/dataset approach if you have a short and sweet application

  • Use Active Record if you need to get results fairly quickly, and have a close relationship between your table and domain structure

  • Find out whether a commercial/open source OR-Mapper, a code generation tool or an object database suits your needs best - if you need a somewhat complex application

  • Build your own OR-Mapper if you're forced to; it's not good for the project, but lucky you :)

Wednesday, June 4, 2008

Replace strings for property names with type-safe version

At times you have the misfortune of having to write property/field (I'll just call them properties from now on) names as strings, for instance in advanced NHibernate queries. (In general, NHibernate Query Generator avoids this for you in most cases, but there are still times it cannot be used - say with inheritance, for instance.) If you're a bit behind on upgrading your mocking tools, this might be a problem there as well (it shouldn't have to be anymore though…)

The big problem with non-type-safe strings is refactoring, which should happen all the time on your projects. Property names will change, and when these are used as non-type-safe strings in your system it has a few unfortunate consequences:

  • You could break something without knowing it. Worst case is you won’t find it until production

  • The knowledge above could restrain you from doing refactorings.

  • You will have to spend time doing text-searches in the system to find out if and where it is used.

So for a personal project I'm currently playing with, I figured I wanted to do something about it. I created a small generator project which generates static classes that give you access to the domain objects' property names in a type-safe manner. What this does for me is:
  • If I do changes to a property of a domain object, and it has been used as a (previously) non-type-safe identifier, my compiler will complain.

  • I can do refactorings all the time without worry.

For now it looks for instance like this in use:
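
A sketch of the call site; the generated class name (ResourceNames) is an assumption on my part:

// Assumed shape of the generated accessor: one static class per
// domain object, one constant per mapped property.
string name = ResourceNames.Text;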



This will return “Text”, which is the name of a property in the Resource class or a base class.

I’ll be the first to admit that this is not ideal. Property names as strings are not something you want to deal with, but at times you have no choice, and this feels like the better of the two options.

For those interested in how I chose to solve it:
The generator takes an input path, a search string and an output path. This is put in a bat-file which is called from the domain project's build. Currently I've only implemented a parser for NHibernate mapping files (HBM), which for each class creates a file with each property listed for that class. (I can see situations where you would want to use properties not just used for persistence, but that's all I need for now.) Inheritance is handled by duplicating the properties of the base class in the subclasses.
In short, the file created for the example above looks like:
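
Roughly like this; the exact generated layout is a sketch based on the description above:

// Assumed shape: one static class per mapped class, one constant per
// property, with base class properties duplicated into the subclasses.
public static class ResourceNames
{
    public const string Text = "Text";
}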

Wednesday, April 9, 2008

Regular Expressions in Visual Studio – the top-down approach

Basically every approach to regular expressions I’ve seen uses the technique of learning all the symbols first, then using that to create various regular expressions. Let’s call that the bottom-up approach. I figured I could add some value by doing it the other way around – by showing specific strategies where regular expressions make sense, showing the regular expressions needed, and then explaining how it works. In other words, the top-down approach.

Once you’ve read this post you should be able to use regular expressions to do pattern matching and extraction in Visual Studio. I’ve kept the number of regular expression meta-characters to a minimum, (hopefully) making it easily understandable as well. You’ll be far from an expert on regular expressions after reading this, but hopefully I can either help get you interested in regular expressions (You should!) or learn some tips about using it with Visual Studio.

I’ve tried to divide the contents into logical sections, so just skip whatever doesn’t sound useful.

On Visual Studio

The built-in regular expression support in Visual Studio (2005) is fairly strange - it doesn't quite follow conventional regular expression syntax, it doesn't follow the .NET Framework syntax, and irregular behavior has been found. It doesn't help that it seems extremely slow once you give it fairly large files either.

Despite its drawbacks, it has become the place I use regular expressions the most. And trust me, I like using regular expressions. So why Visual Studio? The probable reason is that I use it on a day-to-day basis, and since it's not uncommon that I use it for searching, I guess it just became natural to start using it for other tasks as well.

Oh, and by the way: there's lots of regular expression software out there you should give a try. Free, good software, with somewhat more standards-based meta-characters, probably faster, and likely to have more regular-expression-related functionality and various other improvements.

Enabling Regular Expressions

  • Regular expressions are used from the “Find and Replace” dialog

  • "Match case" also applies to your regular expressions

  • The "Find in files" "Find and Replace" dialog (ctrl-shift-f) enables you to search only certain file types

  • The "Find in files" "Find and Replace" dialog shows all results in the Find window, instead of just finding the next, as with the "Quick Find" dialog.

  • The most used regular expression meta-characters are available in the button next to find once you enable regular expressions.

I’ll add the actual dialog Visual Studio displays here just for reference. It describes each meta-character fairly well.



Regular Expression Strategies

General

Ever since learning regular expressions I seem to always find good places to use them. Here's the list of strategies (or common situations) I've covered in this post:
  1. Extract content from lines with a common pattern

  2. Remove empty lines (optionally with whitespace)

  3. Remove lines not following the pattern you are looking for

  4. Add characters around each line

  5. Retrieve the contents of all lines containing a pattern




Regex strategy #1 - Extract content from lines with a common pattern

Since this is the first example, I’ll walk through this one extra slowly

Description: You have a number of lines of contents that have some sort of similar text you want to extract.

Example: We have two lines of text, and want to extract the value of the target attribute in the XML.

<xml target="test" />
<xml info="" target="test2" />

Steps:
  1. Identify a unique string in the lines straight before the content you want to extract

    • target="

  2. Identify a unique string or character on the other side of the content

    • "

  3. The major part of the job is done; just create the regular expression using the first two points :)


The regular expression we begin with to match the parts we identified above (Don’t test it yet, I’m skipping some important details still):
  • target="[^"]*"


Splitting this up, we get three parts: target=" and " are just plain text matches, while [^"]* is different. If we take a look at the picture above, we can recognize two meta-characters:
  • [^] – Any one character not in the set

  • * - Zero or more

So [^"]* means: match a character which is not a ", zero or more times. Quantifiers are greedy by default, so it will try to match as many characters as possible.

To show this clearly, here is the regex again, together with the part of each line it will match.
The regex:
  • target="[^"]*"

Will match target="test" in the first line and target="test2" in the second:

<xml target="test" />
<xml info="" target="test2" />

We’re not quite there yet. First of all we need to escape all non-letter and non-number characters with a \ to make sure they are interpreted as plain text characters. I didn’t do that above as it would have made it harder to understand at first. We get:
  • target\=\"[^\"]*\"

Harder to read, but necessary. You’ll get used to it.

We’re now able to match the text we want in both lines. However, we’re still unable to retrieve the information from the fields. To do that we first need to match the entire line, so that we can remove the parts we don’t want. To do that we add .* to the beginning and end of the expression:
  • .*target\=\"[^\"]*\".*


The . (dot) is a special character which is interpreted as "Any single character" (Except line break). The regular expression will then read:
  • .* - Match as many characters as possible (up until the (last) target\=\" part of the text)

  • target\=\" - Match the plain text target=",

  • [^\"]* - Match as many characters as possible until we reach a "-character

  • " - Match the "-character.

  • .* - Match as many characters as possible. (Since this is the last one, it means to the end of the line)


With the above expression we match the entire line. The final part we need is a way of extracting the information we want. We’ll do that by adding { and }. This is basically a grouping construct, and you can have any number of them in your expression.
  • .*target\=\"{[^\"]*}\".*


We now have the entire regular expression we need! Let’s test it. Copy the example xml text into a text document and open it in Visual Studio. Open a "Find and Replace" dialog, and add the regular expression. Then add \1 in the "Replace with: " part. \1 means the part you have between your first { and }. In other words:
  • Find: .*target\=\"{[^\"]*}\".*

  • Replace: \1

The result is:

test
test2

Success! :)

Final comments:
Interested in how the regular expression engine actually does the matching? Let’s see what it matches for each part of the expression, until we reach the final match.
  • .* - Match as many characters as possible. This will actually match the entire line. Since the .-character doesn’t match a line break it stops at the end of the line (in other programs with other options, it is possible to make it match line breaks as well.) Remember, it is greedy, so it wants to match as much as possible.

  • target\=\" - Match the plain text: target=",. Now, to be able to fulfill this requirement, it has to "let go" of some of the matched characters. So it let’s go of one and one character until it finds that it can match the string. That’s why, if there had been several target="-parts in the line, it would have matched the last one.

  • [^\"]* - Match as many characters as possible until we reach a "-character. Match one and one character, until it reaches a "-character.

  • " - Match the "-character.

  • .* - Match as many characters as possible.

In fact, different regular expression engines do the matching in different ways, but this is all you need to know to understand how it works.
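
If you'd rather do the same extraction in C# code than in the Find and Replace dialog, the .NET flavor is close; the main difference is that groups use ( and ) instead of { and }, and \1 becomes $1 in replacements:

using System;
using System.Text.RegularExpressions;

public class ExtractTarget
{
    public static void Main()
    {
        string line = "<xml target=\"test\" />";

        // .NET uses ( and ) for groups where Visual Studio uses { and }.
        Match match = Regex.Match(line, "target=\"([^\"]*)\"");
        if (match.Success)
        {
            Console.WriteLine(match.Groups[1].Value); // prints: test
        }
    }
}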



Regex strategy #2 - Remove empty lines (optionally with whitespace)

Description: You have a document with several empty lines you want to remove.

Example: We want to get rid of the empty lines between the two words.

SomeText


SomeOtherText

Steps:
  1. All you really need is the following:

    • Find: ^:b*$\n

    • Replace:

The meta-characters used here means:
  • ^ - Beginning of line

  • $ - End of line

  • :b – Space or tab

  • \n – Line break

When splitting the regular expression into parts it can be read like this:
  • ^ - The match must start from the beginning of the line

  • :b* - It will be followed by zero or more (as many as possible) whitespace characters.

  • $ - The match must end at the end of the line. In other words, the line must contain either no characters or only whitespace characters to give a match.

  • \n – Finally we match the line break for the line as well.


This regular expression will match empty lines, including the line break. By replacing them with nothing you get the effect of removing the lines.

Final comments:
Make sure you have a line break after your last line, or it won't match the regular expression, as the regular expression requires a line break. Optionally you could have appended a * to the \n, making the expression ^:b*$\n*. This would have removed the need for the final line break.

Using Excel to sort the data is a good alternative to remove empty lines. That won’t work if you don’t want the data sorted of course.



Regex strategy #3 - Remove lines not following the pattern you are looking for

Description: With the apparent weakness of regular expression matching in Visual Studio, namely that you need to replace the entire line to retain the information you want (instead of getting the matches in a separate window), it is not uncommon to end up with lines that you need to remove to be able to focus on the lines containing the information you want.

Example: We got some more XML, this time with comments we want to get rid of:

<!--xml comment -->
<element>
<!--another comment -->
<subelement />

Steps:
The general strategy here is to do a (preferably) two-step action to
  • Mark the lines you don’t want with a special identifier

  • Remove all lines with the special identifier

The actual steps:
  1. Identify a unique string or recurring pattern in the lines you want to remove
    • <!

  2. Write the regular expression to identify and then add the identifier to the line
    • Find: ^.*\<\!

    • Replace: #\0

  3. Remove the lines containing the special identifier

    • Find: ^\#.*$\n

    • Replace:


With point 2, you want to single out all the lines you don’t need (Preferably in one operation, but not always possible). If you have a unique identifier across the entire document, then where you match in the string and add the identifier is unimportant. Often, placing it in the beginning or end of the line is a good starting point.

The first find operation (^.*\<\!) should be easy to read now, as there are no new characters. But for the sake of it: Match start of line, match as many characters as possible until we match the last set of <! in the line.

The replace operation is slightly different. # is the special unique identifier we have used here. It has no special meaning - you could have used an x, or three x-es for that matter, as long as it is unique for the entire line or position throughout the document. Whereas \1 meant the first marked match (marked by { and } on each side), \0 holds a copy of everything the expression matched. In effect this means that we simply add a # to the beginning of each line matched.

With point 3, we want to remove all lines with the identifier. So we find each line with the special identifier in the beginning, match the rest of the line with .*$\n, just as before, and then replace it with an empty string.

Final comments:
Another possibility is matching the lines you want to keep, tagging them with the special identifier, and then removing the lines not containing the identifier. Of course you will have to remove the special identifier in the end, so there’s one more replace involved. Not a big thing though.

A third possibility is to use strategy #5 – Retrieve the contents of all lines containing a pattern. This is most useful when you need to match across several documents.


Regex strategy #4 – Add characters around each line

Description: At times I seem to end up in the situation where I have lines of information that I need to surround with information or characters.

Example: We have a number of lines of information which need to be used in an SQL in-query as strings, and thus need the necessary surrounding characters. The SQL query is: SELECT * FROM something WHERE name in (…). For those unfamiliar with SQL syntax, we want to add a ' to the left of each line and ', on the right side. The information:

Test1
Test2
Test3

Steps:
As long as none of the lines involved already have the surrounding characters, or any characters that would invalidate the statement, all we need to do is:
  • Find: ^.*$

  • Replace: '\0',

The find simply matches everything on each line, from start to end character.
The replace puts our plain text characters on each side of the expression. The result is:

'Test1',
'Test2',
'Test3',

You’ll need to remove the last ,-character yourself, and paste it into the SQL query, making it:

SELECT * FROM something WHERE name in ('Test1',
'Test2',
'Test3')

Final comments:
Another common usage is creating SQL inserts. The logic is just the same; just add some other information around.



Regex strategy #5 - Retrieve the contents of all lines containing a pattern

Description: You want to get the contents of all lines matching a certain pattern, either because the lines make sense for themselves, or because you want to do further work on the lines. This can of course be for one or multiple documents.

This is an alternative to using strategy #3 (Removing lines not following the pattern you are looking for). Strategy #3 is probably faster when you work on a single document.

Example: Reusing the contents of a previous example. This time we want to retrieve all of our XML comments:

<!--xml comment -->
<element>
<!--another comment -->
<subelement />

Steps:
  1. The regular expression needed to find the correct lines isn't really the important part here, as what we want to show is how we go from there. We don't even need to use a regular expression in this case; just search for <!--. The important part is to do this search in the "Find in Files" dialog (ctrl-shift-f). The results listed should be:

    • C:\aPath\file.txt(1): <!--xml comment -->

    • C:\aPath\file.txt(3): <!--another comment -->

  2. Now copy the results from the “Find Results”-window and paste them into a document.

  3. We need to remove the path information. Use the following:

    • Find: ^[^\:]*.[^\:]*.

      • ^ - The match must start from the beginning of the line

      • [^\:]* - Match zero or more (as many as possible) non-:-characters.

      • . - Match any single character (here, the :-character)

      • [^\:]* - Match zero or more (as many as possible) non-:-characters.

      • . - Match any single character (here, the :-character)

    • Replace:

  4. Now we're left with only the lines we're interested in.


Where do you go from here?

If you really want to learn about regular expressions, you’ll need to get Mastering Regular Expressions (Jeffrey Friedl). The book is THE book on regular expressions.

If you for some reason don’t want to get the book, a series of 10 videocasts from Zain Naboulsi (Is this thing on) should be your second choice. Besides being well made, it focuses only on .NET, which Mastering Regular Expressions only does partly.