Friday, April 6, 2012

1c for the code, 99c for where you put it!

All engineers have probably heard this popular story at some point in their lives: An engineer is called in to fix a complicated machine, and fixes it in minimal time by simply replacing a screw. The engineer presents an exorbitant bill for the service and, when questioned by the customer, explains that the screw accounts for only 1% of the bill; the remaining 99% is for knowing which screw to replace.

It occurred to me that this story applies pretty well to the software engineering field. If I had a dollar to spend on a line of code, I would put 1c on the code itself, and the remaining 99c on its location. Why would I do that? After all, as long as the code works, does it really matter where it lives? The truth is that code added to the correct method, class and component makes a significant difference in the readability and maintainability of the code. Correct design ensures that the code is placed in the correct location, so the moral of the story is: to get the biggest bang for your buck, focus on correct design and not just on getting the code to work. Remember, in a world with 7 billion people, there are probably millions who can write working code, but only a select few who can design it correctly.

Wednesday, March 7, 2012

Enterprise vs. Cloud Computing

I recently found myself in a discussion where I was trying to differentiate between web computing and cloud computing, making the point that these days all web-based products claim to be cloud-enabled. I later realized that I was making the wrong comparison. The web is in fact just a medium for product delivery; the comparison should ideally have been between "Enterprise Computing" and "Cloud Computing". In an era where "Cloud" has become the buzzword, most people are touting their web-enabled enterprise products as cloud-enabled.

Enterprise and Cloud computing are two different things; perhaps the only thing they have in common is that both are web-enabled, meaning they can be accessed via a browser over an internet connection. There are, however, many differences between these two technologies, some of which are highlighted below.

1. Single-tenant vs. Multi-tenant: Enterprise solutions are generally single-tenant, which means that each customer or enterprise has its own deployment. Cloud solutions are multi-tenant, where the same deployment supports multiple domains (customers or enterprises).

2. Scalability: Enterprise products are expected to scale from 10s to perhaps 1000s of users, whereas Cloud solutions need to scale to millions of users. So whereas an Enterprise platform may be scaled vertically by beefing up the server hardware, Cloud platforms generally rely on horizontal scalability by increasing the number of processing nodes in the cluster.

3. Performance: In an Enterprise platform, performance may be improved by having multiple processing threads; Cloud platforms, however, improve performance by processing on multiple nodes in parallel. Cloud platforms therefore require a distributed processing and coordination framework, such as Hadoop.

4. Virtualization: Enterprise products generally use dedicated resources - servers, storage devices etc. Cloud-enabled products, on the other hand, use virtualization to share hardware resources.

5. Access Control: In an Enterprise platform, access control pertains only to user access, whereas in Cloud platforms, access control also applies to the resources shared between the multiple tenants.

I am sure that there are other differences between Enterprise and Cloud computing, so this is certainly not a comprehensive list. The objective is primarily to get the product architects thinking in the right direction. I hope you found it useful.






Wednesday, January 18, 2012

Singleton Pattern - The Correct Usage

The singleton pattern is perhaps the most widely (mis)used pattern. Developers who program in object-oriented languages reach for the singleton whenever they think of a service or manager class. I have seen a lot of projects use singletons even when the class encapsulates no private data that needs to be protected from multiple instantiations. The singleton pattern - a private constructor plus a getInstance() method - ensures that only a single instance of a class exists and that all threads use that instance. A database manager class that maintains a local data cache is an ideal candidate for a singleton, as we wouldn't want to maintain multiple local caches; the cache should be loaded once and preserved in memory for faster access, with thread-safe operations on it. If a class does not hold and control access to private data, then simple object instantiation makes perfect sense.

Another common mistake in singleton implementations is to lazily instantiate the object inside getInstance(), making the method thread-safe and checking whether the object has already been created. The implementation logic in Java is given below; getInstance() is made thread-safe with the synchronized keyword. This approach degrades application performance, as a lock is acquired every time a thread calls getInstance().


public class MySingleton {

  private static MySingleton mySingleton = null;

  private MySingleton() { }  // private constructor prevents outside instantiation

  public static synchronized MySingleton getInstance() {
    if (mySingleton == null) {
      mySingleton = new MySingleton();
    }
    return mySingleton;
  }
}


A better alternative is to instantiate the singleton at the time of declaration. In this case, the JVM creates the instance when the class is initialized, and class initialization is guaranteed by the JVM to be thread-safe. Since getInstance() no longer needs to be synchronized, this offers better performance.



public class MySingleton {

  private static final MySingleton mySingleton = new MySingleton();  // instantiated during class initialization

  private MySingleton() { }

  // No need to synchronize the method
  public static MySingleton getInstance() {
    return mySingleton;
  }
}


The concurrency-optimized implementation of the Singleton pattern is explained in many Java Concurrency books, but hopefully now you won't have to read the whole book to learn this neat trick.
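For readers who want lazy initialization without the synchronized keyword, one commonly described variant is the initialization-on-demand holder idiom. The sketch below uses the same MySingleton name as above; the nested Holder class is illustrative:

```java
public class MySingleton {

  private MySingleton() { }  // private constructor prevents outside instantiation

  // The nested class is not initialized until getInstance() is first called,
  // so initialization stays lazy; the JVM guarantees that class
  // initialization is thread-safe, so no synchronization is needed.
  private static class Holder {
    private static final MySingleton INSTANCE = new MySingleton();
  }

  public static MySingleton getInstance() {
    return Holder.INSTANCE;
  }

  public static void main(String[] args) {
    // Both calls return the same instance
    System.out.println(MySingleton.getInstance() == MySingleton.getInstance());  // true
  }
}
```

This keeps both properties discussed above: no lock on the getInstance() path, and no instantiation until the singleton is actually used.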


Tuesday, June 21, 2011

Using Database Constraints

For many developers, a relational database is just a simple data store. They don't really believe in having any constraints in the database, and prefer to implement all checks and bounds in the application logic. This is a recipe for disaster, and in this post I will look at some key database constraints that should always be set for an application.
  1. Foreign Key: A relational database is called "relational" for a reason - the term signifies the ability to relate tables (and data) with foreign key constraints. A foreign key constraint avoids data integrity problems and prevents programming errors. Having foreign keys is essential for parent-child and other association relationships. A commonly quoted example is the "Purchase Order" and "Line Item" data, where the purchase order id is the foreign key in the line item table. Not having the foreign key relationship between these tables allows for the possibility of having an incorrect purchase order id in the line item table. This value may be inserted mistakenly by the application code or perhaps manually from the SQL prompt, which in turn results in unpredictable errors in the application and perhaps defensive code to check if the purchase order id is correct before performing further processing. It is much easier to avoid these problems by having proper foreign key constraints in the database.
  2. Not Null Constraint: This is another commonly overlooked problem. Even with limited domain knowledge, it is not too difficult to ascertain which columns can never be null and to enforce this in the database by adding a "not null" constraint on the column. Again, this avoids the need for defensive code all over the place that checks whether the variable holding the column's data is null before using it for further processing.
  3. Unique Constraint: Unique constraints can be placed on a single column, or on a group of columns. To create a unique constraint on a set of columns, create an index on them and mark the index as unique. Without the unique index, it is common for application code to query the database to check whether certain key data is already present before inserting a new record. As an example, if the user id in the user table needs to be unique, it is better to enforce this with a constraint instead of querying the database to see if the user id already exists. Attempting to insert a duplicate user id results in an exception, which can be caught so that the user can be asked to select a different user id. There is a significant performance benefit to the constraint approach over querying for duplicates.
  4. Using ENUM: Most programming languages support the enum datatype, and now most databases do too. Before enums, if a column could hold a fixed set of values, the solution was generally to have integer (or char) constants denote the various values. As an example, if a user account could be 'active' or 'disabled', the 'state' column would typically store integers (0 and 1), or perhaps the characters 'A' or 'D', to denote the different states. With enums, it is possible to enumerate all the values that a column might store and use the same names in the code, so that data comparison becomes easier.
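The enum approach from point 4 can be sketched on the Java side as well. The AccountState name and the 'A'/'D' codes below are illustrative, mirroring the example in the text:

```java
public class EnumExample {

  // Mirrors the fixed set of values the 'state' column may hold,
  // instead of scattering bare 'A'/'D' characters through the code.
  enum AccountState {
    ACTIVE('A'),
    DISABLED('D');

    private final char code;  // character actually stored in the column

    AccountState(char code) { this.code = code; }

    char getCode() { return code; }

    // Translate the stored character back into the enum constant
    static AccountState fromCode(char code) {
      for (AccountState s : values()) {
        if (s.code == code) return s;
      }
      throw new IllegalArgumentException("Unknown state code: " + code);
    }
  }

  public static void main(String[] args) {
    AccountState s = AccountState.fromCode('A');
    System.out.println(s);  // comparisons now use names, not magic characters
  }
}
```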
While there may be other useful constraints, I generally find the ones mentioned above the most useful, and find it difficult to comprehend how an application can do without them.

Friday, March 18, 2011

Detecting Concurrency Problems using TestNG

With the emphasis on product quality, unit and integration testing is gaining widespread momentum, and TestNG seems to have become the de facto standard for writing these tests. A great feature in TestNG - often missed in the JUnit vs. TestNG comparisons on the web - is the ability to execute a test in parallel using multiple threads. This is pretty useful in detecting concurrency problems in the code. A developer could write a test at the Controller level that exercises the Service and DAO (Data Access Object) code; if this code contains any concurrency constructs for thread-safe access, the TestNG threads can expose concurrency problems such as race conditions and deadlocks.

Using the TestNG framework, it is relatively easy to specify that a given test be executed in parallel. This can be done using additional parameters to the @Test annotation.

As an example, consider the following test:

@Test (threadPoolSize = 3, invocationCount = 9, timeOut = 1000)
public void myTest() {
  // write your test here
}

The threadPoolSize parameter determines the number of threads used to execute the test, and invocationCount determines the total number of times the test is executed. The timeOut parameter - 1 second in the above example - fails the test if an invocation does not complete within that time, so a deadlocked test surfaces as a quick timeout failure instead of hanging the test run indefinitely. Since it is better to detect a deadlock sooner rather than later, it is generally worth specifying a reasonable timeOut.
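To see the kind of problem such a parallel run can expose, here is a minimal plain-Java sketch of a lost-update race (no TestNG dependency; all names are illustrative): several threads increment an unsynchronized counter, which can silently lose updates, while an AtomicInteger never does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class RaceDemo {

  static int unsafeCount;                                  // unsynchronized shared state
  static final AtomicInteger safeCount = new AtomicInteger();

  static void run(int threads, int iterations) {
    unsafeCount = 0;
    safeCount.set(0);
    CountDownLatch done = new CountDownLatch(threads);
    for (int t = 0; t < threads; t++) {
      new Thread(() -> {
        for (int i = 0; i < iterations; i++) {
          unsafeCount++;               // read-modify-write is not atomic: updates can be lost
          safeCount.incrementAndGet(); // atomic increment never loses an update
        }
        done.countDown();
      }).start();
    }
    try {
      done.await();  // wait for all threads to finish
    } catch (InterruptedException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    run(3, 100_000);
    // unsafeCount is often less than 300000; safeCount is always exactly 300000
    System.out.println("unsafe=" + unsafeCount + " safe=" + safeCount.get());
  }
}
```

A TestNG test with threadPoolSize set, exercising code like the unsafeCount increment, would intermittently fail and flag the race.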

With the advent of these super-easy testing frameworks and the value proposition they bring, most developers are jumping on the unit/integration testing bandwagon. This is great for the software engineering field in general; perhaps some day bugs will no longer be considered the norm in a software product.

Wednesday, February 16, 2011

Ant build under Eclipse - Error running javac.exe compiler

I ran into this problem twice: the Ant build script executes perfectly from the command prompt, but fails under Eclipse with the message "Error running javac.exe compiler". It took me a while to figure out, and it turns out the problem arises from the javac task in Ant. If the task contains the attribute fork="yes", Ant tries to spawn a new instance of javac to compile the source. If the Java library referenced in the project is a JRE, Ant can't find javac and therefore gives the above-mentioned error. Changing the attribute to fork="no" lets the build complete successfully, but a better solution is to reference the JDK - not the default JRE - as the project's library.
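For reference, the attribute in question lives on the javac task in build.xml; a minimal fragment (target and directory names are illustrative) looks like this:

```xml
<!-- build.xml fragment: target and directory names are illustrative -->
<target name="compile">
  <!-- fork="yes" spawns an external javac process, which fails under Eclipse
       when the project references a JRE (no javac) instead of a JDK -->
  <javac srcdir="src" destdir="build/classes" fork="no"/>
</target>
```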

Tuesday, December 14, 2010

Concurrency - Optimistic vs. Pessimistic Approach

Whenever developers think of concurrency, the first things that come to mind are semaphores and mutexes that provide serial access to a critical section of code. Most languages provide an extensive API for thread synchronization, and very often folks start using the synchronization primitives without much thought about what they are trying to accomplish. The most abused concurrency primitive is probably the "synchronized" keyword in Java, which often gets put anywhere and everywhere a developer feels there is a possibility of concurrent access. "Synchronized" is a monitor, and as such it doesn't require explicit lock and release statements the way a semaphore or mutex does. This is perhaps why people add the synchronized keyword to any method they feel needs protection from concurrent access; it is not uncommon to come across deeply nested method calls, with each method declared synchronized. Synchronized implicitly obtains and releases a lock on the object every time the method is called - a computationally expensive operation that makes the application slower than it needs to be. Java 5 provides some powerful new concurrency primitives, but before jumping on the bandwagon and using those primitives all over the code, it is better to evaluate the concurrency needs of the application being built.

There are generally two approaches to handle concurrency in a software program, each with its pros and cons. An engineering team should consider and evaluate both approaches and decide to use either one, or both, based on the needs of the product they are building. The two approaches are:

Optimistic approach: In this approach, there are no semaphores or mutexes protecting the critical section of code that handles the shared data. There is a master copy of the shared data, and each thread gets a local copy to work on. When a thread wishes to commit the changes made to its local copy, the local copy is compared with the master copy to ascertain whether the data has been modified since the thread last read it. If not, the update succeeds; if the data has indeed been modified, a concurrent modification exception is thrown and the user is expected to re-apply the modifications to the new copy of the data. This approach is common in databases, and Java uses a similar fail-fast scheme for collections that are not thread-safe by default (HashMap, HashSet, ArrayList).

Pros:
  1. Due to the absence of semaphores and mutex, the application exhibits better performance and scalability.
  2. The modifications of the first thread that performs the update are persisted, whereas the other threads are informed of the change in data and requested to repeat the update on the modified data.
Cons:
  1. The user may need to perform the modifications again if another thread updated the data after the user's thread read it. This can cause frustration in a multi-user, transaction-heavy environment.
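The optimistic read-compare-update cycle described above can be sketched with Java's atomic classes, whose compareAndSet operation performs exactly this check; the counter below is an illustrative stand-in for the shared master copy:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the optimistic approach: no locks; read a local copy, compute,
// then commit the update only if the master copy is still unchanged.
public class OptimisticCounter {

  private final AtomicInteger master = new AtomicInteger(0);  // master copy

  public int increment() {
    while (true) {
      int localCopy = master.get();  // read the current master value
      int updated = localCopy + 1;   // work on the local copy
      // compareAndSet succeeds only if no other thread modified the master
      // in the meantime; otherwise we retry with a fresh copy (the analogue
      // of re-applying modifications after a concurrent modification).
      if (master.compareAndSet(localCopy, updated)) {
        return updated;
      }
    }
  }

  public int get() {
    return master.get();
  }

  public static void main(String[] args) {
    OptimisticCounter c = new OptimisticCounter();
    c.increment();
    c.increment();
    System.out.println(c.get());  // 2
  }
}
```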

Pessimistic approach: This approach requires the use of a semaphore, mutex or monitor to ensure serial access to a critical section in code. In this approach, a single copy of the data is maintained and serial access is provided to threads requesting access to this data. When a given thread enters the critical section, no other thread is allowed to access this data until the thread exits the critical section.

Pros:
  1. Suitable for situations where there is no shared data, but serial access needs to be provided to a shared resource, such as a socket.
Cons:
  1. If the semaphore or mutex is not released properly, it results in a resource leak, and threads waiting on it can remain blocked indefinitely. This degrades the application over time.
  2. Another problem with semaphores and mutex is the possibility of a deadlock, which occurs when a circular dependency is introduced between two threads, each requesting a lock on a resource that is currently held by the other.
  3. Since serial access is provided to concurrently executing threads that wish to update shared data, the changes made by the last thread are persisted, whereas the other threads are unaware of what happened to their modifications.
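For contrast, here is a minimal sketch of the pessimistic approach using Java's built-in monitor; the counter is again an illustrative stand-in for the shared data:

```java
// Sketch of the pessimistic approach: a single copy of the data, with the
// synchronized keyword providing serial access to the critical section.
public class PessimisticCounter {

  private int value = 0;  // single master copy, never handed out as a local copy

  // Only one thread at a time may enter; others block until the lock is released
  public synchronized int increment() {
    return ++value;
  }

  public synchronized int get() {
    return value;
  }

  public static void main(String[] args) {
    PessimisticCounter c = new PessimisticCounter();
    c.increment();
    c.increment();
    System.out.println(c.get());  // 2
  }
}
```

Note how the last writer simply wins here, as described in con 3, whereas the optimistic approach would have notified the earlier threads of the conflict.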
A given application may use either one, or both, of the above-mentioned approaches. For shared data access among multiple threads, it is preferable to use the optimistic approach, whereas, for shared resource access (socket etc.), it is generally better to use the pessimistic approach.