When I look at the log files our server application creates (especially in the middleware tier), they contain a lot of information. But how much of it is helpful for administrators, our colleagues in the „IT Operations“ department who are in charge of keeping these applications running?
When you think about it, it all comes down to resources. Administrators can help when
- the application runs out of memory
- the filesystem is full
- the partner application (e.g. „web service“) is down
- we get errors because an SSL certificate of a server we talk to has expired
- the database cannot be reached because the configured user password is wrong
- the configured URI of some partner application is wrong
- communication with a partner system is interrupted because of a misconfigured firewall
- network connections are stale because of some network problems
- the application cannot send mail because of a misconfigured mail server
etc. In general, administrators can help when something is wrong with the resources our application needs, or when the configuration for using these resources is wrong. I call these errors „transient“ because once the problem with the resource or configuration is fixed, the error „disappears“.
On the other hand, administrators cannot help when
- there is an integrity constraint violation when writing to the database
- there is a NullPointerException because a programmer forgot to check that a mandatory input field actually contains data
- there are errors because of an unexpected input format
- there are validation errors while talking to a partner application
- some algorithm crashes with a „division by zero“ error
etc. Administrators cannot fix these things; only programmers can, by changing the code. I call these „permanent“ errors because they cannot go away at runtime.
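The two lists above can be turned into a rough rule of thumb in code. The following is only a minimal sketch, not a complete classification: the class `ErrorKinds` and its `Kind` enum are hypothetical names, and the mapping (I/O problems and memory exhaustion count as transient, everything else as permanent) is an assumption that a real application would have to refine.

```java
import java.io.IOException;

// Hypothetical helper: maps an exception to the two categories described above.
final class ErrorKinds {
    enum Kind { TRANSIENT, PERMANENT }

    static Kind classify(Throwable t) {
        // Walk the cause chain, because resource problems are often wrapped
        // in application-level exceptions.
        for (Throwable c = t; c != null; c = c.getCause()) {
            // Assumption: network, file system and SSL failures surface as IOException,
            // memory exhaustion as OutOfMemoryError.
            if (c instanceof IOException || c instanceof OutOfMemoryError) {
                return Kind.TRANSIENT;
            }
        }
        // Assumption: anything else (NullPointerException, ArithmeticException, ...)
        // needs a code change.
        return Kind.PERMANENT;
    }
}
```

Treating every unknown exception as permanent is a deliberate simplification here; it sends unclear cases to the developers rather than to the administrators.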
Of course, there is a „gray area“ where administrators may check things but are not necessarily able to help, for instance
- today a partner application responded with a „NullPointerException“ – admins could check if everything else is OK with this application
- file handles or concurrent database sessions run out – admins could increase the number of file handles or concurrent database sessions until the programmers fix the root cause
- resolving a deadlock in the database.
In an ideal world, the application's log file entries clearly indicate whether an error is transient or permanent. In reality this is not easy to achieve. For instance, deciding whether a SQLException is rooted in a transient or a permanent error means creating a list of SQL error codes for the database vendor your application uses. Take a look at the Spring Framework: they did exactly this. It is much easier to just log the SQLException and abort the transaction.
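To give an idea of what such a list looks like, here is a minimal sketch that uses the two-character SQLSTATE class instead of vendor-specific error codes. The class `SqlErrorClassifier` is a hypothetical name, and the mapping is an assumption based on the standard SQLSTATE classes (08 connection exception, 28 invalid authorization, 22 data exception, 23 integrity constraint violation). Drivers do not always report these states consistently, which is exactly why a vendor-specific list is usually needed.

```java
import java.sql.SQLException;

// Hypothetical helper: classifies a SQLException by its SQLSTATE class.
final class SqlErrorClassifier {
    enum Kind { TRANSIENT, PERMANENT, UNKNOWN }

    static Kind classify(SQLException e) {
        String state = e.getSQLState();
        if (state == null || state.length() < 2) {
            return Kind.UNKNOWN; // the driver gave us nothing to work with
        }
        switch (state.substring(0, 2)) {
            case "08": // connection exception: database unreachable
            case "28": // invalid authorization: e.g. wrong configured password
                return Kind.TRANSIENT;
            case "22": // data exception: e.g. division by zero, value too long
            case "23": // integrity constraint violation
                return Kind.PERMANENT;
            default:
                return Kind.UNKNOWN; // includes gray-area cases
        }
    }
}
```

For example, `SqlErrorClassifier.classify(new SQLException("connection refused", "08001"))` yields `TRANSIENT`, while an exception without any SQLSTATE yields `UNKNOWN`.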
In summary, when notified about transient errors, administrators can help to keep the server application running, but with permanent errors only developers can help. When log file entries identify which kind of error they refer to, it helps administrators do their job.
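One simple way to make the kind of error visible in the log is to prefix each entry with it. The following is only a sketch using `java.util.logging`; the class `TaggedLog`, the logger name „app“ and the tag format are assumptions chosen for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: writes error entries that start with the error kind.
final class TaggedLog {
    enum Kind { TRANSIENT, PERMANENT }

    private static final Logger LOG = Logger.getLogger("app");

    // Builds the tagged message, e.g. "[TRANSIENT] mail server unreachable".
    static String format(Kind kind, String message) {
        return "[" + kind + "] " + message;
    }

    // Logs the tagged message together with the causing exception.
    static void error(Kind kind, String message, Throwable cause) {
        LOG.log(Level.SEVERE, format(kind, message), cause);
    }
}
```

With a fixed prefix like this, administrators could filter the log for `[TRANSIENT]` and leave the `[PERMANENT]` entries to the developers.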