The Story of Bug Hunting in Production
BUGS COME AND GO. DEVELOPERS STAY!
Production bugs mean you cannot debug line by line; even worse that cannot reproduce them. The other days, I got one of them. We know when it happened by a screenshot reported by the client.
After a day hunting the bug, I finally found out the root cause. Once found out, the solution is a piece of cake. I thought the process I had been through is interesting and might be useful for others.
Here we go – the process of hunting a production bug.
Step 1 – Treasure in the Log
The first thing is to get the log. Find the entries at that specific time. Most of the system will display a user-friendly error message to end users. The real error, usually in the form of Exception, is buried in the log.
System.Transactions.TransactionAbortedException: The transaction has aborted. ---> NHibernate.Exceptions.GenericADOException: could not execute batch command.[SQL: SQL not available] ---> System.Data.SqlClient.SqlException: String or binary data would be truncated. The statement has been terminated. at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction) at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction) at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose) at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString) at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds, Boolean describeParameterEncryptionRequest) at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite) at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(TaskCompletionSource`1 completion, String methodName, Boolean sendToPipe, Int32 timeout, Boolean asyncWrite) at System.Data.SqlClient.SqlCommand.ExecuteNonQuery() at System.Data.SqlClient.SqlCommandSet.ExecuteNonQuery() at NHibernate.AdoNet.SqlClientBatchingBatcher.DoExecuteBatch(IDbCommand ps) --- End of inner exception stack trace --- at NHibernate.AdoNet.SqlClientBatchingBatcher.DoExecuteBatch(IDbCommand ps) at NHibernate.AdoNet.AbstractBatcher.ExecuteBatchWithTiming(IDbCommand ps) at NHibernate.AdoNet.AbstractBatcher.ExecuteBatch() at NHibernate.AdoNet.AbstractBatcher.PrepareBatchCommand(CommandType type, SqlString sql, SqlType[] parameterTypes) at NHibernate.Persister.Entity.AbstractEntityPersister.Update(Object id, Object[] fields, Object[] oldFields, Object rowId, Boolean[] includeProperty, Int32 j, Object oldVersion, Object obj, SqlCommandInfo sql, ISessionImplementor session) at NHibernate.Persister.Entity.AbstractEntityPersister.UpdateOrInsert(Object id, Object[] fields, Object[] oldFields, Object rowId, Boolean[] includeProperty, Int32 j, Object oldVersion, Object obj, SqlCommandInfo sql, ISessionImplementor session) at NHibernate.Persister.Entity.AbstractEntityPersister.Update(Object id, Object[] fields, Int32[] dirtyFields, Boolean hasDirtyCollection, Object[] oldFields, Object oldVersion, Object obj, Object rowId, ISessionImplementor session) at NHibernate.Action.EntityUpdateAction.Execute() at NHibernate.Engine.ActionQueue.Execute(IExecutable executable) at NHibernate.Engine.ActionQueue.ExecuteActions(IList list) at NHibernate.Engine.ActionQueue.ExecuteActions() at NHibernate.Event.Default.DefaultFlushEventListener.OnFlush(FlushEvent event) at NHibernate.Impl.SessionImpl.Flush() at NHibernate.Transaction.AdoNetWithDistributedTransactionFactory.DistributedTransactionContext.System.Transactions.IEnlistmentNotification.Prepare(PreparingEnlistment preparingEnlistment)
Our system leverages NHibernate for Data Access layer. And the exception tells us that there is a piece of string that does not fit into its column max length. It happens when the system commits a transaction.
The question is which table/column? There are many tables involved in a single transaction. This is a complex system.
In the bug’s context, there is very few information involved. I looked at the code that is responsible for the function. Nothing’s special.
Step 2 – Look for Variants
From the customer, I know that some users had the problem. Some have not. Therefore, logged in user plays an important role in the bug.
When looking at any problem, there are always a number of participants, there are input and output. I took a pencil and paper. I jotted down all the participants (actually the C# object/class, such as User, Workflow, …). Just jot them down. Then I drew relationships between them.
Thinking about both cases success and fail, what are the variant factors? Luckily in my case, I can quickly figure out that was the User. And in our system, we care most about 2 pieces of information: Username and Full Name.
This step is important because it will give us a good picture of what might cause the problem.
Step 3 – Look at Data
Luckily that we can access to the production database. Ran some queries. The data emerged an interesting observation:
If the username is 41 characters, it causes the issue. If the username is shorter (at that time I did not notice how many characters the shorter means). I knew that our system keeps the created/edited information. But those columns have more than 50 characters max. It should not be an issue anyway.
Let’s find all the columns having the limit of 41 characters
SELECT TOP 1000 [TABLE_CATALOG] ,[TABLE_NAME] ,[COLUMN_NAME] ,[DATA_TYPE] ,[CHARACTER_MAXIMUM_LENGTH] FROM [INFORMATION_SCHEMA].[COLUMNS] WHERE CHARACTER_MAXIMUM_LENGTH < 41 AND CHARACTER_MAXIMUM_LENGTH > 0 AND DATA_TYPE='nvarchar'
Hurray, the bad boy could not hide anymore. There I had a very small set of columns which might cause the issue. It did not take much time to check each of them. All I had to do is to check in the code which column accepts Username value.
The fix? Well easy. Just need to extend the column max length.
Dealing with production bugs is hard. There is very limited information. And for most developers, if they cannot reproduce, debug the bug, they give up. The first and most important thing is to have a right mindset:
There must be something wrong somewhere in the system.
Fix a production bug? Not a problem!
- Have the right mindset. Do not assume.
- Gather whatever we can. Use whatever we got. Do not complain.
- Take a deep breath.
- Apply basic logical thinking. The process I share here is one of them.
- Hey, need a bit of luck.
Either you like it or not, bugs happen in production. I suggest you embrace them, have fun with them.
BUGS COME AND GO. DEVELOPERS STAY!