I got a bug report that some records were not synced between two systems. One system takes data in, another exposes it to external systems, and a third sits in between to convert the data. Together they form a data pipeline, a typical architecture these days.
After checking the data in the related databases, I found that the error occurred in a single database where data moves from one table to another. There is a source table, a destination table, and an intermediate table. The progress is stored in a state record, which allows the system to continue from where the last run stopped.
Whenever there are changes in the source table, a SQL trigger executes and writes the EntityId of the changed rows into the intermediate table.
Whenever the function (a timer-triggered Azure Function) starts, it reads data from the intermediate table and performs the logic required to convert the data.
Everything worked as expected until the production bug was reported.
Schema and Code
The intermediate table schema, Entity_Changes:
CREATE TABLE [dbo].[Entity_Changes]
(
    [Ordering] [bigint] IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [EntityId] UNIQUEIDENTIFIER NOT NULL
)
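The trigger itself is not part of the code in this post, but a minimal sketch of it could look like the following, assuming a source table named dbo.Entity with an EntityId column and assuming only inserts and updates are tracked (all of these names are assumptions):
CREATE TRIGGER [dbo].[TR_Entity_TrackChanges]
ON [dbo].[Entity]
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Record which entities changed so the timer function can pick them up later
    INSERT INTO [dbo].[Entity_Changes] ([EntityId])
    SELECT [EntityId]
    FROM inserted;
END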
The execution flow is:
- Read records from the Entity_Changes table
- Process them
- Delete them from the Entity_Changes table
The simplified version of the code:
private async Task<long> ReplaceRows(IDataSource dataSource, ConversionState resultState, CancellationToken cancellationToken)
{
    long recordsCount = 0;
    const int batchSize = 1000;

    do
    {
        cancellationToken.ThrowIfCancellationRequested();

        // Resume from the max Ordering recorded by the previous batch
        var pagingFrom = resultState.NumberOfRowsConverted;
        var pagingTo = pagingFrom + batchSize;

        var pagedResult = await dataSource.ReadData(pagingFrom, pagingTo);
        if (pagedResult.MaxOrdering == 0)
        {
            _logger.LogInformation("No records to process");
            break;
        }

        // Do the conversion logic here; it produces distinctResults
        recordsCount += distinctResults.Count;
        _logger.LogInformation($"{recordsCount} records");

        // NumberOfRowsConverted is used as the next PagingFrom value.
        // It is the max Ordering from the last batch.
        resultState.NumberOfRowsConverted = pagedResult.MaxOrdering;
        resultState.State = (int)ConversionStateStatus.Running;
        await _conversionStateService.SaveLastSyncedState(resultState);

        var deletedRows = await dataSource.DeleteProceededRecords(pagingFrom, pagingTo);
        _logger.LogInformation($"Deleted {deletedRows} rows from temp table");
    } while (true);

    return recordsCount;
}
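The ReadData implementation is not shown here, but given how pagingFrom and pagingTo are built, the underlying query is presumably a range query over the Ordering column, roughly like this sketch (the exact SQL is an assumption):
-- Hypothetical query behind dataSource.ReadData(@pagingFrom, @pagingTo):
-- return every row whose Ordering falls inside the requested range
SELECT [Ordering], [EntityId]
FROM [dbo].[Entity_Changes]
WHERE [Ordering] > @pagingFrom AND [Ordering] <= @pagingTo
ORDER BY [Ordering];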
In one of the data sources, I found some interesting numbers. The resultState.NumberOfRowsConverted was 5,885,592. The lowest Ordering in the Entity_Changes table was 5,895,593. The difference was 10,001. It did not make sense. How could that happen?
With those numbers, ReadData returned an empty result because pagingTo was 5,886,592 (= 5,885,592 + 1,000), which is smaller than 5,895,593.
From experience, I know that a SQL Identity column does not guarantee contiguous values; for example, when a transaction is rolled back, the Identity values it consumed are lost. But a 10K gap was too big for that.
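That behaviour is easy to see (a minimal sketch against the Entity_Changes table above):
-- Identity values are consumed even when the insert is rolled back
BEGIN TRANSACTION;
INSERT INTO [dbo].[Entity_Changes] ([EntityId]) VALUES (NEWID());
ROLLBACK TRANSACTION;

INSERT INTO [dbo].[Entity_Changes] ([EntityId]) VALUES (NEWID());
-- The Ordering of this second row skips the value consumed by the rolled-back insert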
Looking at the code for hours did not help me explain it (by the way, I found some other issues with the code along the way), so I asked Google about this SQL Identity jump.
Bingo! I found the answer in SQLHack's Avoid Identity Jump.
SQL Server uses a different identity cache size depending on the column's data type. The cache size for an INT identity column is 1,000, and the cache size for a BIGINT or NUMERIC identity column is 10,000.
That explained it: in some circumstances (such as an unexpected SQL Server restart or a failover), the cached identity values are lost and the next value jumps by the cache size. The Ordering column is a BIGINT, so the jump size is 10,000, which matches the 10,001 difference: a 10,000 jump plus the normal increment of 1.
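The mitigation described in that article is to disable the identity cache. That is not the fix applied here, but for reference, assuming SQL Server 2017 or later (where the scoped configuration option exists):
-- Disable identity caching for the whole database (SQL Server 2017+).
-- This prevents large identity jumps after an unexpected restart or failover,
-- at the cost of slower inserts.
ALTER DATABASE SCOPED CONFIGURATION SET IDENTITY_CACHE = OFF;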
Solution
Once you know the root cause, the solution is quite simple. But first off, the paging implementation was not accurate: the number of rows in a range might not be the same when reading and when deleting. How could that happen?
Say that when reading the range [0-1,000] there are 10 rows in the table, and the system processes them. By the time it deletes that same range, the trigger has inserted more rows, so the range now holds 900 rows, and the range-based delete would remove rows that were never processed. The fix is to delete only up to the max Ordering that was actually read.
// Delete only the Ordering range that was actually processed
var deletedRows = await dataSource.DeleteProceededRecords(pagingFrom, pagedResult.MaxOrdering);
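Under the same assumption about the underlying SQL, the delete becomes bounded by the last Ordering that was actually processed:
-- Hypothetical statement behind DeleteProceededRecords(@pagingFrom, @maxOrdering)
DELETE FROM [dbo].[Entity_Changes]
WHERE [Ordering] > @pagingFrom AND [Ordering] <= @maxOrdering;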
To solve the original problem, the one caused by the identity jump, query with SELECT TOP (@batchSize) ... WHERE Ordering > @pagingFrom ORDER BY Ordering instead of querying by a fixed range, so gaps in the identity values no longer matter.
var pagedResult = await dataSource.ReadData(pagingFrom, batchSize);
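The corresponding query could then look roughly like this (again a sketch, not the actual stored query):
-- Hypothetical query behind the new ReadData(@pagingFrom, @batchSize):
-- take the next batch strictly after the last processed Ordering,
-- so gaps left by identity jumps no longer matter
SELECT TOP (@batchSize) [Ordering], [EntityId]
FROM [dbo].[Entity_Changes]
WHERE [Ordering] > @pagingFrom
ORDER BY [Ordering];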
Production bugs are always interesting.