Eliminating “Transient Duplicates” with T‑SQL


Duplicate Data Problem

In a recent consulting engagement, a client’s product listings had a sneaky data quality issue — transient duplicates. These weren’t simple copy-paste duplicates; they were records for the same ProductId with very close listing and sold-out dates (within 10 days of each other), created because the same products were listed on multiple sales platforms. These duplicates skewed inventory counts and confused the sales team. We had to act quickly to clean up the data without losing legitimate records.

What is a duplicate in this scenario?

  1. Same [ProductId]
  2. [ProductListDate] values the same or within 10 days of each other, or [ProductSoldoutDate] values the same or within 10 days of each other (see the predicate sketch below)
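
In T-SQL, that pairwise rule is the join predicate used in the full script at the end of this post; here it is in isolation, with A and B as two aliases of the same table:

A.[ProductId] = B.[ProductId]
AND (
        ABS(DATEDIFF(DAY, A.[ProductListDate],    B.[ProductListDate]))    <= 10
     OR ABS(DATEDIFF(DAY, A.[ProductSoldoutDate], B.[ProductSoldoutDate])) <= 10
    )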

The tricky part is that the 10-day rule creates what’s called a transient duplicate.

  • A is similar to B, and B is similar to C, but A is not directly similar to C.
  • We want to treat A, B, and C as part of the same *duplicate group* (Cluster).

See the example below.

Sample rows for ProductId 1 (the same rows loaded by the full script at the end):

Row  PlatformId  ProductListDate  ProductSoldoutDate
1    1           2025-01-01       2025-02-01
2    2           2025-01-08       2025-02-05
3    3           2025-01-15       2025-03-01

  • Row 1 and Row 2 are duplicates because their [ProductListDate] values are within 10 days of each other.

  • Row 1 and Row 3 are not duplicates — both their list and sold-out dates are more than 10 days apart.

  • But Row 2 and Row 3 are also within 10 days of each other.

This means that even though Row 1 and Row 3 aren’t directly duplicates, Row 3 still belongs to the same duplicate group, because it connects to Row 1 through Row 2. This is a classic connected component labeling algorithm, done here entirely in SQL.

Step 1: Locate all duplicate pairs


We want to treat Row 1, Row 2, and Row 3 as part of the same *duplicate group* (Cluster).

The #Pairs table captures all row pairs whose ProductListDate or ProductSoldoutDate values are within 10 days of each other.
However, it does not include transitive connections: there is no direct Row 1 to Row 3 pair.
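
With the sample data from the full script, #Pairs ends up holding four rows: (RowID1, RowID2) = (1, 2), (2, 1), (2, 3), and (3, 2). The pair (1, 3) is absent because both of its date differences exceed 10 days; connecting Row 1 and Row 3 indirectly is what Step 2 is for.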

Step 2: WHILE loop

I use a loop to propagate the smallest ClusterID across all related rows until convergence.
This ensures that all related rows, even through indirect paths, end up with the same ClusterID.

How it works:

  1. Each row initially assigns its own SourceId as its ClusterID.
  2. The loop repeatedly compares RowID1 and RowID2 from #Pairs and updates RowID2’s ClusterID to match RowID1’s ClusterID whenever RowID1’s ClusterID is smaller.
  3. The loop stops when no more ClusterIDs change (convergence).

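To see the WHILE loop in action, trace it against the sample data from the full script, where #Pairs holds (1, 2), (2, 1), (2, 3), and (3, 2) and the initial ClusterIDs are 1, 2, 3, 4:

  • Pass 1: Row 2 takes Row 1’s ClusterID (1); Row 3 takes Row 2’s pre-update ClusterID (2), since a single UPDATE statement reads the values as they were before the statement ran. ClusterIDs are now 1, 1, 2, 4.

  • Pass 2: Row 3 takes Row 2’s new ClusterID (1). ClusterIDs are now 1, 1, 1, 4.

  • Pass 3: no row has a paired neighbor with a smaller ClusterID, @Updated stays 0, and the loop exits.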

Step 3: Ranking and duplicate flagging

At this point, it’s easy: rank the rows within each cluster, pick one as the non-duplicate, and flag the rest as duplicate rows.

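On the sample data, SourceId 1 survives as the non-duplicate (it has the earliest ProductListDate in its cluster), SourceId 2 and 3 end up with IsDuplicate = 1, and SourceId 4 (a different ProductId, so a cluster of its own) is left untouched.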

Performance tuning tip: if you’re dealing with millions of rows, consider indexing the temp tables on ProductId and the date columns, and test the query plan. The full query with the sample data is included below.
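
As a rough sketch of what that indexing could look like (these index names and column choices are my own suggestions, not part of the original script; verify them against your actual query plan):

-- Hypothetical supporting indexes; adjust to your workload.
-- Speeds up the self-join that builds #Pairs.
CREATE INDEX IX_ClusterWork_Product
    ON #ClusterWork ([ProductId], [ProductListDate], [ProductSoldoutDate]);
-- Speed up the repeated lookups inside the WHILE loop.
CREATE INDEX IX_ClusterWork_RowID
    ON #ClusterWork ([RowID]) INCLUDE ([ClusterID]);
CREATE INDEX IX_Pairs_RowID2
    ON #Pairs ([RowID2]) INCLUDE ([RowID1]);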

Please share your experience with deduplicating data, or other ways you’ve approached this.

DROP TABLE IF EXISTS #Source, #ClusterWork, #Pairs;

-- Sample data: ProductId 1 is listed on three platforms with overlapping dates;
-- ProductId 2 is a separate, legitimate record.
CREATE TABLE #Source (
    [SourceId]           INT IDENTITY(1,1),
    [PlatformId]         INT,
    [ProductId]          INT,
    [ProductListDate]    DATE,
    [ProductSoldoutDate] DATE,
    [IsDuplicate]        BIT
);

INSERT INTO #Source ([ProductId], [PlatformId], [ProductListDate], [ProductSoldoutDate], [IsDuplicate])
SELECT 1, 1, '2025-01-01', '2025-02-01', 0 UNION ALL
SELECT 1, 2, '2025-01-08', '2025-02-05', 0 UNION ALL
SELECT 1, 3, '2025-01-15', '2025-03-01', 0 UNION ALL
SELECT 2, 1, '2025-01-15', '2025-04-01', 0;

-- Working copy: every row starts out as its own cluster.
SELECT *,
       [SourceId] AS [RowID],
       [SourceId] AS [ClusterID]
INTO #ClusterWork
FROM #Source;

-- Step 1: every pair of rows for the same ProductId whose list dates
-- or sold-out dates fall within 10 days of each other.
SELECT DISTINCT
       A.[RowID] AS [RowID1],
       B.[RowID] AS [RowID2]
INTO #Pairs
FROM #ClusterWork A
JOIN #ClusterWork B
  ON  A.[ProductId] = B.[ProductId]
  AND A.[RowID] <> B.[RowID]
  AND (
          ABS(DATEDIFF(DAY, A.[ProductListDate],    B.[ProductListDate]))    <= 10
       OR ABS(DATEDIFF(DAY, A.[ProductSoldoutDate], B.[ProductSoldoutDate])) <= 10
      );

SELECT * FROM #ClusterWork;
SELECT * FROM #Pairs;

-- Step 2: propagate the smallest ClusterID across connected pairs
-- until a full pass makes no changes.
DECLARE @Updated INT = 1;
WHILE @Updated > 0
BEGIN
    SET @Updated = 0;

    UPDATE CW1
    SET CW1.[ClusterID] = CW2.[ClusterID],
        @Updated = 1   -- records that this pass changed at least one row
    FROM #ClusterWork CW1
    JOIN #Pairs P         ON CW1.[RowID] = P.[RowID2]
    JOIN #ClusterWork CW2 ON CW2.[RowID] = P.[RowID1]
    WHERE CW1.[ClusterID] > CW2.[ClusterID];
END;

-- Step 3: within each cluster, keep the earliest row and flag the rest.
;WITH [Ranked] AS
(
    SELECT [SourceId],
           ROW_NUMBER() OVER (PARTITION BY [ClusterID]
                              ORDER BY [ProductListDate], [ProductSoldoutDate], [SourceId]) AS [Rank]
    FROM #ClusterWork
)
UPDATE T
SET [IsDuplicate] = 1
FROM #Source T
INNER JOIN [Ranked] R ON T.[SourceId] = R.[SourceId]
WHERE R.[Rank] > 1;

SELECT * FROM #Source;

Comments (2)

  • 05/18/2025

    Midz

    Nice article, great way to clean up duplicates

  • 05/18/2025

    Indhu Raphel

    Thank you for sharing this! The use of a connected component labeling algorithm to group related rows is an excellent approach for handling indirect duplicates.
