Disclaimer
This is a purely self serving article. I've had to do this a number of times and end up having to re-learn it all every time I've done it. SOOOOO this time I will drop some basics here as a reference for my failing memory but if it helps anybody else even better. It's not indented to be instructional or in anyway best practice so if you have issues or know of more appropriate ways please feel fee to drop things in the comments etc.
The Problem
The key problem being addressed is a number of duplicate records that have found their way into our datasets. These duplicates are matched on the master record id and the description used in the child example
id | parentId | description |
---|---|---|
1 | P1 | this is ok |
2 | P1 | I'm here twice |
3 | P1 | I'm here twice |
Breaking it down
So the first thing I did was a query that will actualy help me find the rows I was looking for and the proof there are actual duplicates.
select "parentId",
id,
description,
row_number()
over (partition by "parentId", description order by "parentId", overview) as RN
from "Detail"
This produces results like these
id | parentId | description | RN |
---|---|---|---|
1 | P1 | this is ok | 1 |
2 | P1 | I'm here twice | 1 |
3 | P1 | I'm here twice | 2 |
So we now have a row id/number based on the parentId and the description matching.
We then use this in the delete query to kill al the row numbers over 1
delete
from "Detail"
where id in (
select id
from (
select "parentId",
id,
description,
row_number()
over (partition by "parentId", description order by "parentId", overview) as RN
from "Detail"
) as F
where RN > 1
);
This is probably a convoluted way to achieve this and I'm sure there will be more efficient way but it has worked for me in this occasion so I hope it helps others.
Running the previous query then results in
id | parentId | description | RN |
---|---|---|---|
1 | P1 | this is ok | 1 |
2 | P1 | I'm here twice | 1 |
Top comments (0)