Detecting duplicate rows.

Daniel Walker · March 19th, 2004, 09:07 AM

Here's a question for you SQL heads...

I have a fairly mucky database I've inherited, here, with lots of duplicate rows. I want not only to be able to remove the duplicates, but to provide a means for the client I'm working for to be able to search for, and remove these duplicates in the future. I'm doing a self join, thus:

SELECT t1.id, t2.id, t1.name FROM table AS t1 LEFT JOIN table AS t2 ON t1.name=t2.name AND t1.id<>t2.id AND t2.id IS NOT NULL;

This works, to an extent (it spots duplicates), but my 'IS NOT NULL' clause isn't working. It's probably stinking obvious, but I can't see it. Maybe I'm running short of calories and need food.

Any assistance greatly appreciated.

Dan

Daniel Walker · March 19th, 2004, 10:08 AM

Well, now. Here's wierd. Do this:

SELECT t1.id, t2.id, t1.name FROM table AS t1 LEFT JOIN table AS t2 ON t1.ame=t2.name AND t2.id IS NOT NULL;

...and my 'IS NOT NULL' works, removing null results> However, I can't add a third clause to the effect:
e1.organiser_id<>e2.organiser_id

...since all the NULL values come back in, despite the fact that I'm stipulating (effectively) "clause AND clause AND clause".

I don't get this. I'm either seriously misunderstanding something about self-joins, or something about NULLs...

Daniel Walker · March 19th, 2004, 10:23 AM

Right. I'm stupid. Try:

SELECT t1.id, t2.id, t1.name FROM table AS t1 LEFT JOIN table AS t2 ON t1.name=t2.name AND t1.id<>t2.id WHERE t2.id IS NOT NULL;

This works - and is, incidentally, a lot simpler than the "temporary table" approach you usually see. (The optimiser seems to like it, too).

(AS I said, I was running short of blood sugar at the time of my original post. I owe this result to a Medium Roast Beef and Rocket sandwich... Very nice ;)

ruhin · May 19th, 2004, 07:02 PM

Make the Column as UNIQUE.

By making this it would not accept any
redundant values.

It is very easy to use "phpMyAdmin" for creating
and structuring MySql database.I suggest to use this.

Daniel Walker · May 20th, 2004, 04:04 AM

Well, no, the fields I needed to match were things like 'organiser name', where the same organiser name _may_ have been entered twice for the same organiser, but _may_ also have been entered for a national chain of organisers whoe ach share the same name; where the only way of testing the legitimacy of the duplicate field value, is to see it in the context of other fields, such as the 'city' portion of the address, or similar identifying characteristic. There are many tens of thousands of results, and the client really needs to be able to seek out _possible_ duplicates, not erradicate them. In any case, the records already existed, so even if the aim was a totally unique column of values, trying to force that upon the existing table would have resulted in a constraint failure, at best. In cases like this, you need to use a LEFT JOIN and filter against NULL values - which is somewhat counterintuitive, because you're actually aiming to generate false matches and filter _in favour_ of those false matches. In any case, the SELECT in my third post sorted it: it was fairly trivial, when I figured out what I was doing wrong.

I always try to remind myself that when a program isn't doing what I want it to, it's because it's doing precisely what I'm _telling_ it to do - and so all I need to do, is work out what I'm _telling_ it to do, in order to solve the problem... As I said, on the day in question, I think I was just suffering a crisis of faith over this doctrine, brought on by a lack of blood sugar during the latter part of a long morning's work. The roast beef fixed it. I may put a small notice on the finished product, when I'm done: "Some cows were injured in the creation of this product". :)

junemo · June 21st, 2004, 12:29 AM

Hi Daniel,

I not really understand the function JOIN in SQL statement.

My question: What is the best way to check if a record already exists in a oracle-dbase from php or in the case of when the same form is submited twice beacuse of the "refresh"-button?

Please attach a sample code (oracle in php SQL script)for detect the duplicate record.

Thanks in advance.

qazi_nomi · June 21st, 2004, 02:39 AM

You should have a unique primary key in the relations and when
you enter the record you check that weather the record is already there or not if not then insert else no insertion

Love 4 all