Using JOIN or EXISTS?

 

Quote

Using JOIN or EXISTS?

A common thing I see is the question:  "How do I find parent rows that have no children?"  There are three answers commonly given, and while all of them work, only two of them actually states the question being asked AND ends up with the right answer. Only one of them can work in all cases. 
 
Note that the question being asked does not imply that any child data is required in the output.  If the question was, what is the value of the parent and child columns for rows that do have parent rows.  This requires a join because the data in the child columns is required. 
 
Before starting, create the following tables (and not in master as I have a habit of doing on my laptop!)
 
create table parent
(
    parentId     int primary key,
    otherColumns varchar(10)
)
insert into parent
select 1, ‘blah’
UNION ALL
select 2, ‘more blah’
UNION ALL
select 3, ‘other blah’
 
create table child
(
    childId      int primary key,
    parentId     int references parent(parentId),
    otherColumns varchar(10)
)
insert into child
select 1,1, ‘blah’
UNION ALL
select 2,1, ‘more blah’
UNION ALL
select 3,2, ‘different’
To determine which rows in parent have no child rows, you can use either:
 
select parent.*
from   parent
                left outer join child
                    on child.parentId = parent.parentId
where  child.parentId is null
 
or
 
select *
from   parent
where  parentId not in (select parentId
                                       from  child)
 
or
 
select *
from   parent
where  not exists ( select 1
                               from   child
                               where  child.parentId = parent.parentId)
 
They all return the same data:
 
parentId    otherColumns
———– ————
3           other blah
 
So, they are all exactly the same, right?  If you think "yes," then I clearly have to work on my foreshadowing techniques.  The answer is more like "yes and no."  Mathematically, they are all equivalent, but they don’t all ask the same question, which makes one of them suffer when trying to read it. First up, using an outer join:
 
select parent.*
from   parent
                 left outer join child
                       on child.parentId = parent.parentId
where  child.parentId is null
 
The problem with this one is that you aren’t really asking the right question.  Here you are joining the sets together so you a set with all of the columns in the parent and the child, where the parentId of the parent equals that of the child.  Then you are looking for cases where a column is null. 
This is not the question you asked originally, but because the parentId can’t be null, it turns out that the only way the parentId can be null is if it has no matches in the child table. Since NULL <> NULL, there is no way that even if parentId was null in a row (if it wasn’t the primary key) it will always work.  It can suffer from performance issues in some cases as well (especially without indexing,) because it may perform the join and materialize the set.  The more complex the rest of the query, the more likely it will not end up asking the question as expected.
So what about the version using not in:
 
select *
from   parent
where  parentId not in (select parentId
                                       from   child)
 
This asks: Give me the rows that do not exist in the set of rows in child.  Perfect question, will give the correct answer (again because NULL <> NULL, no problems if parent could be NULL.)
It only suffers in that if a table has > 1 columns in its key, then you cannot write an "in" query that will work, without concatenating columns, which will send performance into the toilet.
It used to be that this could perform worse than the next example, but the optimizer is smart enough to deal with this now (at least it is supposed to be, your results may vary, so don’t blame me, tell it to the newsgroups 🙂  So what about using exists:
 
select *
from   parent
where  not exists ( select 1
                               from   child
                               where  child.parentId = parent.parentId)
 
This still states give me the parent rows where there doesn’t exist a join between the tables that will return values.  It also can be extended easily to mulitiple columns, or even different situations not involving equality by simply changing the where clause of the child table:
 
select *
from   parent
where  not exists ( select 1
                               from   child
                               where  child.parentId = parent.parentId
                                   and  child.otherColumns = parent.otherColumns)
 
Now the result changes to include the rows where parentId = 2, becuase the value for the otherColums column is different. If you have more than one child table to check, you can do this in the same statement:
 
create table child2
(
    childId      int primary key,
    parentId     int references parent(parentId),
    otherColumns varchar(10)
)
insert into child2
select 1,3, ‘Coolness’
 
Now just add other exists criteria to the where clause:
 
select *
from   parent
where  not exists ( select 1
                               from   child
                               where  child.parentId = parent.parentId)
  and  not exists (   select 1
                               from   child2
                               where  child2.parentId = parent.parentId)
 
Note that this now returns nothing, since now 1 and 2 are in child, 3 is in child2.  So in conclusion, the point of this post is to avoid using joins to try to compare two sets, since this is not its function.  In the end the join method doesn’t seem clear as to what you are asking.  I only touched on performance, but the join will often peform worse than the using in or exists, simply because the work involved to do the join is greater.
This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s