Ignite supports collocated and non-collocated distributed SQL joins. Moreover, if the data resides in different tables (aka. caches in Ignite), Ignite allows for cross-table joins as well:
SELECT from Person as p, Organization as org WHERE p.orgId = org.id AND lower(org.name) = lower('apple')
REPLICATED data sets always work without any limitations.
However, if you join
PARTITIONED data sets, then you have to make sure that the keys you are joining on are either collocated or you switched on the non-collocated joins parameter for a query. The two types of distributed joins modes are explained further below.
To learn more about data collocation concept and how to use it in practice refer to the dedicated documentation section
By default, if an SQL join has to be done across a number of Ignite tables, then all the tables have to be collocated. Otherwise, you will get an incomplete result at the end of query execution because at the join phase a node uses the data that is available locally only.
Referring to Picture 1. below you will see that, first, an SQL query is sent to all the nodes (
Q) where data, required for a join, is located. After that the query is executed right away by every node (
E(Q)) over the local data set and, finally, the overall execution result is aggregated on the client side (
Besides the fact that the affinity collocation is a powerful concept that, once set up for an application's business entities (tables), will let you execute cross-table joins in the most optimal way by returning a complete and consistent result set, there is always a chance that you won't be able to collocate all the data. Thus, you may not be able to execute the whole range of SQL queries that are needed to satisfy your use case.
The non-collocated distributed joins have been designed and supported by Apache Ignite for cases when it's extremely difficult or impossible to collocate all the data but you still need to execute a number of SQL queries over non-collocated tables.
Do not overuse the non-collocated distributed joins based approach in practice because the performance of this type of joins is worse then the performance of the affinity collocation based joins due to the fact that there will be much more network round-trips and data movement between the nodes to fulfill a query.
If you use a non-collocated join on a column from a replicated table, the column must have an index. Otherwise, you will get an exception.
When the non-collocated distributed joins setting is enabled for a specific SQL query, then a node will request missing data (that is not present locally) from remote nodes by sending either broadcast or unicast requests.
Enabling Non-collocated Joins
Refer to JDBC, ODBC, Java, .NET, C++ docs to see how to enable the non-collocated mode.
For instance, JDBC requires to add
distributedJoins=trueparameter to a connection URL.
This process is depicted on Picture 2. below as a potential data movement step (
D(Q)). The potential unicast requests are only sent in cases when a join is done on a primary or an affinity key, since the node performing the join knows the location of the missing data. The broadcast requests are sent in all the other cases.
For performance reasons, both broadcast and unicast requests are aggregated into batches. This batch size can be managed using
Refer to the non-collocated distributed joins blog post for more technical details.
Updated 3 months ago