Intelligent CIO North America Issue 35

... streaming data. Druid is the database of choice for analytics use cases including operational visibility of real-time events, rapid data exploration, customer-facing analytics and real-time decisioning.

Imply has announced a new capability that makes Druid the first analytics database that can provide the performance of a strongly typed data structure with the flexibility of a schemaless data structure. Schema auto-discovery, now available in Druid 26.0, enables Druid to automatically discover data fields and data types and to update tables to match changing data without administrator intervention.

Can you provide more details about the new features introduced in Milestone 3 of Project Shapeshift, such as schema auto-discovery and shuffle joins?

Schema auto-discovery

Schema definition plays an essential role in query performance, as a strongly typed data structure makes it possible to columnarize, index and optimize compression.

Defining the schema when loading data places an operational burden on engineering teams, especially with ever-changing event data flowing through Apache Kafka and Amazon Kinesis. Databases such as MongoDB use a schemaless data structure because it provides developer flexibility and ease of ingestion, but at a cost to query performance.
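To make the idea of discovering fields and types from changing event data concrete, here is a minimal Python sketch. It is not Druid's implementation: the type names (LONG, DOUBLE, STRING), the widening rules and the function names are assumptions chosen for illustration only.

```python
from typing import Any, Dict, Iterable


def infer_type(value: Any) -> str:
    """Map a Python value to a column type (hypothetical type names)."""
    if isinstance(value, bool):
        return "STRING"  # assumption: booleans are stored as strings in this sketch
    if isinstance(value, int):
        return "LONG"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"


def widen(current: str, incoming: str) -> str:
    """Pick a type able to hold both the existing and the newly observed values."""
    if current == incoming:
        return current
    if {current, incoming} == {"LONG", "DOUBLE"}:
        return "DOUBLE"
    return "STRING"


def discover_schema(events: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Build a typed schema from events, adding new fields as they appear."""
    schema: Dict[str, str] = {}
    for event in events:
        for field, value in event.items():
            if value is None:
                continue  # a missing value tells us nothing about the type
            observed = infer_type(value)
            schema[field] = widen(schema[field], observed) if field in schema else observed
    return schema


# The second event adds a field and changes the type of an existing one.
events = [
    {"user": "alice", "latency_ms": 12},
    {"user": "bob", "latency_ms": 7.5, "region": "us-east"},
]
schema = discover_schema(events)
```

In this sketch the schema stays strongly typed per column, yet nobody has to declare `region` or change `latency_ms` by hand when the event shape changes.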

Shuffle joins

The second major feature is the expansion of Druid’s architecture to support large, complex joins at ingestion via shuffle joins. Previous join capabilities were limited in order to maintain high CPU efficiency for query performance, so large tables had to be pre-joined in the data pipeline using other systems such as Spark.

Druid has enhanced its ingestion capabilities to support large joins, implemented architecturally via shuffle joins. This simplifies data preparation, minimizes reliance on external tools and adds to Druid’s capabilities for in-database data transformation.
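The general shuffle join technique can be sketched in a few lines of Python. This is an illustration of the concept, not Druid's code: the function names, the partition count and the sample tables are hypothetical. Both inputs are redistributed ("shuffled") by a hash of the join key so that matching rows land in the same partition, and each partition is then joined independently.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]


def shuffle(rows: List[Row], key: str, num_partitions: int) -> List[List[Row]]:
    """Redistribute rows so that equal join keys land in the same partition."""
    partitions: List[List[Row]] = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def shuffle_join(left: List[Row], right: List[Row], key: str,
                 num_partitions: int = 4) -> List[Row]:
    """Inner join: shuffle both sides by key, then hash-join each partition pair."""
    joined: List[Row] = []
    left_parts = shuffle(left, key, num_partitions)
    right_parts = shuffle(right, key, num_partitions)
    for left_part, right_part in zip(left_parts, right_parts):
        # Build a lookup table on one side of this partition only.
        index: Dict[Any, List[Row]] = defaultdict(list)
        for right_row in right_part:
            index[right_row[key]].append(right_row)
        # Probe it with the other side.
        for left_row in left_part:
            for right_row in index.get(left_row[key], []):
                joined.append({**left_row, **right_row})
    return joined


# Hypothetical sample tables.
orders = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 2, "customer_id": "c2"},
    {"order_id": 3, "customer_id": "c1"},
]
customers = [
    {"customer_id": "c1", "name": "Acme"},
    {"customer_id": "c2", "name": "Globex"},
]
joined = shuffle_join(orders, customers, "customer_id")
```

Because each partition pair can be processed on its own, neither table has to fit in a single worker's memory, which is what makes the approach suitable for large joins. The order of the output rows depends on how keys hash into partitions.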

How does the automatic schema discovery feature in Druid 26.0 benefit developers and address the challenges of defining schemas when loading data?

Developers rely on a strongly typed data format because defined types per column bring query performance advantages in query optimization, columnarization, compression, etc.

76 INTELLIGENTCIO NORTH AMERICA www.intelligentcio.com
