Managing Data with Dyalog: The VecDb Workshop

vecdb: The Dyalog Vector Database

Workshop W2: Managing Data with Dyalog

Morten Kromberg, CXO

Inverted Databases

• Traditional relational databases focus on being able to read a single record quickly, and on preserving its integrity during updates
• “Inverted” databases organize data by column: all the data for one column is adjacent

Examples:

• On the mainframe: Hydra, Mabra, …
• In modern times:
  – The k language has kdb+
  – J has jd
  – Paul Mansour has flibdb
  – … and now there is vecdb

VecDb - DYNA'16

┌───────────────────┐
│┌───┬─┬───────────┐│
││ABC│4│203.9034382││
│└───┴─┴───────────┘│
├───────────────────┤
│┌───┬─┬───────────┐│
││ABC│3│300.9898292││
│└───┴─┴───────────┘│
├───────────────────┤
│┌───┬─┬───────────┐│
││DEF│4│146.0736925││
│└───┴─┴───────────┘│
└───────────────────┘

┌─────┬─┬───────────┐
│┌───┐│4│203.9034382│
││ABC││3│300.9898292│
│├───┤│4│146.0736925│
││ABC││ │           │
│├───┤│ │           │
││DEF││ │           │
│└───┘│ │           │
└─────┴─┴───────────┘
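The two layouts in the diagrams can be sketched in plain APL. This is only an illustration built from the three sample records shown; the names rows, name, qty and price are made up here, it says nothing about how vecdb stores data on disk, and it assumes the Dyalog defaults ⎕IO←1 and ⎕ML←1.

```apl
⍝ Record-oriented: one nested vector per record
rows←('ABC' 4 203.9034382)('ABC' 3 300.9898292)('DEF' 4 146.0736925)

⍝ Inverted: mix into a matrix, transpose, split into one vector per column
(name qty price)←↓⍉↑rows

qty                    ⍝ all values of one column in a single vector → 4 3 4
2⊃¨name qty price      ⍝ record 2 rebuilt by picking from every column
```

Reading one column touches a single vector, while reading one record has to visit every column, which is the trade-off the slide describes.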

Advantages of Inverted DBs

• Each column is a simple, memory-mappable structure
• Much lower storage and memory requirements due to simpler structure (typically an order of magnitude)
• Searching and summarizing large numbers of records is often several orders of magnitude faster
  – Record-oriented DBs will sometimes invert or hash selected “key” columns: in an inverted DB *all* columns are fast
• Array language primitives (APL, J, k) can operate directly on memory-mapped arrays
  – Extremely simple implementation
  – Take advantage of all the clever work done by Hui, Foad, Whitney and others
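As a rough illustration of why searching and summarizing suit this layout, the sketch below applies ordinary primitives to whole columns. The vectors are small in-memory stand-ins using the sample values from the demo data (not memory-mapped files), and the name mask is hypothetical.

```apl
key←4 3 4 10 1
volume←203.9034382 300.9898292 146.0736925 303.0208711 5.828660818

mask←key=4         ⍝ one comparison over the whole column → 1 0 1 0 0
mask/volume        ⍝ volumes of the matching records → 203.9034382 146.0736925
+/mask/volume      ⍝ their total
⌈/volume           ⍝ largest volume → 303.0208711
```

Each step is a single primitive over a simple vector, so no per-record loop or record decoding is involved.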


Weaknesses of Inverted DBs

• Typically do not fully support “transactions” (except sometimes for append operations).

Goals of vecdb

• Provide a simple, fast storage mechanism for “a few gigabytes” of data
• Distributed, “sharded” database
  – Allows (highly) parallel queries
• Integrated with Dyalog APL / Free to all users
• Open source project: https://github.com/Dyalog/vecdb

Create a Database

      date←100/⍳1E4                 ⍝ 100 trades/day
      key←?1E6⍴10                   ⍝ 10 different keys in random order
      volume←1000×?1E6⍴0            ⍝ lots of noise
      ⎕←⍪¨5↑¨date key volume
 1   4  203.9034382
 1   3  300.9898292
 1   4  146.0736925
 1  10  303.0208711
 1   1    5.828660818
      columns←'date' 'key' 'volume'
      types←'I2' 'I1' 'F'
      options←⎕NS '' ⋄ options.BlockSize←2E6
      folder←'c:\devt\vecdb\demodb'
      db←⎕NEW #.vecdb ('demo' folder columns types options (date key volume))

Queries

      where←('date' 1)('key' 1)           ⍝ date=1 and key=1
      select←'date' 'key' 'volume'        ⍝ columns to read
      db.Query where select
      db.Query ⍬ 'sum volume' 'key'       ⍝ select sum(volume) group by key
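For comparison, the same two questions can be asked of ordinary in-memory vectors with plain APL primitives. This is a sketch of what the queries mean, not a description of how db.Query is implemented; the tiny columns and the name hit are made up for the example, and the Key operator ⌸ requires Dyalog 14.0 or later.

```apl
⍝ Small hand-made columns (illustrative values only)
date←1 1 1 2 2 2
key←1 2 1 1 2 2
volume←10 20 30 40 50 60

⍝ where date=1 and key=1, reading all three columns
hit←(date=1)∧key=1              ⍝ → 1 0 1 0 0 0
{hit/⍵}¨date key volume         ⍝ → (1 1)(1 1)(10 30)

⍝ select sum(volume) group by key
key{⍺,+/⍵}⌸volume               ⍝ one row per key: (1 80) and (2 130)
```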

Sharding

• You can partition, or “shard”, the database based on any computation, for example:

    options.(ShardCols ShardFn)←1 '{⌈(⊃⍵)÷5000}'
    options.ShardFolders←'/history' '/recent'

• The above uses column number 1 as input and puts the first 5000 values into the first shard, the next 5000 values into the next shard, etc.
• Shard folders can be located on separate machines
• Parallel queries can run on the machine where each shard is located
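To see what the shard function computes, it can be tried on a few sample values. The sketch assumes that the function receives a list of the shard columns as its right argument (which is what the ⊃⍵ in the slide's expression suggests) and that ⎕IO is 1; the sample dates are made up.

```apl
ShardFn←{⌈(⊃⍵)÷5000}              ⍝ same expression as in the options above
date←1 2500 5000 5001 10000       ⍝ sample values of column 1

ShardFn ,⊂date                    ⍝ shard number per record → 1 1 1 2 2
('/history' '/recent')[ShardFn ,⊂date]   ⍝ folder each record would land in
```

With dates running from 1 to 1E4, as in the demo data, the function only yields 1 or 2, which matches the two shard folders.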

Demo

Current Status

• Available now
• Unit Test Suite provides “Specification”
• In “production” use in one development project
• Under evaluation for a couple more
• Open Source is working: https://github.com/Dyalog/vecdb/graphs/contributors


To Come

• Extend datatype support
  – Current: Bool, I1, I2, I4, Float, Char
  – Char type limited to 16,767 different strings (I2 index into list of strings).
  – More Char types next up
• Simple joins of tables on a shared key (databases must be “equally” sharded)
• Parallel execution of queries
• Hook up SQAPL Server for ODBC / ADO / JDBC driver access
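The idea behind the current Char type, an integer index into a list of distinct strings, can be sketched in a few lines. This shows only the general technique with made-up names (names, symbols, index) and ⎕IO←1; it is not taken from the vecdb source.

```apl
names←'ABC' 'ABC' 'DEF' 'ABC'

symbols←∪names            ⍝ each distinct string stored once → 'ABC' 'DEF'
index←symbols⍳names       ⍝ small integers kept in the column → 1 1 2 1
symbols[index]≡names      ⍝ lookup restores the original strings → 1
```

The column itself then holds only small integers, and the number of distinct strings is bounded by the range of the index type.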


The VecDb workshop discusses the concept of Inverted Databases, highlighting their advantages and weaknesses. It aims to provide a simple, fast storage mechanism for data, emphasizing parallel queries and integration with Dyalog APL. The workshop covers creating databases, querying data, and the goals of VecDb as an open source project.


