Managing Data with Dyalog: The VecDb Workshop

vecdb: The Dyalog Vector Database
Workshop W2: Managing Data with Dyalog
Morten Kromberg, CXO

Inverted Databases
Traditional relational databases focus on being able to read a single record quickly and on preserving its integrity during updates.
“Inverted” databases organize data by column: all the data for one column is adjacent.
 
Examples:
On the mainframe: Hydra, Mabra, …
In modern times:
The k language has kdb+
J has jd
Paul Mansour has flibdb
... and now there is vecdb
VecDb - DYNA'16
 
Record-oriented: each record is stored as a unit
┌───┬─┬───────────┐
│ABC│4│203.9034382│
└───┴─┴───────────┘
┌───┬─┬───────────┐
│ABC│3│300.9898292│
└───┴─┴───────────┘
┌───┬─┬───────────┐
│DEF│4│146.0736925│
└───┴─┴───────────┘

Inverted: each column is stored as a unit
┌───┐ ┌─┐ ┌───────────┐
│ABC│ │4│ │203.9034382│
│ABC│ │3│ │300.9898292│
│DEF│ │4│ │146.0736925│
└───┘ └─┘ └───────────┘
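
The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration of the idea, not vecdb code; the column names `name`, `key` and `volume` are assumptions, since the slide does not label the columns.

```python
# The three records from the slide, stored record by record.
rows = [
    ("ABC", 4, 203.9034382),
    ("ABC", 3, 300.9898292),
    ("DEF", 4, 146.0736925),
]

def invert(records, names):
    """Turn a list of records into one list per column."""
    return {name: list(col) for name, col in zip(names, zip(*records))}

# Hypothetical column names, chosen for this illustration only.
columns = invert(rows, ["name", "key", "volume"])

# All values of one column now sit next to each other.
print(columns["key"])
```

Scanning or summarizing `columns["key"]` touches only that column, whereas the record-oriented list has to be walked record by record.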

Advantages of Inverted DBs
Each column is a simple, memory-mappable structure
Much lower storage and memory requirements due to simpler structure (typically an order of magnitude)
Searching and summarizing large numbers of records is often several orders of magnitude faster
Record-oriented DBs will sometimes invert or hash selected “key” columns: in an inverted DB *all* columns are fast
Array language primitives (APL, J, k) can operate directly on memory-mapped arrays
Extremely simple implementation
Take advantage of all the clever work done by Hui, Foad, Whitney and others
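
To make "memory-mappable" concrete, here is a minimal Python sketch using only the standard library. It assumes a column is stored as a flat run of fixed-width binary values; the file name `volume.col` and both helper functions are hypothetical and say nothing about the file format vecdb actually uses.

```python
import mmap
import os
import tempfile
from array import array

def write_column(path, typecode, values):
    """Store one column as a flat run of fixed-width binary values."""
    with open(path, "wb") as f:
        array(typecode, values).tofile(f)

def sum_column(path, typecode):
    """Map the column file into memory and sum it without parsing records."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Reinterpret the raw bytes as typed values; no copy is made.
        with memoryview(mm) as raw, raw.cast(typecode) as view:
            return sum(view)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "volume.col")  # hypothetical file name
    write_column(path, "d", [1.5, 2.25, 3.0])
    total = sum_column(path, "d")

print(total)
```

Because the file contains nothing but the values, the operating system can page the column in on demand and the summing code works on it as if it were an ordinary array.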

Weaknesses of Inverted DBs
Inverted databases typically do not fully support “transactions” (except, sometimes, for append operations).

Goals of vecdb
Provide a simple, fast storage mechanism for “a few gigabytes” of data
Distributed, “sharded” database
Allows (highly) parallel queries
Integrated with Dyalog APL
Free to all users
Open source project: https://github.com/Dyalog/vecdb

Create a Database
      date←100/⍳1E4       ⍝ 100 trades/day
      key←?1E6⍴10         ⍝ 10 different keys in random order
      volume←1000×?1E6⍴0  ⍝ lots of noise
      ⍪¨5↑¨date key volume
 1   4  203.9034382
 1   3  300.9898292
 1   4  146.0736925
 1  10  303.0208711
 1   1    5.828660818
      columns←'date' 'key' 'volume'
      types←'I2' 'I1' 'F'
      options←⎕NS '' ⋄ options.BlockSize←2E6
      folder←'c:\devt\vecdb\demodb'
      db←⎕NEW #.vecdb ('demo' folder columns types options (date key volume))
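
As a rough check on storage, the arithmetic for this demo table can be sketched in Python. The byte widths below are assumptions (I1 as a 1-byte integer, I2 as a 2-byte integer, F as an 8-byte float), and `storage_bytes` is a hypothetical helper, not part of vecdb.

```python
import struct

# Assumed fixed widths per type code: I1 = 1 byte, I2 = 2 bytes, F = 8 bytes.
WIDTHS = {
    "I1": struct.calcsize("<b"),
    "I2": struct.calcsize("<h"),
    "F": struct.calcsize("<d"),
}

def storage_bytes(columns, types, record_count):
    """Bytes needed per column if every value has a fixed width."""
    return {c: WIDTHS[t] * record_count for c, t in zip(columns, types)}

sizes = storage_bytes(["date", "key", "volume"], ["I2", "I1", "F"], 1_000_000)
total = sum(sizes.values())

print(sizes)
print(total)
```

Under these assumptions the million demo records need about 11 bytes each. The largest date (10,000) fits in a 2-byte integer and the largest key (10) fits in a 1-byte integer, which is why the narrow types are enough here.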

Queries
      where←('date' 1)('key' 1)     ⍝ date=1 and key=1
      select←'date' 'key' 'volume'  ⍝ columns to read
      db.Query where select
      db.Query ⍬ 'sum volume' 'key' ⍝ select sum(volume) group by key
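
What the two queries compute can be sketched over plain Python lists. This mirrors only the semantics described in the comments (an equality filter on several columns, and a sum grouped by key); the functions `query` and `sum_group_by` are hypothetical and are not the vecdb API.

```python
from collections import defaultdict

def query(columns, where, select):
    """Return the selected columns for rows matching every (column, value) pair."""
    n = len(next(iter(columns.values())))
    hits = [i for i in range(n) if all(columns[c][i] == v for c, v in where)]
    return {c: [columns[c][i] for i in hits] for c in select}

def sum_group_by(columns, value_col, key_col):
    """Equivalent of: select sum(value_col) group by key_col."""
    totals = defaultdict(float)
    for k, v in zip(columns[key_col], columns[value_col]):
        totals[k] += v
    return dict(totals)

# Tiny made-up table in column form.
demo = {
    "date": [1, 1, 1, 2],
    "key": [4, 3, 4, 4],
    "volume": [10.0, 20.0, 30.0, 40.0],
}

result = query(demo, [("date", 1), ("key", 4)], ["volume"])
totals = sum_group_by(demo, "volume", "key")
print(result)
print(totals)
```

Both functions read only the columns they need, which is the property that makes the inverted layout fast for filtering and summarizing.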

Sharding
You can partition, or “shard”, the database based on any computation, for example:
    options.(ShardCols ShardFn)←1 '{⌈(⊃⍵)÷5000}'
    options.ShardFolders←'/history' '/recent'
The above uses column number 1 as input and puts the first 5000 values into the first shard, the next 5000 values into the next shard, and so on.
Shard folders can be located on separate machines
Parallel queries can run on the machine where each shard is located
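
The partitioning rule described above can be sketched in Python as a ceiling division. This assumes the column values and the shard numbers both start at 1; `shard_of` and `partition` are hypothetical names used only for this illustration.

```python
def shard_of(value, size=5000):
    """Shard number for a value: 1..5000 -> 1, 5001..10000 -> 2, and so on."""
    return -(-value // size)  # ceiling division on integers

def partition(values, size=5000):
    """Group row positions by the shard their value falls into."""
    shards = {}
    for i, v in enumerate(values):
        shards.setdefault(shard_of(v, size), []).append(i)
    return shards

placement = partition([1, 5000, 5001, 9999])
print(placement)
```

Each shard then holds complete columns for its own rows, so a query can be evaluated on every shard independently and the partial results combined afterwards.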

Demo

Current Status
Available now
Unit Test Suite provides “Specification”
In “production” use in one development project
Under evaluation for a couple more
Open Source is working:
https://github.com/Dyalog/vecdb/graphs/contributors

To Come
Extend datatype support
Current: Bool, I1, I2, I4, Float, Char
Char type limited to 16,767 different strings (I2 index into list of strings)
More Char types next up
Simple joins of tables on a shared key (databases must be “equally” sharded)
Parallel execution of queries
Hook up SQAPL Server for ODBC/ADO/JDBC driver access
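
The Char representation mentioned above (a small integer index into a list of strings) is a form of dictionary encoding, which can be sketched in Python as follows. The zero-based indices and the function names `encode` and `decode` are assumptions for this illustration; the sketch does not reproduce vecdb's actual storage format or its limit on the number of strings.

```python
def encode(strings):
    """Replace each string by its index in a table of distinct strings."""
    table, index, codes = [], {}, []
    for s in strings:
        if s not in index:
            index[s] = len(table)
            table.append(s)
        codes.append(index[s])
    return table, codes

def decode(table, codes):
    """Recover the original strings from the table and the index column."""
    return [table[c] for c in codes]

table, codes = encode(["ABC", "ABC", "DEF"])
print(table, codes)
```

The index column has a fixed width per value, so it can be stored and memory-mapped like any numeric column, and the number of distinct strings is bounded by the largest index the chosen integer type can hold.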

The VecDb workshop discusses the concept of Inverted Databases, highlighting their advantages and weaknesses. It aims to provide a simple, fast storage mechanism for data, emphasizing parallel queries and integration with Dyalog APL. The workshop covers creating databases, querying data, and the goals of VecDb as an open source project.

  • Dyalog
  • VecDb Workshop
  • Inverted Databases
  • Parallel Queries
  • Open Source

Uploaded on Sep 15, 2024



