Run SQL over your Maven artifacts.
Slides from the talk "Talking SQL to Strangers".
`camille-sql` allows you to explore Maven artifacts you have on your local hard drive.
Run the server:
```
$ bin/camille-server
Artifacts repository path: /Users/<user>/.m2/repository/
Running server on localhost:26727
...
```
The server understands the PostgreSQL wire protocol, so you can connect to it using the standard `psql` client:
```
$ PGPASSWORD=nopass psql "host=localhost port=26727 sslmode=disable"
psql (12.2, server 9.5.0)
Type "help" for help.

camille=>
```
As you can see, `psql` is absolutely sure it talks to PostgreSQL version 9.5.0.
Now you have access to 2 tables: `artifacts` and `versions`. You can run any read-only SQL query: the server supports projections, filtering, grouping, joins, aggregate functions, sub-queries etc. (pretty much all of SQL99).
Basic queries:
```
camille=> select * from artifacts limit 6;
    uid     |         group_id         |       artifact_id        |        name         |                   url
------------+--------------------------+--------------------------+---------------------+------------------------------------------
 3227713579 | alandipert               | desiderata               | desiderata          | https://github.com/alandipert/desiderata
 3382955103 | aopalliance              | aopalliance              | AOP alliance        | http://aopalliance.sourceforge.net
 1507835947 | asm                      | asm-parent               | ASM                 | http://asm.objectweb.org/
  226341444 | backport-util-concurrent | backport-util-concurrent | Backport of JSR 166 | http://backport-jsr166.sourceforge.net/
 1712481681 | biz.aQute                | bndlib                   | BND Library         | http://www.aQute.biz/Code/Bnd
 2280883480 | biz.aQute.bnd            | biz.aQute.bndlib         | biz.aQute.bndlib    | https://bnd.bndtools.org/
(6 rows)
```
```
camille=> select * from versions where filesize > 10000 limit 5;
    uid     | version | filesize |      last_modified      |                   sha1
------------+---------+----------+-------------------------+------------------------------------------
 3345961009 | 1.3.2   |   337129 | 2019-07-04 23:36:26.464 | ff84d15cfeb0825935a170d7908fbfae00498050
 1053708643 | 1.0.1   |    26514 | 2019-07-04 23:23:20.322 | 49c100caf72d658aca8e58bd74a4ba90fa2b0d70
 2740841946 | 1.6.5   |  1034049 | 2019-07-05 05:37:10.953 | 7d18faf23df1a5c3a43613952e0e8a182664564b
  925895164 | 0.4.4   |    42645 | 2020-02-01 06:45:59.599 | 2522f7f1b4bab169a2540406eb3eb71f7d6e3003
  136773645 | 1.9     |   263965 | 2019-07-04 23:25:30.09  | 9ce04e34240f674bc72680f8b843b1457383161a
(5 rows)
```
Something more complicated:
```
camille=> SELECT group_id, COUNT(*) AS n_files
          FROM artifacts
          LEFT JOIN versions ON artifacts.uid=versions.uid
          GROUP BY group_id
          ORDER BY n_files DESC
          LIMIT 10;
         group_id         | n_files
--------------------------+---------
 org.apache.flink         |     391
 org.apache.maven         |     245
 org.codehaus.plexus      |     186
 org.apache.hadoop        |     121
 org.apache.maven.doxia   |     108
 org.apache.maven.plugins |      82
 io.netty                 |      67
 org.apache.maven.shared  |      65
 org.apache.lucene        |      64
 org.apache.commons       |      62
(10 rows)
```
The project is mainly done out of pure curiosity:

- figure out what the low-level PostgreSQL transport protocol (`pgwire`) looks like
- check in practice how simple or hard it would be to implement `pgwire` as a Netty codec
- implement a simple but non-trivial example of defining a relational algebra system using Apache Calcite
- it's just fun and looks cool
- Netty to run the async I/O server
- Custom "codec" to encode/decode `pgwire` messages (see the `pgwire` package). The tricky part of the codec is that the very first message has a different structure compared to all following messages (per the PostgreSQL documentation: for purely historical reasons). The channel initializer creates a pipeline with `PgwireStartupMessageDecoder`, which removes itself after the first message is successfully processed (see the decoder sketch after this list).
- Server handler cycles over incoming SQL queries, decoding each query from the byte protocol and serializing the result set into a proper sequence of messages (row descriptor -> row data -> command complete); a query-loop sketch also follows this list.
- "Database" that actually executes queries is implemented in the `m2sql` package. It exposes a JDBC connection, so the server uses the standard `java.sql` interface when talking to it (see the documentation for the Apache Avatica library).
- Apache Calcite is used for query parsing, query planning and query optimization. The high-level API is used to declare the catalog structure, tables, schemas, relations and scanning logic.
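Here is a minimal sketch of that self-removing decoder pattern, built on Netty's `ByteToMessageDecoder`; the class name and the `StartupMessage` DTO are hypothetical, not the project's actual code:

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.ByteToMessageDecoder;
import java.util.List;

// Hypothetical sketch: the startup packet has no type byte, just
// Int32 length (which counts itself) and Int32 protocol version,
// followed by key/value parameter pairs.
public class StartupMessageDecoderSketch extends ByteToMessageDecoder {
  @Override
  protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
    if (in.readableBytes() < 8) {
      return;                               // wait for length + version
    }
    int length = in.getInt(in.readerIndex());
    if (in.readableBytes() < length) {
      return;                               // wait for the whole message
    }
    in.skipBytes(4);                        // length field (includes itself)
    int protocolVersion = in.readInt();     // e.g. 196608 = protocol 3.0
    ByteBuf params = in.readRetainedSlice(length - 8);
    out.add(new StartupMessage(protocolVersion, params));
    // The startup message occurs exactly once per connection; everything
    // after it carries a one-byte type tag, so this handler removes itself
    // and the regular message decoder takes over.
    ctx.pipeline().remove(this);
  }

  /** Hypothetical DTO for the decoded startup message. */
  record StartupMessage(int protocolVersion, ByteBuf parameters) {}
}
```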
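And a hedged sketch of the handler's query loop, showing how plain `java.sql` calls map onto the message sequence above (the method is illustrative; the real connection comes from the `m2sql` package):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;

// Illustrative only: execute one decoded query over standard JDBC and
// mirror the pgwire message sequence the handler has to emit.
static void executeQuery(Connection conn, String sql) throws SQLException {
  try (Statement stmt = conn.createStatement();
       ResultSet rs = stmt.executeQuery(sql)) {
    ResultSetMetaData meta = rs.getMetaData();  // -> row descriptor
    while (rs.next()) {                         // -> one data row per row
      for (int i = 1; i <= meta.getColumnCount(); i++) {
        Object value = rs.getObject(i);         // serialize into the row
      }
    }
  }                                             // -> command complete
}
```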
More details in the deck.
"Precision is the difference between a butcher and a surgeon" (tm)
Implemented optimizations (sketched right below):
- prune unused fields: if `filesize` is not queried, we don't need to waste CPU/RAM to calculate it
- push-down filtering predicates: jump into a subfolder if the prefix is known
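Both optimizations can surface through a single Calcite interface. A minimal sketch, assuming `ProjectableFilterableTable` (the class and the hardcoded rows are illustrative, not the project's actual `m2sql` code):

```java
import org.apache.calcite.DataContext;
import org.apache.calcite.linq4j.Enumerable;
import org.apache.calcite.linq4j.Linq4j;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.schema.ProjectableFilterableTable;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;
import java.util.ArrayList;
import java.util.List;

// Illustrative table: Calcite passes `projects` (the columns the query
// actually needs) and `filters` (predicates the scan may consume).
public class ArtifactsTableSketch extends AbstractTable
    implements ProjectableFilterableTable {

  @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    return typeFactory.builder()
        .add("GROUP_ID", SqlTypeName.VARCHAR)
        .add("ARTIFACT_ID", SqlTypeName.VARCHAR)
        .build();
  }

  @Override public Enumerable<Object[]> scan(DataContext root,
      List<RexNode> filters, int[] projects) {
    // Stand-in for the repository walk. A real scan would inspect `filters`
    // for a group_id prefix (e.g. LIKE 'com.apache.%') and descend straight
    // into com/apache/; predicates it consumes should be removed from the
    // list, and anything left behind is re-applied by Calcite itself.
    Object[][] rows = {
        {"org.apache.maven", "maven-core"},
        {"io.netty", "netty-codec"},
    };
    List<Object[]> out = new ArrayList<>();
    for (Object[] row : rows) {
      if (projects == null) {            // null means "all columns"
        out.add(row);
      } else {
        Object[] projected = new Object[projects.length];
        for (int i = 0; i < projects.length; i++) {
          projected[i] = row[projects[i]];
        }
        out.add(projected);              // unused fields never materialize
      }
    }
    return Linq4j.asEnumerable(out);
  }
}
```

Because Calcite re-applies any predicate the scan leaves in `filters`, the scan can stay conservative: consume only the prefix predicates it knows how to turn into a shorter folder walk and let the engine handle the rest.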
Work in progress:
- optimized join of `artifacts` and `versions`: we can walk the file tree once to retrieve all the information we need
This is a project made for fun. Feel free to implement whatever feature you want and just drop a PR here ;) See the TODO list below if you need ideas on what could be helpful (or what is critically missing).
- Network encoding logic baked into DTO objects is such a bad idea... Instead of a `toByteBuf` method on each message type, the logic should be implemented in a single encoder with dynamic type-based dispatch
- Propagate errors (like wrong queries) to the client instead of re-opening the connection
- Additional PostgreSQL client features, like `\l`, `show databases`, `show tables` (need to register `pg_catalog` to make this happen)
- The `pgwire` protocol has way more message types than are currently implemented
- Reject non-read queries (`insert`, `update` etc.)
- Push-down predicates for folder traversal (e.g. a `group_id LIKE 'com.apache.%'` predicate might be optimized by going directly to the `com/apache/` subfolder)
- Better CLI for the server (logs, args parser, help etc.)
- SSL, password authentication
- Carry cancel flag around
- `DELETE` versions
Copyright © 2020 camille-sql

`camille-sql` is licensed under the MIT license, available at MIT and also in the LICENSE file.