Welcome to the CrunchDB Kata! The purpose of this exercise is to evaluate how you approach software development, testing, documentation and problem solving.
Our goal is to build a system (including an on-disk storage format) that stores people's preferences for favourite singers and car brands from a given list of choices. Suppose we asked those people:
- "What's your favourite car brand?" (can pick only 1)
- "Which car brands do you like?" (can pick multiple)
- "Which car brands do you currently own?" (can pick multiple)
- "Which car brands have your ever owned?" (can pick multiple)
- "What's your favourite music artist?" (can pick only 1)
- "Which music artist would you vote at a music competition?" (can pick only 1)
- "Which music artists have you listened to?" (can pick multiple)
- "Which music artists do you known?" (can pick multiple)
- "Which music artists do you dislike?" (can pick multiple)
We want to be able to answer these questions:
- What's the most frequently owned car brand?
- What's the favourite car brand?
- What's the most listened to music artist?
- What's the favourite music artist?
The project consists of the following components:
- Defined constants - The pre-defined lists of car brands, singers, allowed answers to questions (e.g. "yes", "no", and "not_answered"), and JSON keys for input data (see the sketch after this list).
- Random data generating script - Run this to create simulated, random chunks of input survey data containing the preferences.
- A Python script (`acquisition.py`) to load the preferences as they arrive and queue them into a MongoDB database.
- A Python script (`storage.py`) to fetch the preferences from MongoDB and move them into the system.
- A Python script (`query.py`) to read the data stored in the system and answer one of the questions.
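For illustration, such a constants module might look like the sketch below. Every brand, artist, and JSON key name here is a placeholder, not the actual pre-defined data:

```python
# constants.py - a minimal sketch of the shared constants module.
# All brand/artist values and key names below are placeholders.

CAR_BRANDS = ["toyota", "ford", "fiat", "bmw", "tesla"]      # hypothetical list
MUSIC_ARTISTS = ["adele", "drake", "beyonce", "coldplay"]    # hypothetical list

# Allowed answers to each question.
ANSWERS = ("yes", "no", "not_answered")

# JSON keys for the input preference files (assumed names, one per question).
KEY_FAVOURITE_CAR = "favourite_car_brand"
KEY_LIKED_CARS = "liked_car_brands"
KEY_OWNED_CARS = "owned_car_brands"
KEY_EVER_OWNED_CARS = "ever_owned_car_brands"
KEY_FAVOURITE_ARTIST = "favourite_artist"
KEY_VOTED_ARTIST = "voted_artist"
KEY_LISTENED_ARTISTS = "listened_artists"
KEY_KNOWN_ARTISTS = "known_artists"
KEY_DISLIKED_ARTISTS = "disliked_artists"
```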
Each time one of our survey members responds to the survey with the mentioned preferences, they send us a file in JSON format with their answers. (For testing purposes these are the files generated by the data/generata_data.py script mentioned above.)
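One plausible shape for such a file is shown below, using the placeholder keys from the constants sketch (the real shape is whatever the generator script produces): single-pick questions map to one value, multi-pick questions to a list, and "not_answered" marks a skipped question. The record is pretty-printed here for readability:

```json
{
  "favourite_car_brand": "toyota",
  "liked_car_brands": ["toyota", "tesla"],
  "owned_car_brands": ["toyota", "fiat"],
  "ever_owned_car_brands": ["toyota", "fiat", "ford"],
  "favourite_artist": "adele",
  "voted_artist": "not_answered",
  "listened_artists": ["adele", "drake"],
  "known_artists": ["adele", "drake", "coldplay"],
  "disliked_artists": ["drake"]
}
```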
We don't want to load the preferences into our system immediately as they arrive; we want to load them only when they might reasonably constitute a significant change to the outcome of the questions.
For this reason we want to queue the preferences into a MongoDB database until we accumulate a few of them and only flush them to the system on demand.
So the `acquisition.py` script must load the preference `.jsonl` files into MongoDB.
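A minimal sketch of that loading step, assuming a local MongoDB instance and `.jsonl` input where each line is one respondent's answers; the database and collection names (`crunchdb`, `pending_preferences`) are arbitrary choices:

```python
# acquisition.py - a sketch, assuming MongoDB runs on localhost and the
# "crunchdb.pending_preferences" names are ours to pick.
import json
import sys

from pymongo import MongoClient


def load_preferences(jsonl_path: str) -> int:
    """Queue every preference record from a .jsonl file into MongoDB."""
    client = MongoClient("mongodb://localhost:27017")
    queue = client["crunchdb"]["pending_preferences"]
    with open(jsonl_path, encoding="utf-8") as fh:
        docs = [json.loads(line) for line in fh if line.strip()]
    if docs:
        queue.insert_many(docs)
    return len(docs)


if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"{path}: queued {load_preferences(path)} preferences")
```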
Every time the `storage.py` script is started it should check for new data in MongoDB and consume it in chunks of 50 preferences at a time. If fewer than 50 preferences are left in MongoDB we won't consume them.
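A sketch of that consumption loop, reusing the assumed `crunchdb.pending_preferences` collection from the acquisition sketch; `write_to_system` is a hypothetical hook into your on-disk storage layer:

```python
# storage.py consumption loop - a sketch of the "chunks of 50" rule.
from pymongo import MongoClient

CHUNK_SIZE = 50


def consume_chunks():
    client = MongoClient("mongodb://localhost:27017")
    queue = client["crunchdb"]["pending_preferences"]
    while queue.count_documents({}) >= CHUNK_SIZE:
        # Take exactly 50 queued preferences, hand them to the storage layer,
        # then delete them from the queue so they are not consumed twice.
        chunk = list(queue.find().limit(CHUNK_SIZE))
        write_to_system(chunk)  # hypothetical: persists into the on-disk format
        queue.delete_many({"_id": {"$in": [doc["_id"] for doc in chunk]}})
    # Fewer than 50 preferences remain: leave them queued for a later run.
```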
When the answers are consumed from MongoDB, they must be moved into the system, which keeps data in whatever format you prefer as long as it's an on-disk format (a single file or multiple files, whichever you find more convenient).
The on-disk format must be optimized for minimal disk space consumption and fast answering of the questions by the query script. In general it should aim to consume less disk space than the format the answers originally arrived in and be quick to answer queries, though it may sacrifice write speed.
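The choice of format is yours. One possibility, sketched below, exploits the fact that all four questions reduce to "which option has the highest count?" and persists only one fixed-width counter per option, so the store stays a few hundred bytes no matter how many preferences are ingested:

```python
# One possible on-disk layout (a sketch, not a prescription): one uint32
# counter per option, written in the fixed order of the constants lists.
import struct

# Hypothetical option list; in practice import it from the constants module.
CAR_BRANDS = ["toyota", "ford", "fiat", "bmw", "tesla"]


def save_counts(path: str, counts: dict[str, int]) -> None:
    """Write one little-endian uint32 per brand, in CAR_BRANDS order."""
    packed = struct.pack(f"<{len(CAR_BRANDS)}I",
                         *(counts.get(brand, 0) for brand in CAR_BRANDS))
    with open(path, "wb") as fh:
        fh.write(packed)


def load_counts(path: str) -> dict[str, int]:
    """Read the counters back into a brand -> count mapping."""
    with open(path, "rb") as fh:
        values = struct.unpack(f"<{len(CAR_BRANDS)}I", fh.read())
    return dict(zip(CAR_BRANDS, values))
```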
The query interface to the system (the `query.py` script) must allow software to retrieve, on demand, the answer to one and only one of the questions.
It must be possible to indicate which question (out of the previously mentioned four) we want answered, so the question answered must not be hard-coded into the system.
An HTTP GET endpoint, an HTTP/JSON API, or a command-line tool are all perfectly acceptable ways to provide an interface to the system; you are free to choose whichever you find most convenient.
Just keep in mind that, in theory, another developer will have to write software that consumes the answers you provide, so it should not be too hard to interact with the query system programmatically.
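As a sketch of the command-line flavour, the tool below selects the question by name and prints a JSON answer on stdout, which keeps it easy for other software to consume. It assumes the fixed-order counter files from the previous sketch; all file, question, and option names are placeholders:

```python
# query.py - a command-line sketch of the query interface.
import argparse
import json
import struct

CAR_BRANDS = ["toyota", "ford", "fiat", "bmw", "tesla"]      # hypothetical
MUSIC_ARTISTS = ["adele", "drake", "beyonce", "coldplay"]    # hypothetical

# Map each CLI question name to the counter file and option list it needs.
QUESTIONS = {
    "most-owned-car": ("owned_cars.bin", CAR_BRANDS),
    "favourite-car": ("favourite_car.bin", CAR_BRANDS),
    "most-listened-artist": ("listened_artists.bin", MUSIC_ARTISTS),
    "favourite-artist": ("favourite_artist.bin", MUSIC_ARTISTS),
}


def answer(question: str) -> dict:
    """Load the counters for one question and return the winning option."""
    path, options = QUESTIONS[question]
    with open(path, "rb") as fh:
        counts = dict(zip(options,
                          struct.unpack(f"<{len(options)}I", fh.read())))
    winner = max(counts, key=counts.get)
    return {"question": question, "answer": winner, "count": counts[winner]}


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Answer one survey question.")
    parser.add_argument("question", choices=sorted(QUESTIONS))
    print(json.dumps(answer(parser.parse_args().question)))
```

Invoked as `python query.py favourite-car`, it would print something like `{"question": "favourite-car", "answer": "toyota", "count": 128}`.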