Count domain-based stats from the clickstream
30 mins
$ ~/apps/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic clickstream --replication-factor 1 --partitions 2
Use kafkacat to view the messages in the topic:
$ kafkacat -q -C -b localhost:9092 -t clickstream -f 'Partition %t[%p], offset: %o, key: %k, value: %s\n'
Or use the console consumer:
$ ~/apps/kafka/bin/kafka-console-consumer.sh \
--bootstrap-server localhost:9092 \
--property print.key=true --property key.separator=":" \
--topic clickstream
- Inspect the file and make any needed fixes:
src/main/java/x/utils/ClickStreamProducer.java
- Run the producer in Eclipse: right-click the file and run it as 'Java Application'
- Make sure it is sending messages as follows
- key : Domain
- value : clickstream data
- example :
key=facebook.com, value={"timestamp":1451635200005,"session":"session_251","domain":"facebook.com","cost":91,"user":"user_16","campaign":"campaign_5","ip":"ip_67","action":"clicked"}
Inspect the console consumer output; it should look something like this:
facebook.com:{"timestamp":1451635200005,"session":"session_251","domain":"facebook.com","cost":91,"user":"user_16","campaign":"campaign_5","ip":"ip_67","action":"clicked"}
cnn.com:{"timestamp":1451635200020,"session":"session_66","domain":"cnn.com","cost":31,"user":"user_29","campaign":"campaign_3","ip":"ip_49","action":"blocked"}
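Because the key separator is ":" and the JSON value also contains colons, each console line has to be split at the first ":" only. The following is a minimal sketch of that idea, not part of the lab code: the class `ClickstreamLineSketch` and its helpers are hypothetical names, the sample line is abbreviated from the output above, and the regex assumes the flat JSON layout shown there (a real JSON parser would be more robust).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClickstreamLineSketch {
    // Matches "action":"<value>" in the flat JSON shown above.
    // Assumption: this is a shortcut for illustration, not a general JSON parser.
    private static final Pattern ACTION =
            Pattern.compile("\"action\"\\s*:\\s*\"([^\"]*)\"");

    // Splits "key:value" at the FIRST ':' so colons inside the JSON are preserved.
    public static String[] splitKeyValue(String line) {
        int sep = line.indexOf(':');
        if (sep < 0) {
            return new String[] { null, line }; // no key present
        }
        return new String[] { line.substring(0, sep), line.substring(sep + 1) };
    }

    // Returns the value of the "action" field, or null if it is absent.
    public static String extractAction(String json) {
        Matcher m = ACTION.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "cnn.com:{\"timestamp\":1451635200020,"
                + "\"domain\":\"cnn.com\",\"action\":\"blocked\"}";
        String[] kv = splitKeyValue(line);
        System.out.println("key=" + kv[0] + ", action=" + extractAction(kv[1]));
    }
}
```

Splitting at the first separator works here because domain keys such as cnn.com contain no colon themselves.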
This consumer keeps a running total of the number of messages seen per domain in the clickstream.
- Inspect file :
src/main/java/x/lab06_domain_count/DomainCountConsumer.java
- Fix the TODO items
Use the Java API reference for the Consumer as a guide.
- Run lab06_domain_count.DomainCountConsumer in Eclipse
- Run utils.ClickStreamProducer in Eclipse
- Expected output
Got 10 messages
Received message : ConsumerRecord(.....
Domain Count is
[facebook.com=1]
Received message : ConsumerRecord(.....
Domain Count is
[facebook.com=1, foxnews.com=1]
Received message : ConsumerRecord(.....
Domain Count is
[facebook.com=2, foxnews.com=1]
...
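The running total shown above can be kept in a map keyed by domain. The sketch below illustrates only the counting step and is not the lab solution: `DomainCountSketch` is a hypothetical class, and a plain list of keys stands in for the records that the Kafka consumer would deliver from its poll loop. The printed map format may differ from the lab's output.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DomainCountSketch {
    // Running total of messages seen per domain (the record key).
    // LinkedHashMap keeps first-seen order, so the printed map is stable.
    private final Map<String, Integer> domainCounts = new LinkedHashMap<>();

    // Call once per received record; returns the updated count for that domain.
    public int record(String domain) {
        return domainCounts.merge(domain, 1, Integer::sum);
    }

    public Map<String, Integer> counts() {
        return domainCounts;
    }

    public static void main(String[] args) {
        // Assumption: these keys stand in for record keys read from the topic.
        List<String> keys = Arrays.asList("facebook.com", "foxnews.com", "facebook.com");
        DomainCountSketch sketch = new DomainCountSketch();
        for (String key : keys) {
            sketch.record(key);
            System.out.println("Domain Count is");
            System.out.println(sketch.counts());
        }
    }
}
```

`Map.merge` inserts 1 for a domain seen for the first time and otherwise adds 1 to the existing count, which avoids a separate containsKey check.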
This is a bonus task for you to try.
Say we saw 10 records from facebook.com, and 3 of them were clicks. The click ratio for facebook.com is then 3 / 10 = 30%.
Calculate the click ratio for every domain, and print out the domains with the highest click ratio.
Hint:
- You will want to keep track of total traffic (which we are doing here)
- And also keep a click-count (you will need to implement this)
- You can use another hashmap to keep track of clicks.
- And print out the ratio
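The hints above can be sketched with two maps, one for total traffic and one for clicks. This is one possible approach under stated assumptions, not the reference solution: `ClickRatioSketch` and its methods are hypothetical names, a click is assumed to be a record whose action is "clicked" (as in the sample data), and the action is assumed to have already been extracted from the JSON value.

```java
import java.util.HashMap;
import java.util.Map;

public class ClickRatioSketch {
    private final Map<String, Integer> totalCounts = new HashMap<>(); // all records per domain
    private final Map<String, Integer> clickCounts = new HashMap<>(); // "clicked" records per domain

    // Call once per record with its domain and action.
    public void record(String domain, String action) {
        totalCounts.merge(domain, 1, Integer::sum);
        if ("clicked".equals(action)) {
            clickCounts.merge(domain, 1, Integer::sum);
        }
    }

    // Click ratio in [0, 1]; 0.0 for a domain that has not been seen.
    public double clickRatio(String domain) {
        int total = totalCounts.getOrDefault(domain, 0);
        if (total == 0) {
            return 0.0;
        }
        return (double) clickCounts.getOrDefault(domain, 0) / total;
    }

    // Domain with the highest ratio, or null if nothing was recorded.
    // Ties are broken arbitrarily (HashMap iteration order).
    public String highestRatioDomain() {
        String best = null;
        double bestRatio = -1.0;
        for (String domain : totalCounts.keySet()) {
            double ratio = clickRatio(domain);
            if (ratio > bestRatio) {
                bestRatio = ratio;
                best = domain;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        ClickRatioSketch sketch = new ClickRatioSketch();
        // 10 records from facebook.com, 3 of them clicks, as in the example above.
        for (int i = 0; i < 10; i++) {
            sketch.record("facebook.com", i < 3 ? "clicked" : "blocked");
        }
        System.out.printf("facebook.com click ratio = %.0f%%%n",
                sketch.clickRatio("facebook.com") * 100);
    }
}
```

The cast to double matters: dividing two ints would truncate 3 / 10 to 0.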