
Commit 20b3653
Author: s153398
Commit message: Final
1 parent 8c14c4c, commit 20b3653

4 files changed, +782 −744 lines changed

.ipynb_checkpoints/Project 2 - Web Traffic Analysis_v2-checkpoint.ipynb

+63 −44 lines changed
@@ -109,12 +109,11 @@
 "Our data structure is based on the Flajolet-Martin algorithm, which the Hyperloglog algorithm is also based on.\n",
 "To implement this algorithm, and estimate the number of unique IPs in the stream, we defined some helper functions: make_bucket, hash, and calc_prefix_length.\n",
 "Essentially, we are trying to estimate the number of unique elements in the stream by hashing the IP address into a bit string.\n",
+"We use a 64-bit hash function, which returns a tuple of two 32-bit hash values. The first is used to specify a bucket number, by taking the modulus with BUCKET_SIZE. The second value is used to make the count estimate:\n",
 "We know that the random chance of having a 0 in the first position of the bit string is 1/2, so the probability that the bit string starts with k 0's is (1/2)^k. We then use 2^k as the estimate of the number of unique elements seen in the stream.\n",
-"Additionally, to reduce the variance of overestimating, we introduce X amount of buckets, which are different hash functions in the same hash familiy (murmurhash3).\n",
-"Each bucket represents the largest number of zeros seen from any element in the stream for the corresponding hash function.\n",
-"Finally, we take the average of the bucket values (k) and use this to calculate our estimate (2^k).\n",
-"\n",
-"To make the estimate better, it has been found in litterature (http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) that removing the top 30% of the bucket values will improve the results significantly. Therefore, this strategy was implemented, yielding us better estimates.\n"
+"Additionally, to reduce the variance of overestimating, we introduce X (BUCKET_SIZE) different hash functions in the same hash family (murmurhash3).\n",
+"Each bucket represents the largest number of zeros in sequence seen from the start of any element in the stream for the corresponding bucket hash value.\n",
+"Finally, we take the harmonic mean of the bucket values (k) and use this to calculate our estimate (2^k). The harmonic mean is less influenced by large outliers, and is thus less prone to overestimation."
 ]
 },
 {
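The estimator described in the hunk above (hash each element, track the longest leading-zero run k, report 2^k) can be sketched in a few lines. This is a hypothetical stand-alone illustration, not the notebook's code: `zlib.crc32` stands in for murmurhash3, and only a single hash function is used, so the bucketing step is deliberately omitted.

```python
import zlib

def leading_zeros(x, bits=32):
    """Leading-zero count of x when rendered as a fixed-width bit string."""
    return bits - x.bit_length()

def fm_estimate(stream, bits=32):
    """Single-hash Flajolet-Martin: 2**k, where k is the longest
    leading-zero prefix seen over all hashed elements."""
    k = 0
    for element in stream:
        h = zlib.crc32(element.encode()) & 0xFFFFFFFF  # stand-in for murmurhash3
        k = max(k, leading_zeros(h, bits))
    return 2 ** k

estimate = fm_estimate(f"10.0.0.{i}" for i in range(1000))
print(estimate)  # always a power of two; high variance without bucketing
```

A single hash only ever yields a power-of-two estimate with high variance, which is exactly why the notebook adds multiple buckets and a harmonic mean.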
@@ -138,9 +137,14 @@
 "import numpy as np\n",
 "\n",
 "\n",
-"# Function for making hashing more readable. Returning the absolute value of a 32-bit hashing.\n",
-"def hash(element, seed):\n",
-"    return abs(mmh3.hash(element, seed = seed))\n",
+"# Function for making hashing more readable. Returning the absolute value of a 32-bit or 64-bit hash.\n",
+"def hash(element, seed, bit_size = 32):\n",
+"    if bit_size == 64:\n",
+"        hash_val = mmh3.hash64(element, seed = seed)\n",
+"        return (abs(hash_val[0]), str(bin(abs(hash_val[1]))))\n",
+"    else:\n",
+"        return abs(mmh3.hash(element, seed = seed))\n",
+"\n",
 "\n",
 "# Function to calculate the length of a bit string prefix of 0s:\n",
 "def calc_prefix_length(bit_string):\n",
@@ -152,11 +156,12 @@
 "\n",
 "# Function for creating or editing a bucket of prefix lengths, based on hashing in different seed values (0-BUCKET_SIZE).\n",
 "def make_bucket(element, bucket):\n",
-"    for i in range(BUCKET_SIZE):\n",
-"        hash_val = str(bin(hash(element, seed = i)))\n",
-"        prefix_len = calc_prefix_length(hash_val)\n",
-"        if prefix_len > bucket[i]:\n",
-"            bucket[i] = prefix_len\n",
+"    for i in range(BUCKET_SIZE):  # Number of hash functions\n",
+"        bucket_hash, estimate = hash(element, seed = i, bit_size = 64)\n",
+"        bucket_hash = bucket_hash % BUCKET_SIZE  # Bucket location\n",
+"        prefix_len = calc_prefix_length(estimate)\n",
+"        if prefix_len > bucket[bucket_hash]:\n",
+"            bucket[bucket_hash] = prefix_len\n",
 "    return bucket\n",
 "\n",
 "\n",
@@ -165,7 +170,7 @@
 "    X_min = 10**10  # Some large initial value\n",
 "    for i in range(d):\n",
 "        # In order to distinguish IP and domain combinations we combine these in the hash function.\n",
-"        wi = hash(IP+domain, seed = i) % w # w defines the size limit of returned hash values.\n",
+"        wi = hash(IP+domain, seed = i) % w   # w defines the size limit of returned hash values.\n",
 "        X_val = M[i, wi]\n",
 "        if X_val < X_min:\n",
 "            X_min = X_val\n",
@@ -179,7 +184,7 @@
 "outputs": [],
 "source": [
 "# Initialize constants\n",
-"BUCKET_SIZE = 32  # Number of hash functions in Flajolet-Martin algorithm\n",
+"BUCKET_SIZE = 32  # Number of hash functions and number of buckets the prefix lengths are inserted in (HyperLogLog)\n",
 "STREAM_SIZE = 100000  # Number of elements to go through\n",
 "d = 10  # Number of hash functions in Count-min sketch\n",
 "w = STREAM_SIZE  # Number of possible hash values in Count-min sketch\n",
@@ -233,7 +238,7 @@
 "* We only consider an IP that is not already there.\n",
 "* 1) Estimate the count of a particular IP\n",
 "* 2) Sort max_ip and max_count in order to easily compare with lowest value and insert if higher.\n",
-"* 3) Add if it is higher than the total count of the current IP, or if there is not already an IP.\n",
+"* 3) Add if it is higher than the total count of the current IP.\n",
 "\n",
 "The needed data structures (domains, M, and max_ip) are now ready for our analysis."
 ]
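The three steps above amount to maintaining a fixed-size top-X set keyed by estimated count. A hypothetical sketch using a min-heap (`heapq`) in place of the notebook's sorted max_ip/max_count lists; the names and data here are illustrative only:

```python
import heapq

X = 10
top = []  # min-heap of (estimated_count, ip); the weakest entry sits on top

def consider(ip, est_count):
    """Keep the X IPs with the highest estimated counts."""
    if any(ip == entry[1] for entry in top):
        return  # only consider an IP that is not already there
    if len(top) < X:
        heapq.heappush(top, (est_count, ip))
    elif est_count > top[0][0]:  # higher than the current lowest count
        heapq.heapreplace(top, (est_count, ip))

counts = [5, 50, 3, 99, 12, 7, 40, 61, 2, 88, 15, 33]
for i, c in enumerate(counts):
    consider(f"10.0.0.{i}", c)
print(max(top))  # → (99, '10.0.0.3')
```

The heap replaces the "sort and compare with the lowest value" step with an O(log X) operation per candidate.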
@@ -252,8 +257,7 @@
 "\n",
 "#### How many unique IPs are there in the stream?\n",
 "To answer this question, we define a function that extracts the bucket of a specific domain from the domains dictionary.\n",
-"As explained in the beginning, better results are yielded when removing the top 30% of the bucket. However, when using a harmonic mean the outliers have less influence and therefore we reduced this to 20%, partially because it gave better results.\n",
-"The function returns 2^k, where k is the harmonic mean of the lowest 80% of the bucket."
+"The function returns 2^k, where k is the harmonic mean of the bucket."
 ]
 },
 {
@@ -267,9 +271,8 @@
 "    return n/np.sum([1/val for val in alist if val != 0])\n",
 "\n",
 "def get_count(domain):\n",
-"    last_20 = round(len(domains[domain])*0.2)\n",
-"    bucket = sorted(domains[domain])\n",
-"    bucket = bucket[:-last_20]\n",
+"    bucket = [val for val in domains[domain] if val != 0]\n",
+"    bucket.sort()\n",
 "    return int(round(2**harmonic_mean(bucket)))"
 ]
 },
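get_count above collapses the bucket with a harmonic mean. A quick stand-alone illustration of why: compared with the arithmetic mean, the harmonic mean is pulled far less by a single outlier bucket, so the 2^k estimate overestimates less. (Toy bucket values, not the notebook's data.)

```python
def harmonic_mean(alist):
    vals = [val for val in alist if val != 0]  # zeros would divide by zero
    return len(vals) / sum(1 / val for val in vals)

bucket = [10, 11, 10, 12, 30]  # one outlier at 30
k_harm = harmonic_mean(bucket)
k_arith = sum(bucket) / len(bucket)
print(round(2 ** k_harm), round(2 ** k_arith))  # harmonic estimate is far smaller
```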
@@ -282,7 +285,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"The number of unique IPs are: 93892\n"
+"The number of unique IPs are: 87413\n"
 ]
 }
 ],
@@ -307,17 +310,17 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"total has 93892 unique IPs\n",
-"python.org has 34729 unique IPs\n",
-"wikipedia.org has 42035 unique IPs\n",
-"pandas.pydata.org has 8018 unique IPs\n",
-"dtu.dk has 1976 unique IPs\n",
-"google.com has 1875 unique IPs\n",
-"databricks.com has 908 unique IPs\n",
-"github.com has 949 unique IPs\n",
-"spark.apache.org has 323 unique IPs\n",
-"datarobot.com has 195 unique IPs\n",
-"scala-lang.org has 20 unique IPs\n"
+"total has 87413 unique IPs\n",
+"python.org has 33523 unique IPs\n",
+"wikipedia.org has 59177 unique IPs\n",
+"pandas.pydata.org has 12100 unique IPs\n",
+"dtu.dk has 2645 unique IPs\n",
+"google.com has 2647 unique IPs\n",
+"databricks.com has 1282 unique IPs\n",
+"github.com has 1433 unique IPs\n",
+"spark.apache.org has 737 unique IPs\n",
+"datarobot.com has 203 unique IPs\n",
+"scala-lang.org has 3 unique IPs\n"
 ]
 }
 ],
@@ -375,10 +378,10 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"The number of unique IPs visiting python.org is: 34729\n",
-"The number of unique IPs visiting wikipedia.org is: 42035\n",
-"The number of unique IPs visiting pandas.pydata.org is: 8018\n",
-"The number of unique IPs visiting github.com is: 949\n"
+"The number of unique IPs visiting python.org is: 33523\n",
+"The number of unique IPs visiting wikipedia.org is: 59177\n",
+"The number of unique IPs visiting pandas.pydata.org is: 12100\n",
+"The number of unique IPs visiting github.com is: 1433\n"
 ]
 }
 ],
@@ -433,7 +436,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 11,
+"execution_count": 13,
 "metadata": {},
 "outputs": [
 {
@@ -456,7 +459,7 @@
 ],
 "source": [
 "print('Top', X, 'IPs and their frequency:')\n",
-"for ip in max_ip:\n",
+"for ip in max_ip:  # Not sorted\n",
 "    print(ip, count_ip(ip,'total'))"
 ]
 },
@@ -465,7 +468,12 @@
 "metadata": {},
 "source": [
 "##### Document the accuracy of your answers when using algorithms that give approximate answers\n",
-"The Count-min sketch error analysis is derived from this article: http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf\n"
+"The Count-min sketch error analysis is derived from this article: http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf\n",
+"\n",
+"HyperLogLog accuracy:\n",
+"The accuracy is determined as 1/sqrt(m), where m is the number of buckets.\n",
+"We assume the hash function will uniformly distribute the element hash values to the 32 buckets, and we therefore set the number of buckets, m, to BUCKET_SIZE.\n",
+"Reference: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/40671.pdf"
 ]
 },
 {
@@ -478,22 +486,33 @@
 "output_type": "stream",
 "text": [
 "Error added with each element added to the M matrix: 2.7182818284590452e-05\n",
-"Probability of allowing a count estimate outside the error above: 4.5399929762484854e-05\n"
+"Probability of allowing a count estimate outside the error above: 4.5399929762484854e-05\n",
+"\n",
+"Error rate of HyperLogLog implementation: ±0.17677669529663687\n",
+"The variance is thus: 0.031249999999999993\n"
 ]
 }
 ],
 "source": [
 "import math\n",
 "# Count-min sketch error analysis:\n",
 "print('Error added with each element added to the M matrix:', math.e/w)\n",
-"print('Probability of allowing a count estimate outside the error above:', math.exp(-d))\n"
+"print('Probability of allowing a count estimate outside the error above:', math.exp(-d))\n",
+"print('')\n",
+"\n",
+"# HyperLogLog accuracy:\n",
+"error = 1/math.sqrt(BUCKET_SIZE)\n",
+"print('Error rate of HyperLogLog implementation: ±{}'.format(error))\n",
+"print('The variance is thus: {}'.format(error**2))"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We can thus conclude that with the chosen parameters for d and w, we have a very high chance of hitting the true value, i.e. a high accuracy."
+"We can thus conclude that with the chosen parameters for d and w, we have a very high chance of hitting the true value, i.e. a high accuracy.\n",
+"\n",
+"Our HyperLogLog implementation has a theoretical variance of ~3%."
 ]
 },
 {
@@ -505,7 +524,7 @@
 "* **BUCKET_SIZE = 32** -- We use 32 to minimize the risk of a string hashing to a bit value with many initial 0s. With a higher bucket size, the chance of hitting critically wrong is diminished, as the harmonic mean of the buckets is used.\n",
 "* **STREAM_SIZE = 100000** -- A stream size of 100000 is used to run through a significant number of elements in the stream.\n",
 "* **d = 10** -- For the Count-min sketch, the chance of two strings hashing to the same position reduces with a higher d and w. In our case, a d of 10 gave very good estimates when compared to actual counts.\n",
-"* **w = STREAM_SIZE** -- A w of 100000 means there are enough possible hash values for each single element in the stream. This combined with the d of 30 gives a total of 30 * 100000 positions in the M matrix and thus when using the Count-min sketch, a very good estimate of a count is achieved."
+"* **w = STREAM_SIZE** -- A w of 100000 means there are enough possible hash values for each single element in the stream. This combined with the d of 10 gives a total of 10 * 100000 positions in the M matrix and thus when using the Count-min sketch, a very good estimate of a count is achieved."
 ]
 }
 ],
