|
109 | 109 | "Our data structure is based on the Flajolet-Martin algorithm, which the Hyperloglog algorithm is also based on.\n",
|
110 | 110 | "To implement this algorithm, and estimate the number of unique IPs in the stream, we defined some helper functions: make_bucket, hash, and calc_prefix_length.\n",
|
111 | 111 | "Essentially, we are trying to estimate the number of unique elements in the stream by hashing the IP address into a bit string.\n",
|
| 112 | + "We use a 64-bit hash function, which returns a tuple of two 64-bit hash values. The first is used to select a bucket number by taking its value modulo BUCKET_SIZE. The second value is used to form the count estimate:\n", |
112 | 113 | "We know that the chance of a 0 in the first position of the bit string is 1/2, so the probability that the string starts with a run of k 0s is (1/2)^k. We then use 2^k as the estimate of the number of unique elements seen in the stream.\n",
|
113 |
| - "Additionally, to reduce the variance of overestimating, we introduce X amount of buckets, which are different hash functions in the same hash familiy (murmurhash3).\n", |
114 |
| - "Each bucket represents the largest number of zeros seen from any element in the stream for the corresponding hash function.\n", |
115 |
| - "Finally, we take the average of the bucket values (k) and use this to calculate our estimate (2^k).\n", |
116 |
| - "\n", |
117 |
| - "To make the estimate better, it has been found in litterature (http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf?fbclid=IwAR3eFpcI7Tfg7IFS6T3RQO7eNjHQ1COgVo8hFnh7PTqPIM2K55hK4TFcoyg) that removing the top 30% of the bucket values will improve the results significantly. Therefore, this strategy was implemented, yielding us better estimates.\n" |
| 114 | + "Additionally, to reduce the variance of overestimating, we introduce X (BUCKET_SIZE) different hash functions from the same hash family (murmurhash3).\n", |
| 115 | + "Each bucket stores the longest run of leading zeros seen in any element of the stream that hashes to that bucket.\n", |
| 116 | + "Finally, we take the harmonic mean of the bucket values (k) and use this to calculate our estimate (2^k). The harmonic mean is less influenced by large outliers, and is thus less prone to overestimation." |
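The per-bucket bookkeeping described above can be sketched in a few lines (a minimal illustration; `prefix_zeros`, `estimate`, and the fixed 32-bit width are hypothetical names for this sketch, not the notebook's helpers):

```python
def prefix_zeros(value, width=32):
    # Length of the run of leading 0s in the fixed-width binary form of value.
    bits = format(value, "0{}b".format(width))
    return len(bits) - len(bits.lstrip("0"))

def estimate(bucket):
    # Harmonic mean k of the non-zero per-bucket maxima, then 2**k as the
    # cardinality estimate; the harmonic mean damps large outliers.
    nonzero = [v for v in bucket if v != 0]
    k = len(nonzero) / sum(1.0 / v for v in nonzero)
    return 2 ** k

# prefix_zeros(1, width=8) -> 7, since 1 is '00000001'
# estimate([3, 3, 3])      -> 8.0, since the harmonic mean of [3, 3, 3] is 3
```

The harmonic mean is what makes a single bucket with an unluckily long zero run matter less than it would under an arithmetic mean.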
118 | 117 | ]
|
119 | 118 | },
|
120 | 119 | {
|
|
138 | 137 | "import numpy as np\n",
|
139 | 138 | "\n",
|
140 | 139 | "\n",
|
141 |
| - "# Function for making hashing more readable. Returning the absolute value of a 32-bit hashing.\n", |
142 |
| - "def hash(element, seed):\n", |
143 |
| - " return abs(mmh3.hash(element, seed = seed))\n", |
| 140 | + "# Function for making hashing more readable. Returning the absolute value of a 32-bit or 64-bit hashing.\n", |
| 141 | + "def hash(element, seed, bit_size = 32):\n", |
| 142 | + " if bit_size == 64:\n", |
| 143 | + " hash_val = mmh3.hash64(element, seed = seed)\n", |
| 144 | + "        return (abs(hash_val[0]), bin(abs(hash_val[1])))\n", |
| 145 | + " else:\n", |
| 146 | + " return abs(mmh3.hash(element, seed = seed))\n", |
| 147 | + " \n", |
144 | 148 | "\n",
|
145 | 149 | "# Function to calculate the length of a bit string prefix of 0s:\n",
|
146 | 150 | "def calc_prefix_length(bit_string):\n",
|
|
152 | 156 | "\n",
|
153 | 157 | "# Function for creating or editing a bucket of prefix lengths, based on hashing in different seed values (0-BUCKET_SIZE).\n",
|
154 | 158 | "def make_bucket(element, bucket):\n",
|
155 |
| - " for i in range(BUCKET_SIZE):\n", |
156 |
| - " hash_val = str(bin(hash(element, seed = i)))\n", |
157 |
| - " prefix_len = calc_prefix_length(hash_val)\n", |
158 |
| - " if prefix_len > bucket[i]:\n", |
159 |
| - " bucket[i] = prefix_len\n", |
| 159 | + " for i in range(BUCKET_SIZE): # Number of hash functions\n", |
| 160 | + " bucket_hash, estimate = hash(element, seed = i, bit_size = 64)\n", |
| 161 | + " bucket_hash = bucket_hash % BUCKET_SIZE # Bucket location\n", |
| 162 | + " prefix_len = calc_prefix_length(estimate)\n", |
| 163 | + " if prefix_len > bucket[bucket_hash]:\n", |
| 164 | + " bucket[bucket_hash] = prefix_len\n", |
160 | 165 | " return bucket\n",
|
161 | 166 | "\n",
|
162 | 167 | "\n",
|
|
165 | 170 | " X_min = 10**10 # Some large initial value\n",
|
166 | 171 | " for i in range(d):\n",
|
167 | 172 | " # In order to distinguish IP and domain combinations we combine these in the hash function.\n",
|
168 |
| - " wi = hash(IP+domain, seed = i) % w # w defines the size limit of returned hash values.\n", |
| 173 | + " wi = hash(IP+domain, seed = i) % w # w defines the size limit of returned hash values.\n", |
169 | 174 | " X_val = M[i, wi]\n",
|
170 | 175 | " if X_val < X_min:\n",
|
171 | 176 | " X_min = X_val\n",
|
|
179 | 184 | "outputs": [],
|
180 | 185 | "source": [
|
181 | 186 | "# Initialize constants\n",
|
182 |
| - "BUCKET_SIZE = 32 # Number of hash functions in Flajolet-Martin algorithm\n", |
| 187 | + "BUCKET_SIZE = 32 # Number of hash functions and number of buckets the prefix lens are inserted in (HyperLogLog)\n", |
183 | 188 | "STREAM_SIZE = 100000 # Number of elements to go through\n",
|
184 | 189 | "d = 10 # Number of hash functions in Count-min sketch\n",
|
185 | 190 | "w = STREAM_SIZE # Number of possible hash values in Count-min sketch\n",
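The Count-min sketch update/query cycle that these constants parameterize can be sketched as follows (a toy stand-alone illustration with hypothetical names; a seeded md5 stands in for the seeded murmurhash3 used in the notebook, and the dimensions are shrunk):

```python
import hashlib

D, W = 4, 1000  # toy dimensions; the notebook uses d = 10 and w = STREAM_SIZE

def h(item, seed, w=W):
    # Stand-in for seeded murmurhash3: md5 over "seed:item", reduced modulo w.
    digest = hashlib.md5("{}:{}".format(seed, item).encode()).hexdigest()
    return int(digest, 16) % w

def cms_update(M, item):
    # Increment one counter per row, each row using a differently seeded hash.
    for i in range(D):
        M[i][h(item, i)] += 1

def cms_query(M, item):
    # The minimum over the D counters bounds the overcounting from collisions.
    return min(M[i][h(item, i)] for i in range(D))

M = [[0] * W for _ in range(D)]
for x in ["a", "a", "b"]:
    cms_update(M, x)
# cms_query(M, "a") estimates the count of "a" (2 here, unless "b" happens
# to collide with "a" in every one of the D rows).
```

Because collisions only ever add to a counter, the sketch can overestimate but never underestimate, which is why the minimum over rows is the right aggregate.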
|
|
233 | 238 | "* We only consider an IP that is not already tracked.\n",
|
234 | 239 | "* 1) Estimate the count of a particular IP\n",
|
235 | 240 | "* 2) Sort max_ip and max_count in order to easily compare with lowest value and insert if higher.\n",
|
236 |
| - "* 3) Add if it is higher than the total count of the current IP, or if there is not already an IP.\n", |
| 241 | + "* 3) Add if it is higher than the total count of the current IP.\n", |
237 | 242 | "\n",
|
238 | 243 | "The needed data structures (domains, M, and max_ip) are now ready for our analysis."
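The insert-if-higher bookkeeping in the steps above can be sketched as follows (a simplified stand-alone version; `update_top_x` and the `(count, ip)` tuple layout are illustrative, not the notebook's `max_ip`/`max_count` structure):

```python
def update_top_x(ip, count, top, x=3):
    # top is a list of (count, ip) pairs kept sorted ascending by count.
    if any(existing_ip == ip for _, existing_ip in top):
        return top                      # skip IPs that are already tracked
    top.append((count, ip))
    top.sort()                          # sort so the smallest count sits first
    if len(top) > x:
        top.pop(0)                      # drop the lowest once over capacity
    return top

top = []
for ip, c in [("1.1.1.1", 5), ("2.2.2.2", 9), ("3.3.3.3", 2), ("4.4.4.4", 7)]:
    top = update_top_x(ip, c, top)
# top -> [(5, '1.1.1.1'), (7, '4.4.4.4'), (9, '2.2.2.2')]
```

Keeping the list sorted makes the comparison against the current minimum a constant-time lookup at index 0.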
|
239 | 244 | ]
|
|
252 | 257 | "\n",
|
253 | 258 | "#### How many unique IPs are there in the stream?\n",
|
254 | 259 | "To answer this question, we define a function that extracts the bucket of a specific domain from the domains dictionary.\n",
|
255 |
| - "As explained in the beginning, better results are yielded when removing the top 30% of the bucket. However, when using a harmonic mean the outliers have less influence and therefore we reduced this to 20%, partially because it gave better results.\n", |
256 |
| - "The function returns 2^k, where k is the harmonic mean of the lowest 80% of the bucket." |
| 260 | + "The function returns 2^k, where k is the harmonic mean of the bucket." |
257 | 261 | ]
|
258 | 262 | },
|
259 | 263 | {
|
|
267 | 271 | " return n/np.sum([1/val for val in alist if val != 0])\n",
|
268 | 272 | "\n",
|
269 | 273 | "def get_count(domain):\n",
|
270 |
| - " last_20 = round(len(domains[domain])*0.2)\n", |
271 |
| - " bucket = sorted(domains[domain])\n", |
272 |
| - " bucket = bucket[:-last_20]\n", |
| 274 | + " bucket = [val for val in domains[domain] if val != 0]\n", |
| 275 | + " bucket.sort()\n", |
273 | 276 | " return int(round(2**harmonic_mean(bucket)))"
|
274 | 277 | ]
|
275 | 278 | },
|
|
282 | 285 | "name": "stdout",
|
283 | 286 | "output_type": "stream",
|
284 | 287 | "text": [
|
285 |
| - "The number of unique IPs are: 93892\n" |
| 288 | + "The number of unique IPs are: 87413\n" |
286 | 289 | ]
|
287 | 290 | }
|
288 | 291 | ],
|
|
307 | 310 | "name": "stdout",
|
308 | 311 | "output_type": "stream",
|
309 | 312 | "text": [
|
310 |
| - "total has 93892 unique IPs\n", |
311 |
| - "python.org has 34729 unique IPs\n", |
312 |
| - "wikipedia.org has 42035 unique IPs\n", |
313 |
| - "pandas.pydata.org has 8018 unique IPs\n", |
314 |
| - "dtu.dk has 1976 unique IPs\n", |
315 |
| - "google.com has 1875 unique IPs\n", |
316 |
| - "databricks.com has 908 unique IPs\n", |
317 |
| - "github.com has 949 unique IPs\n", |
318 |
| - "spark.apache.org has 323 unique IPs\n", |
319 |
| - "datarobot.com has 195 unique IPs\n", |
320 |
| - "scala-lang.org has 20 unique IPs\n" |
| 313 | + "total has 87413 unique IPs\n", |
| 314 | + "python.org has 33523 unique IPs\n", |
| 315 | + "wikipedia.org has 59177 unique IPs\n", |
| 316 | + "pandas.pydata.org has 12100 unique IPs\n", |
| 317 | + "dtu.dk has 2645 unique IPs\n", |
| 318 | + "google.com has 2647 unique IPs\n", |
| 319 | + "databricks.com has 1282 unique IPs\n", |
| 320 | + "github.com has 1433 unique IPs\n", |
| 321 | + "spark.apache.org has 737 unique IPs\n", |
| 322 | + "datarobot.com has 203 unique IPs\n", |
| 323 | + "scala-lang.org has 3 unique IPs\n" |
321 | 324 | ]
|
322 | 325 | }
|
323 | 326 | ],
|
|
375 | 378 | "name": "stdout",
|
376 | 379 | "output_type": "stream",
|
377 | 380 | "text": [
|
378 |
| - "The number of unique IPs visiting python.org is: 34729\n", |
379 |
| - "The number of unique IPs visiting wikipedia.org is: 42035\n", |
380 |
| - "The number of unique IPs visiting pandas.pydata.org is: 8018\n", |
381 |
| - "The number of unique IPs visiting github.com is: 949\n" |
| 381 | + "The number of unique IPs visiting python.org is: 33523\n", |
| 382 | + "The number of unique IPs visiting wikipedia.org is: 59177\n", |
| 383 | + "The number of unique IPs visiting pandas.pydata.org is: 12100\n", |
| 384 | + "The number of unique IPs visiting github.com is: 1433\n" |
382 | 385 | ]
|
383 | 386 | }
|
384 | 387 | ],
|
|
433 | 436 | },
|
434 | 437 | {
|
435 | 438 | "cell_type": "code",
|
436 |
| - "execution_count": 11, |
| 439 | + "execution_count": 13, |
437 | 440 | "metadata": {},
|
438 | 441 | "outputs": [
|
439 | 442 | {
|
|
456 | 459 | ],
|
457 | 460 | "source": [
|
458 | 461 | "print('Top', X, 'IPs and their frequency:')\n",
|
459 |
| - "for ip in max_ip:\n", |
| 462 | + "for ip in max_ip: # Not sorted\n", |
460 | 463 | " print(ip, count_ip(ip,'total'))"
|
461 | 464 | ]
|
462 | 465 | },
|
|
465 | 468 | "metadata": {},
|
466 | 469 | "source": [
|
467 | 470 | "##### Document the accuracy of your answers when using algorithms that give approximate answers\n",
|
468 |
| - "The Count-min sketch error analysis is derived from this article: http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf\n" |
| 471 | + "The Count-min sketch error analysis is derived from this article: http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf\n", |
| 472 | + "\n", |
| 473 | + "HyperLogLog accuracy:\n", |
| 474 | + "The accuracy is determined as 1/sqrt(m), where m is the number of buckets.\n", |
| 475 | + "We assume the hash function will uniformly distribute the element hash values to the 32 buckets, and we therefore set the number of buckets, m, to BUCKET_SIZE.\n", |
| 476 | + "Reference: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/40671.pdf" |
469 | 477 | ]
|
470 | 478 | },
|
471 | 479 | {
|
|
478 | 486 | "output_type": "stream",
|
479 | 487 | "text": [
|
480 | 488 | "Error added with each element added to the M matrix: 2.7182818284590452e-05\n",
|
481 |
| - "Probability of allowing a count estimate outside the error above: 4.5399929762484854e-05\n" |
| 489 | + "Probability of allowing a count estimate outside the error above: 4.5399929762484854e-05\n", |
| 490 | + "\n", |
| 491 | + "Error rate of HyperLogLog implementation: ±0.17677669529663687\n", |
| 492 | + "The variance is thus: 0.031249999999999993\n" |
482 | 493 | ]
|
483 | 494 | }
|
484 | 495 | ],
|
485 | 496 | "source": [
|
486 | 497 | "import math\n",
|
487 | 498 | "# Count-min sketch error analysis:\n",
|
488 | 499 | "print('Error added with each element added to the M matrix:', math.e/w)\n",
|
489 |
| - "print('Probability of allowing a count estimate outside the error above:', math.exp(-d))\n" |
| 500 | + "print('Probability of allowing a count estimate outside the error above:', math.exp(-d))\n", |
| 501 | + "print('')\n", |
| 502 | + "\n", |
| 503 | + "# HyperLogLog accuracy:\n", |
| 504 | + "error = 1/math.sqrt(BUCKET_SIZE)\n", |
| 505 | + "print('Error rate of HyperLogLog implementation: ±{}'.format(error))\n", |
| 506 | + "print('The variance is thus: {}'.format(error**2))" |
490 | 507 | ]
|
491 | 508 | },
|
492 | 509 | {
|
493 | 510 | "cell_type": "markdown",
|
494 | 511 | "metadata": {},
|
495 | 512 | "source": [
|
496 |
| - "We can thus conclude that with the chosen parameters for d and w, we have a very high chance of hitting the true value, i.e. a high accuracy." |
| 513 | + "We can thus conclude that with the chosen parameters for d and w, we have a very high chance of hitting the true value, i.e. a high accuracy.\n", |
| 514 | + "\n", |
| 515 | + "Our HyperLogLog implementation has a theoretical variance of ~3%." |
497 | 516 | ]
|
498 | 517 | },
|
499 | 518 | {
|
|
505 | 524 | "* **BUCKET_SIZE = 32** -- We use 32 to reduce the risk that a single string hashing to a bit value with many leading 0s skews the estimate. With a higher bucket size, the chance of a critically wrong estimate is diminished, as the harmonic mean of the buckets is used.\n",
|
506 | 525 | "* **STREAM_SIZE = 100000** -- A stream size of 100000 is used to run through a significant number of elements in the stream. \n",
|
507 | 526 | "* **d = 10** -- For the Count-min sketch, the chance of two strings hashing to the same position reduces with a higher d and w. In our case, a d of 10 gave very good estimates when compared to actual counts.\n",
|
508 |
| - "* **w = STREAM_SIZE** -- A w of 100000 means there are enough possible hash values for each single element in the stream. This combined with the d of 30 gives a total of 30 * 100000 positions in the M matrix and thus when using the Count-min sketch, a very good estimate of a count is achieved." |
| 527 | + "* **w = STREAM_SIZE** -- A w of 100000 means there are enough possible hash values for each single element in the stream. This combined with the d of 10 gives a total of 10 * 100000 positions in the M matrix and thus when using the Count-min sketch, a very good estimate of a count is achieved." |
509 | 528 | ]
|
510 | 529 | }
|
511 | 530 | ],
|
|