|
109 | 109 | "Our data structure is based on the Flajolet-Martin algorithm, which the Hyperloglog algorithm is also based on.\n",
|
110 | 110 | "To implement this algorithm, and estimate the number of unique IPs in the stream, we defined some helper functions: make_bucket, hash, and calc_prefix_length.\n",
|
111 | 111 | "Essentially, we are trying to estimate the number of unique elements in the stream by hashing the IP address into a bit string.\n",
|
| 112 | + "We use a 64-bit hash function, which returns a tuple of two 64-bit hash values. The first is used to select a bucket number by taking its value modulo BUCKET_SIZE. The second value is used to form the count estimate:\n", |
112 | 113 | "We know that the chance of a 0 in the first position of the bit string is 1/2, so the probability that the string starts with a run of k 0s is (1/2)^k. We then use 2^k as the estimate of the number of unique elements seen in the stream.\n",
|
113 |
| - "Additionally, to reduce the variance of overestimating, we introduce X amount of buckets, which are different hash functions in the same hash familiy (murmurhash3).\n", |
114 |
| - "Each bucket represents the largest number of zeros seen from any element in the stream for the corresponding hash function.\n", |
115 |
| - "Finally, we take the average of the bucket values (k) and use this to calculate our estimate (2^k).\n", |
116 |
| - "\n", |
117 |
| - "To make the estimate better, it has been found in litterature (http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf?fbclid=IwAR3eFpcI7Tfg7IFS6T3RQO7eNjHQ1COgVo8hFnh7PTqPIM2K55hK4TFcoyg) that removing the top 30% of the bucket values will improve the results significantly. Therefore, this strategy was implemented, yielding us better estimates.\n" |
| 114 | + "Additionally, to reduce the variance of overestimating, we introduce X (BUCKET_SIZE) different hash functions from the same hash family (murmurhash3).\n", |
| 115 | + "Each bucket stores the longest run of leading zeros seen in any element of the stream that hashes to that bucket.\n", |
| 116 | + "Finally, we take the harmonic mean of the bucket values (k) and use this to calculate our estimate (2^k). The harmonic mean is less influenced by large outliers, and is thus less prone to overestimation." |
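The per-bucket bookkeeping described above can be sketched in a few lines (a minimal illustration; `prefix_zeros`, `estimate`, and the fixed 32-bit width are hypothetical names for this sketch, not the notebook's helpers):

```python
def prefix_zeros(value, width=32):
    # Length of the run of leading 0s in the fixed-width binary form of value.
    bits = format(value, "0{}b".format(width))
    return len(bits) - len(bits.lstrip("0"))

def estimate(bucket):
    # Harmonic mean k of the non-zero per-bucket maxima, then 2**k as the
    # cardinality estimate; the harmonic mean damps large outliers.
    nonzero = [v for v in bucket if v != 0]
    k = len(nonzero) / sum(1.0 / v for v in nonzero)
    return 2 ** k

# prefix_zeros(1, width=8) -> 7, since 1 is '00000001'
# estimate([3, 3, 3])      -> 8.0, since the harmonic mean of [3, 3, 3] is 3
```

The harmonic mean is what makes a single bucket with an unluckily long zero run matter less than it would under an arithmetic mean.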
118 | 117 | ]
|
119 | 118 | },
|
120 | 119 | {
|
|
138 | 137 | "import numpy as np\n",
|
139 | 138 | "\n",
|
140 | 139 | "\n",
|
141 |
| - "# Function for making hashing more readable. Returning the absolute value of a 32-bit hashing.\n", |
142 |
| - "def hash(element, seed):\n", |
143 |
| - " return abs(mmh3.hash(element, seed = seed))\n", |
| 140 | + "# Function for making hashing more readable. Returning the absolute value of a 32-bit or 64-bit hashing.\n", |
| 141 | + "def hash(element, seed, bit_size = 32):\n", |
| 142 | + " if bit_size == 64:\n", |
| 143 | + " hash_val = mmh3.hash64(element, seed = seed)\n", |
| 144 | + "        return (abs(hash_val[0]), bin(abs(hash_val[1])))\n", |
| 145 | + " else:\n", |
| 146 | + " return abs(mmh3.hash(element, seed = seed))\n", |
| 147 | + " \n", |
144 | 148 | "\n",
|
145 | 149 | "# Function to calculate the length of a bit string prefix of 0s:\n",
|
146 | 150 | "def calc_prefix_length(bit_string):\n",
|
|
152 | 156 | "\n",
|
153 | 157 | "# Function for creating or editing a bucket of prefix lengths, based on hashing in different seed values (0-BUCKET_SIZE).\n",
|
154 | 158 | "def make_bucket(element, bucket):\n",
|
155 |
| - " for i in range(BUCKET_SIZE):\n", |
156 |
| - " hash_val = str(bin(hash(element, seed = i)))\n", |
157 |
| - " prefix_len = calc_prefix_length(hash_val)\n", |
158 |
| - " if prefix_len > bucket[i]:\n", |
159 |
| - " bucket[i] = prefix_len\n", |
| 159 | + " for i in range(BUCKET_SIZE): # Number of hash functions\n", |
| 160 | + " bucket_hash, estimate = hash(element, seed = i, bit_size = 64)\n", |
| 161 | + " bucket_hash = bucket_hash % BUCKET_SIZE # Bucket location\n", |
| 162 | + " prefix_len = calc_prefix_length(estimate)\n", |
| 163 | + " if prefix_len > bucket[bucket_hash]:\n", |
| 164 | + " bucket[bucket_hash] = prefix_len\n", |
160 | 165 | " return bucket\n",
|
161 | 166 | "\n",
|
162 | 167 | "\n",
|
|
165 | 170 | " X_min = 10**10 # Some large initial value\n",
|
166 | 171 | " for i in range(d):\n",
|
167 | 172 | " # In order to distinguish IP and domain combinations we combine these in the hash function.\n",
|
168 |
| - " wi = hash(IP+domain, seed = i) % w # w defines the size limit of returned hash values.\n", |
| 173 | + " wi = hash(IP+domain, seed = i) % w # w defines the size limit of returned hash values.\n", |
169 | 174 | " X_val = M[i, wi]\n",
|
170 | 175 | " if X_val < X_min:\n",
|
171 | 176 | " X_min = X_val\n",
|
|
179 | 184 | "outputs": [],
|
180 | 185 | "source": [
|
181 | 186 | "# Initialize constants\n",
|
182 |
| - "BUCKET_SIZE = 32 # Number of hash functions in Flajolet-Martin algorithm\n", |
| 187 | + "BUCKET_SIZE = 32 # Number of hash functions and number of buckets the prefix lens are inserted in (HyperLogLog)\n", |
183 | 188 | "STREAM_SIZE = 100000 # Number of elements to go through\n",
|
184 | 189 | "d = 10 # Number of hash functions in Count-min sketch\n",
|
185 | 190 | "w = STREAM_SIZE # Number of possible hash values in Count-min sketch\n",
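The Count-min sketch update/query cycle that these constants parameterize can be sketched as follows (a toy stand-alone illustration with hypothetical names; a seeded md5 stands in for the seeded murmurhash3 used in the notebook, and the dimensions are shrunk):

```python
import hashlib

D, W = 4, 1000  # toy dimensions; the notebook uses d = 10 and w = STREAM_SIZE

def h(item, seed, w=W):
    # Stand-in for seeded murmurhash3: md5 over "seed:item", reduced modulo w.
    digest = hashlib.md5("{}:{}".format(seed, item).encode()).hexdigest()
    return int(digest, 16) % w

def cms_update(M, item):
    # Increment one counter per row, each row using a differently seeded hash.
    for i in range(D):
        M[i][h(item, i)] += 1

def cms_query(M, item):
    # The minimum over the D counters bounds the overcounting from collisions.
    return min(M[i][h(item, i)] for i in range(D))

M = [[0] * W for _ in range(D)]
for x in ["a", "a", "b"]:
    cms_update(M, x)
# cms_query(M, "a") estimates the count of "a" (2 here, unless "b" happens
# to collide with "a" in every one of the D rows).
```

Because collisions only ever add to a counter, the sketch can overestimate but never underestimate, which is why the minimum over rows is the right aggregate.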
|
|
233 | 238 | "* We only consider an IP that is not already tracked.\n",
|
234 | 239 | "* 1) Estimate the count of a particular IP\n",
|
235 | 240 | "* 2) Sort max_ip and max_count in order to easily compare with lowest value and insert if higher.\n",
|
236 |
| - "* 3) Add if it is higher than the total count of the current IP, or if there is not already an IP.\n", |
| 241 | + "* 3) Add if it is higher than the total count of the current IP.\n", |
237 | 242 | "\n",
|
238 | 243 | "The needed data structures (domains, M, and max_ip) are now ready for our analysis."
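The insert-if-higher bookkeeping in the steps above can be sketched as follows (a simplified stand-alone version; `update_top_x` and the `(count, ip)` tuple layout are illustrative, not the notebook's `max_ip`/`max_count` structure):

```python
def update_top_x(ip, count, top, x=3):
    # top is a list of (count, ip) pairs kept sorted ascending by count.
    if any(existing_ip == ip for _, existing_ip in top):
        return top                      # skip IPs that are already tracked
    top.append((count, ip))
    top.sort()                          # sort so the smallest count sits first
    if len(top) > x:
        top.pop(0)                      # drop the lowest once over capacity
    return top

top = []
for ip, c in [("1.1.1.1", 5), ("2.2.2.2", 9), ("3.3.3.3", 2), ("4.4.4.4", 7)]:
    top = update_top_x(ip, c, top)
# top -> [(5, '1.1.1.1'), (7, '4.4.4.4'), (9, '2.2.2.2')]
```

Keeping the list sorted makes the comparison against the current minimum a constant-time lookup at index 0.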
|
239 | 244 | ]
|
|
252 | 257 | "\n",
|
253 | 258 | "#### How many unique IPs are there in the stream?\n",
|
254 | 259 | "To answer this question, we define a function that extracts the bucket of a specific domain from the domains dictionary.\n",
|
255 |
| - "As explained in the beginning, better results are yielded when removing the top 30% of the bucket. However, when using a harmonic mean the outliers have less influence and therefore we reduced this to 20%, partially because it gave better results.\n", |
256 |
| - "The function returns 2^k, where k is the harmonic mean of the lowest 80% of the bucket." |
| 260 | + "The function returns 2^k, where k is the harmonic mean of the bucket." |
257 | 261 | ]
|
258 | 262 | },
|
259 | 263 | {
|
|
267 | 271 | " return n/np.sum([1/val for val in alist if val != 0])\n",
|
268 | 272 | "\n",
|
269 | 273 | "def get_count(domain):\n",
|
270 |
| - " last_20 = round(len(domains[domain])*0.2)\n", |
271 |
| - " bucket = sorted(domains[domain])\n", |
272 |
| - " bucket = bucket[:-last_20]\n", |
| 274 | + " bucket = [val for val in domains[domain] if val != 0]\n", |
| 275 | + " bucket.sort()\n", |
273 | 276 | " return int(round(2**harmonic_mean(bucket)))"
|
274 | 277 | ]
|
275 | 278 | },
|
|
282 | 285 | "name": "stdout",
|
283 | 286 | "output_type": "stream",
|
284 | 287 | "text": [
|
285 |
| - "The number of unique IPs are: 93892\n" |
| 288 | + "The number of unique IPs are: 87413\n" |
286 | 289 | ]
|
287 | 290 | }
|
288 | 291 | ],
|
|
307 | 310 | "name": "stdout",
|
308 | 311 | "output_type": "stream",
|
309 | 312 | "text": [
|
310 |
| - "total has 93892 unique IPs\n", |
311 |
| - "python.org has 34729 unique IPs\n", |
312 |
| - "wikipedia.org has 42035 unique IPs\n", |
313 |
| - "pandas.pydata.org has 8018 unique IPs\n", |
314 |
| - "dtu.dk has 1976 unique IPs\n", |
315 |
| - "google.com has 1875 unique IPs\n", |
316 |
| - "databricks.com has 908 unique IPs\n", |
317 |
| - "github.com has 949 unique IPs\n", |
318 |
| - "spark.apache.org has 323 unique IPs\n", |
319 |
| - "datarobot.com has 195 unique IPs\n", |
320 |
| - "scala-lang.org has 20 unique IPs\n" |
| 313 | + "total has 87413 unique IPs\n", |
| 314 | + "python.org has 33523 unique IPs\n", |
| 315 | + "wikipedia.org has 59177 unique IPs\n", |
| 316 | + "pandas.pydata.org has 12100 unique IPs\n", |
| 317 | + "dtu.dk has 2645 unique IPs\n", |
| 318 | + "google.com has 2647 unique IPs\n", |
| 319 | + "databricks.com has 1282 unique IPs\n", |
| 320 | + "github.com has 1433 unique IPs\n", |
| 321 | + "spark.apache.org has 737 unique IPs\n", |
| 322 | + "datarobot.com has 203 unique IPs\n", |
| 323 | + "scala-lang.org has 3 unique IPs\n" |
321 | 324 | ]
|
322 | 325 | }
|
323 | 326 | ],
|
|
375 | 378 | "name": "stdout",
|
376 | 379 | "output_type": "stream",
|
377 | 380 | "text": [
|
378 |
| - "The number of unique IPs visiting python.org is: 34729\n", |
379 |
| - "The number of unique IPs visiting wikipedia.org is: 42035\n", |
380 |
| - "The number of unique IPs visiting pandas.pydata.org is: 8018\n", |
381 |
| - "The number of unique IPs visiting github.com is: 949\n" |
| 381 | + "The number of unique IPs visiting python.org is: 33523\n", |
| 382 | + "The number of unique IPs visiting wikipedia.org is: 59177\n", |
| 383 | + "The number of unique IPs visiting pandas.pydata.org is: 12100\n", |
| 384 | + "The number of unique IPs visiting github.com is: 1433\n" |
382 | 385 | ]
|
383 | 386 | }
|
384 | 387 | ],
|
|
433 | 436 | },
|
434 | 437 | {
|
435 | 438 | "cell_type": "code",
|
436 |
| - "execution_count": 11, |
| 439 | + "execution_count": 13, |
437 | 440 | "metadata": {},
|
438 | 441 | "outputs": [
|
439 | 442 | {
|
|
456 | 459 | ],
|
457 | 460 | "source": [
|
458 | 461 | "print('Top', X, 'IPs and their frequency:')\n",
|
459 |
| - "for ip in max_ip:\n", |
| 462 | + "for ip in max_ip: # Not sorted\n", |
460 | 463 | " print(ip, count_ip(ip,'total'))"
|
461 | 464 | ]
|
462 | 465 | },
|
|
465 | 468 | "metadata": {},
|
466 | 469 | "source": [
|
467 | 470 | "##### Document the accuracy of your answers when using algorithms that give approximate answers\n",
|
468 |
| - "The Count-min sketch error analysis is derived from this article: http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf\n" |
| 471 | + "The Count-min sketch error analysis is derived from this article: http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf\n", |
| 472 | + "\n", |
| 473 | + "HyperLogLog accuracy:\n", |
| 474 | + "The accuracy is determined as 1/sqrt(m), where m is the number of buckets.\n", |
| 475 | + "We assume the hash function will uniformly distribute the element hash values to the 32 buckets, and we therefore set the number of buckets, m, to BUCKET_SIZE.\n", |
| 476 | + "Reference: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/40671.pdf" |
469 | 477 | ]
|
470 | 478 | },
|
471 | 479 | {
|
|
478 | 486 | "output_type": "stream",
|
479 | 487 | "text": [
|
480 | 488 | "Error added with each element added to the M matrix: 2.7182818284590452e-05\n",
|
481 |
| - "Probability of allowing a count estimate outside the error above: 4.5399929762484854e-05\n" |
| 489 | + "Probability of allowing a count estimate outside the error above: 4.5399929762484854e-05\n", |
| 490 | + "\n", |
| 491 | + "Error rate of HyperLogLog implementation: ±0.17677669529663687\n", |
| 492 | + "The variance is thus: 0.031249999999999993\n" |
482 | 493 | ]
|
483 | 494 | }
|
484 | 495 | ],
|
485 | 496 | "source": [
|
486 | 497 | "import math\n",
|
487 | 498 | "# Count-min sketch error analysis:\n",
|
488 | 499 | "print('Error added with each element added to the M matrix:', math.e/w)\n",
|
489 |
| - "print('Probability of allowing a count estimate outside the error above:', math.exp(-d))\n" |
| 500 | + "print('Probability of allowing a count estimate outside the error above:', math.exp(-d))\n", |
| 501 | + "print('')\n", |
| 502 | + "\n", |
| 503 | + "# HyperLogLog accuracy:\n", |
| 504 | + "error = 1/math.sqrt(BUCKET_SIZE)\n", |
| 505 | + "print('Error rate of HyperLogLog implementation: ±{}'.format(error))\n", |
| 506 | + "print('The variance is thus: {}'.format(error**2))" |
490 | 507 | ]
|
491 | 508 | },
|
492 | 509 | {
|
493 | 510 | "cell_type": "markdown",
|
494 | 511 | "metadata": {},
|
495 | 512 | "source": [
|
496 |
| - "We can thus conclude that with the chosen parameters for d and w, we have a very high chance of hitting the true value, i.e. a high accuracy." |
| 513 | + "We can thus conclude that with the chosen parameters for d and w, we have a very high chance of hitting the true value, i.e. a high accuracy.\n", |
| 514 | + "\n", |
| 515 | + "Our HyperLogLog implementation has a theoretical variance of ~3%." |
497 | 516 | ]
|
498 | 517 | },
|
499 | 518 | {
|
|
505 | 524 | "* **BUCKET_SIZE = 32** -- We use 32 to reduce the risk that a single string hashing to a bit value with many leading 0s skews the estimate. With a higher bucket size, the chance of a critically wrong estimate is diminished, as the harmonic mean of the buckets is used.\n",
|
506 | 525 | "* **STREAM_SIZE = 100000** -- A stream size of 100000 is used to run through a significant number of elements in the stream. \n",
|
507 | 526 | "* **d = 10** -- For the Count-min sketch, the chance of two strings hashing to the same position reduces with a higher d and w. In our case, a d of 10 gave very good estimates when compared to actual counts.\n",
|
508 |
| - "* **w = STREAM_SIZE** -- A w of 100000 means there are enough possible hash values for each single element in the stream. This combined with the d of 30 gives a total of 30 * 100000 positions in the M matrix and thus when using the Count-min sketch, a very good estimate of a count is achieved." |
| 527 | + "* **w = STREAM_SIZE** -- A w of 100000 means there are enough possible hash values for each single element in the stream. This combined with the d of 10 gives a total of 10 * 100000 positions in the M matrix and thus when using the Count-min sketch, a very good estimate of a count is achieved." |
509 | 528 | ]
|
510 | 529 | }
|
511 | 530 | ],
|
|