<!DOCTYPE html>
<!--[if lt IE 9 ]><html class="no-js oldie" lang="en"> <![endif]-->
<!--[if IE 9 ]><html class="no-js oldie ie9" lang="en"> <![endif]-->
<!--[if (gte IE 9)|!(IE)]><!-->
<html class="no-js" lang="en">
<!--<![endif]-->
<head>
<!--- basic page needs
================================================== -->
<meta charset="utf-8">
<title>Data lake - Introduction</title>
<meta name="description" content="datalake">
<meta name="author" content="Sourabh Joshi">
<!-- mobile specific metas
================================================== -->
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- CSS
================================================== -->
<link rel="stylesheet" href="css/base.css">
<link rel="stylesheet" href="css/vendor.css">
<link rel="stylesheet" href="css/main.css">
<!-- script
================================================== -->
<script src="js/modernizr.js"></script>
<script src="js/pace.min.js"></script>
</head>
<body id="top">
<!-- header
================================================== -->
<header class="s-header">
<nav class="header-nav-wrap">
<ul class="header-nav">
<li class="current"><a href="index.html#home" title="home">Home</a></li>
<li><a href="index.html#about" title="about">AboutMe</a></li>
<li><a href="index.html#works" title="works">Works</a></li>
<li><a class="current" href="blog.html" title="blog">Blog-Moolaa</a></li>
<li><a href="index.html#contact" title="contact">Contact</a></li>
</ul>
</nav>
<a class="header-menu-toggle" href="#0"><span>Menu</span></a>
</header> <!-- end s-header -->
<article class="blog-single">
<!-- page header/blog hero
================================================== -->
<div class="page-header page-header--single page-hero" style="background-image:url(images/blog/bg-05.jpeg)">
<div class="row page-header__content narrow">
<article class="col-full">
<div class="page-header__info">
<div class="page-header__cat">
<a href="#0">DataLake</a>
</div>
</div>
<h1 class="page-header__title">
<a href="#0" title="">
Data Lake - Introduction
</a>
</h1>
<ul class="page-header__meta">
<li class="date">Jul 17, 2021</li>
<li class="author">
By
<span>Sourabh Joshi</span>
</li>
</ul>
</article>
</div>
</div>
<div class="row blog-content">
<div class="col-full blog-content__main">
<p class="lead">
This article records a preliminary understanding of what a data lake is and why it is used.
</p>
<h1>
Concept of Data Lake
</h1>
<p>What is the concept of a data lake? Generally speaking, the data generated by an organization is maintained in a storage platform, which we call the <b>"data lake"</b>.</p>
<p>I personally think a data lake should be an evolving, scalable infrastructure for big data storage, processing, and analysis. To acquire, store, process in multiple modes, and manage the full life cycle of data from any source, at any scale, and of any type, it must interact and integrate with a variety of external heterogeneous source systems, making it the go-to place for any data in an organization.</p>
<p>The data sources of a lake are diverse. Some data may be structured,
some unstructured, and some may even be binary. Data can arrive in batches or as streams. Because the lake accepts data from various sources, it can preserve the original data and also record the lineage of data transformations.</p>
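<p>To make the idea of preserving original data alongside lineage more concrete, here is a minimal Python sketch. It is only an illustration under assumptions of my own: the "lake" is a plain in-memory dictionary, and the <code>land_raw</code> helper, the <code>raw/</code> key layout, and the source names are hypothetical, not part of any real data lake product.</p>

```python
import hashlib
import json
from datetime import datetime, timezone


def land_raw(lake, source, payload, parent=None):
    """Store payload bytes unchanged and record where they came from.

    `lake` is a plain dict standing in for real storage (an assumption).
    """
    digest = hashlib.sha256(payload).hexdigest()
    key = f"raw/{source}/{digest}"  # hypothetical key layout
    lake[key] = {
        "data": payload,  # original bytes, never modified
        "lineage": {
            "source": source,
            "parent": parent,  # key of the record this one was derived from
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        },
    }
    return key


lake = {}

# Land the original payload exactly as it arrived.
raw_key = land_raw(lake, "orders_api", b'{"id": 1, "amount": "12.50"}')

# Derive a cleaned record and link it back to the raw one.
record = json.loads(lake[raw_key]["data"])
record["amount"] = float(record["amount"])
clean_key = land_raw(
    lake, "orders_clean", json.dumps(record).encode(), parent=raw_key
)
```

<p>The raw entry stays untouched, and the cleaned entry points back to it through <code>parent</code>, so a transformation can be traced to the data it started from.</p>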
<p>
Data engineers or transformation engineers stand at the entrance of the lake,
using equipment to check the water quality and pump water out of the lake.<br>
The lake can also serve as a staging area for the data warehouse.
</p>
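<p>Checking the "water quality" before data moves on to the warehouse could look roughly like the Python sketch below. The <code>check_quality</code> function, the rule it applies (required fields must be present and non-empty), and the sample rows are all assumptions made for illustration.</p>

```python
def check_quality(rows, required):
    """Split rows into (passed, rejected) by required, non-empty fields."""
    passed, rejected = [], []
    for row in rows:
        # A field counts as missing when it is absent, None, or "".
        missing = [name for name in required if row.get(name) in (None, "")]
        if missing:
            rejected.append(row)
        else:
            passed.append(row)
    return passed, rejected


# Hypothetical staged rows waiting to be loaded into the warehouse.
rows = [
    {"id": 1, "amount": 12.5},
    {"id": 2, "amount": None},
    {"amount": 3.0},
]

passed, rejected = check_quality(rows, required=["id", "amount"])
```

<p>Only the rows in <code>passed</code> would be pumped out to the warehouse; the rejected rows remain in the lake for inspection.</p>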
<p>
Data scientists or Analysts use the lake for discovery and ideation. They extract value from the data lake through machine learning.
</p>
<img src="images/blog/bg-03.png" alt="pump" align="middle" width="1000" height="600">
<p>In summary, the data lake has four main characteristics:</p>
<h3>Store raw data</h3>
<p>Raw data comes from many kinds of sources:</p>
<ul>
<li>Structured data</li>
<li>Semi-structured data</li>
<li>Unstructured data</li>
<li>Binary data (pictures, videos, etc.)</li>
</ul>
<h3>Support multiple computing models</h3>
<ul>
<li>Batch processing</li>
<li>Stream computing</li>
<li>Interactive analysis</li>
<li>Machine learning</li>
</ul>
<h3>Data management capabilities</h3>
<ul>
<li>Connect to multiple data sources with different access times</li>
<li>Support schema management</li>
<li>Support permission management</li>
</ul>
<h3>Flexible underlying storage</h3>
<ul>
<li>Generally uses low-cost distributed storage such as S3, OSS, or HDFS</li>
<li>Supports the Parquet, Avro, and ORC file formats</li>
<li>Supports data cache acceleration</li>
</ul>
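<p>The difference between the batch and stream computing models listed above can be shown with a small Python sketch. It is a toy example under my own assumptions: the event amounts are made up, and real engines work on far larger, distributed data.</p>

```python
# Hypothetical event amounts; in a lake these would be stored records.
events = [5, 3, 7]


def batch_total(stored_events):
    """Batch model: all data is already stored, compute one final answer."""
    return sum(stored_events)


def stream_totals(incoming_events):
    """Stream model: update and emit the result as each event arrives."""
    total = 0
    for amount in incoming_events:
        total += amount
        yield total


final_total = batch_total(events)
running_totals = list(stream_totals(iter(events)))
```

<p>Both models end at the same total, but the stream version yields an up-to-date answer after every event instead of waiting for the full data set.</p>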
<p style="font-family: 'Courier New', monospace;font-size: 50px;">LEARN, SHARE AND GROW</p>
</div>
</div>
</article>
<footer>
<div class="row footer-bottom">
<div class="col-twelve">
<div class="copyright">
<span>© Copyright Hola 2021</span>
<span>Design by <a href="https://www.styleshout.com/">styleshout</a></span>
</div>
<div class="go-top">
<a class="smoothscroll" title="Back to Top" href="#top"><i class="im im-arrow-up" aria-hidden="true"></i></a>
</div>
</div>
</div> <!-- end footer-bottom -->
</footer> <!-- end footer -->
<div id="preloader">
<div id="loader"></div>
</div>
<!-- Java Script
================================================== -->
<script src="js/jquery-3.2.1.min.js"></script>
<script src="js/plugins.js"></script>
<script src="js/main.js"></script>
</body>
</html>