A data engineer designs, builds, and maintains the data infrastructure that holds your enterprise’s advanced analytics capabilities together.
A data engineer builds and maintains the data architecture of a data science project and, more broadly, the analytics infrastructure that enables almost every other function in the data world. This covers the development, construction, testing, and maintenance of architectures such as databases and large-scale processing systems. As part of this work, data engineers also create the data set processes used in modeling, mining, acquisition, and verification.
Data engineering is a software engineering practice focused on designing, developing, and productionizing data processing systems. Data processing covers all the practical aspects of data handling: acquisition, transfer, transformation, and storage, whether on-premises or in the cloud. In many cases, the data qualifies as Big Data.
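To make the acquisition, transformation, and storage steps described above more concrete, here is a minimal sketch using only the Python standard library. The sample data, the `payments` table, and the function names are all hypothetical, chosen purely for illustration; a production pipeline would typically read from real sources and write to a durable store rather than an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for an acquired data source.
RAW_CSV = """user_id,amount,currency
1, 10.50 ,usd
2,3.20,USD
3,,usd
"""

def acquire(raw_text):
    """Acquisition: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transformation: drop rows with no amount and normalize types."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["user_id"]), float(amount), row["currency"].strip().upper())
        )
    return cleaned

def store(records, conn):
    """Storage: write the cleaned records to a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(user_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", records)
    conn.commit()

def run_pipeline(raw_text, conn):
    """Run acquire -> transform -> store and return the stored rows."""
    store(transform(acquire(raw_text)), conn)
    query = "SELECT user_id, amount, currency FROM payments ORDER BY user_id"
    return conn.execute(query).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database, assumed for the demo
    print(run_pipeline(RAW_CSV, conn))
```

Keeping each stage in its own function means any one of them can be tested or replaced (for example, swapping the CSV reader for an API client) without touching the others.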
Gartner’s Definition of Big Data
Gartner analyst Doug Laney defined three dimensions of data growth challenges: increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources).
In 2012, Gartner updated its definition as follows: “Big data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”
Volume
The data accumulated in many organizations reaches hundreds of terabytes, approaching the petabyte level.
Variety
Big Data comes in many structured formats as well as unformatted (unstructured) ones, and in various types such as text, audio, voice, VoIP, images, video, e-mails, web traffic log file entries, sensor byte streams, etc.
Velocity
A high-traffic online banking website can generate hundreds of transactions per second (TPS), each of which may need to undergo fraud detection analysis in real or near-real time.
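As a rough illustration of what this implies for processing, the sketch below estimates how much time a fraud check may spend on each transaction before a backlog builds up. The function name, the figure of 500 TPS, and the pool of 8 parallel workers are assumptions made for the example, and the calculation ignores bursts and queueing effects.

```python
def per_transaction_budget_ms(tps, workers):
    """Average time (in ms) one worker can spend per transaction
    without falling behind, assuming load is spread evenly."""
    if tps <= 0 or workers <= 0:
        raise ValueError("tps and workers must be positive")
    # Each worker handles tps / workers transactions per second,
    # so it has workers / tps seconds for each one.
    return 1000.0 * workers / tps

if __name__ == "__main__":
    # Hypothetical load: 500 TPS shared across 8 parallel fraud-check workers.
    print(per_transaction_budget_ms(tps=500, workers=8))
```

Under these assumptions, each check has an average budget of about 16 milliseconds, which is why velocity tends to push such analysis toward streaming or in-memory processing rather than batch jobs.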
