-
Big Data File Formats
Silent Reading Time: 5 mins 44 secs
This blog summarises three major Big Data file formats used primarily with the Hadoop Distributed File System (HDFS).
Why do we need different file formats?
A key consideration for HDFS-enabled applications, such as MapReduce or Spark, is the time it takes to find data in one location and then to write the data to another location. This is further complicated by changing schemas and storage considerations.
Big Data storage costs tend to be higher because files are stored redundantly. Add in the processing costs, plus the requirement to scale all of this as your data grows, and your choice of file format can become a pretty big deal.
-
What is Git?
Silent Reading Time: 10 mins 15 secs
Git is free and open source software for distributed version control: it tracks changes to any set of files across multiple machines and users. It is usually used to coordinate work among programmers collaboratively developing source code.
-
Creating my Blog
Silent Reading Time: 9 mins 21 secs
In my current role, I lead an agile chapter for Data Engineers focussing on data modelling and design. As part of that role I started writing and sharing short blogs, and decided one of them could be about setting up a blog using GitHub Pages and Jekyll.
So here we go…
-
What is BigQuery?
Silent Reading Time: 2 mins 5 secs
If you check the GCP documentation, you will see:
BigQuery is a fully managed enterprise data warehouse … BigQuery’s serverless architecture lets you use SQL queries to answer your organisation’s biggest questions with zero infrastructure management. BigQuery’s scalable, distributed analysis engine lets you query terabytes in seconds and petabytes in minutes.