Blog
Big Data File Formats
Silent Reading Time: 5 mins 44 secs
This blog summarises three major file formats for Big Data, primarily used in the Hadoop Distributed File System (HDFS).
Why do we need different file formats?
A key consideration for HDFS-enabled applications, such as MapReduce or Spark, is the time it takes to read data from one location and write it to another. This is further complicated by changing schemas and storage considerations.
Big Data storage costs tend to be high because files are stored redundantly; add in the processing costs and the need to scale all of this as your data grows, and your choice of file format becomes a pretty big deal.
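To make the trade-off concrete, here is a minimal sketch in plain Python (no Hadoop required) of the two layouts that distinguish these formats: row-oriented storage, as used by formats like Avro, versus column-oriented storage, as used by Parquet and ORC. The sample data and field names are invented for illustration.

```python
# Illustrative sample records -- not real data.
rows = [
    {"id": 1, "city": "Leeds", "sales": 120},
    {"id": 2, "city": "York", "sales": 95},
    {"id": 3, "city": "Leeds", "sales": 210},
]

# Row-oriented layout: each record is stored together.
# Good for writing whole records and reading them back one at a time.
row_store = [tuple(r.values()) for r in rows]

# Column-oriented layout: each field is stored together.
# An analytic query that only needs "sales" can skip the other columns.
col_store = {key: [r[key] for r in rows] for key in rows[0]}

# Summing one column touches a single list, not every record.
total_sales = sum(col_store["sales"])
print(total_sales)  # 425
```

This is why columnar formats tend to win for analytic workloads: a query scanning one field out of hundreds reads only that field's bytes, which also compress better because values of the same type sit next to each other.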
Read More >
What is Git?
Silent Reading Time: 10 mins 15 secs
Git is free and open source software for distributed version control: meaning, it tracks changes to any set of files across multiple different machines and users. It is usually used for coordinating work among programmers collaboratively developing source code during software development.
Read More >
Creating my Blog
Silent Reading Time: 9 mins 21 secs
In my current role I lead an agile chapter for Data Engineers focussing on data modelling and design. As part of that role I started writing and sharing short blogs, and decided one of them could be about setting up a blog using GitHub Pages and Jekyll.
So here we go…
Read More >
What is BigQuery?
Silent Reading Time: 2 mins 5 secs
Checking the GCP documentation, you would see:
BigQuery is a fully managed enterprise data warehouse … BigQuery’s serverless architecture lets you use SQL queries to answer your organisation’s biggest questions with zero infrastructure management. BigQuery’s scalable, distributed analysis engine lets you query terabytes in seconds and petabytes in minutes.