Blog

  • Big Data File Formats

    Silent Reading Time: 5 mins 44 secs

    This post summarises three major Big Data file formats used primarily with the Hadoop Distributed File System (HDFS).

    Why do we need different file formats?

    A key consideration for HDFS-enabled applications, such as MapReduce or Spark, is the time it takes to find data in one location and write it to another. Changing schemas and storage considerations complicate this further.

    Big Data storage costs tend to be higher because files are stored redundantly. Add in the processing costs, and the need to scale all of this as your data grows, and your choice of file format can become a pretty big deal.

    Read More >
  • What is Git?

    Silent Reading Time: 10 mins 15 secs

    Git is free and open source software for distributed version control: it tracks changes to any set of files across multiple machines and users. It is usually used to coordinate work among programmers who are collaboratively developing source code.

    Read More >
  • Creating my Blog

    Silent Reading Time: 9 mins 21 secs

    In my current role I lead an agile chapter for Data Engineers focussing on data modelling and design. In that role I started writing and sharing short blogs and decided one of them could be about setting up a blog using GitHub Pages and Jekyll.

    So here we go…

    Read More >
  • What is BigQuery?

    Silent Reading Time: 2 mins 5 secs

    Checking the GCP documentation you would see:

    BigQuery is a fully managed enterprise data warehouse … BigQuery’s serverless architecture lets you use SQL queries to answer your organisation’s biggest questions with zero infrastructure management. BigQuery’s scalable, distributed analysis engine lets you query terabytes in seconds and petabytes in minutes.

    Read More >