As a marketer, you likely have a rather extensive vocabulary when writing for your industry, and you’re always hot on the trail of any new marketing trends or buzzwords. We know this to be true because here at Hurree, we’re obsessed with marketers and how they work. More specifically, we’re obsessed with making their lives easier.
So we thought, why not help out with one of the most essential and most jargon-heavy elements of marketing today: big data. It’s something that, until recently, many marketers may not have known much about, but that they now urgently need to understand.
That’s why we’ve created this bumper list of big data terminology that every marketer should know, from beginner-level phrases to highly technical definitions. It’s a handy guide to take with you on your travels throughout the big bad world of data-driven marketing so that you always have a marketer-friendly explanation for big data terms.
So, let's get started...
Big Data Terminology: Definitions Every Marketer Should Know
1. Abstraction layer
A translation layer that transforms high-level requests into low-level functions and actions. Data abstraction keeps only the essential details needed to perform a function and hides the complex, unnecessary detail within the system. The client never sees that complexity; it is presented with a simplified representation instead.
A typical example of an abstraction layer is an API (application programming interface) between an application and an operating system.
2. API
API is an acronym for Application Programming Interface, a software connection between computers or computer programs. APIs are not databases or servers but rather the code and rules that allow access to and sharing of information between servers, applications, etc.
3. Aggregation
Data aggregation refers to the process of collecting data and presenting it in a summarised format. The data can be gathered from multiple sources to be combined for a summary.
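To make aggregation concrete, here is a minimal Python sketch. The campaign names, click counts, and the `aggregate_clicks` helper are all made up for illustration; the point is only to show records from two sources being combined into one summary.

```python
from collections import defaultdict

# Hypothetical click records gathered from two made-up sources.
email_clicks = [("newsletter", 120), ("promo", 80)]
social_clicks = [("promo", 40), ("newsletter", 30), ("launch", 55)]


def aggregate_clicks(*sources):
    """Combine records from several sources into one total per campaign."""
    totals = defaultdict(int)
    for source in sources:
        for campaign, clicks in source:
            totals[campaign] += clicks
    return dict(totals)


summary = aggregate_clicks(email_clicks, social_clicks)
print(summary)
```

The summary holds one total per campaign, regardless of which source each record came from.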
4. Algorithm
In computer science, an algorithm is a set of well-defined rules that solve a mathematical or computational problem when implemented. Algorithms are used to carry out calculations, data processing, machine learning, search engine optimisation, and more.
5. Analytics
Systems and techniques of computational analysis and interpretation of large amounts of data or statistics. Analytics are used to derive insights, spot patterns, and optimise business performance.
76% of marketing leaders say they base decisions on analytics - Gartner
6. Application
An application is any computer software or program designed to be used by end-users to perform specific tasks. Applications or apps can be desktop, web, or mobile-based.
7. Avro
Avro (or Apache Avro) is an open-source data serialisation system.
8. Binary classification
Binary classification is a technique used to sort each element of a set into one of two groups based on classification rules. For example, binary classification techniques are used to determine whether or not a disease is present in medical data. In computing, they can determine whether a piece of content should be included in search results based on its relevance or value to the users.
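As a rough illustration, the simplest possible classification rule is a single threshold. In the Python sketch below, the relevance scores, page names, and the 0.5 cut-off are all invented for the example; real classifiers usually learn their rules from data.

```python
def is_relevant(score, threshold=0.5):
    """Classify one item into one of two groups: relevant (True) or not (False)."""
    return score >= threshold


# Hypothetical relevance scores for three pieces of content.
scores = {"pricing-page": 0.91, "old-press-release": 0.12, "blog-post": 0.5}

# Apply the same rule to every item.
labels = {page: is_relevant(score) for page, score in scores.items()}
print(labels)
```

Every item ends up in exactly one of the two groups, which is what makes the classification binary.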
9. Business intelligence
Business intelligence is a process of collecting and preparing internal and external data for analysis; this often includes data visualisation techniques (graphs, pie charts, scatter plots, etc.) presented on business intelligence dashboards. By harnessing business intelligence, organisations can make faster, more informed business decisions.
Source: MPercent Academy
10. Byte
In computing, a byte is a unit of data that is eight binary digits (bits) long. The byte is the basic unit of memory size; a single bit is the smallest unit of data, but a byte is usually the smallest unit of storage that a computer can address. Because a byte is so small, in computing we usually refer to gigabytes (GB, one billion bytes) and terabytes (TB, one trillion bytes).
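The arithmetic behind these units can be checked with a few lines of Python. This is only a sketch using the decimal definitions given above; the constant names are our own.

```python
BITS_PER_BYTE = 8
GIGABYTE = 10**9   # one billion bytes
TERABYTE = 10**12  # one trillion bytes

# Eight bits can represent 2^8 different values.
values_per_byte = 2**BITS_PER_BYTE

# The letter "A" takes up a single byte when encoded as UTF-8.
letter_size = len("A".encode("utf-8"))

# How many gigabytes fit into one terabyte.
gigabytes_per_terabyte = TERABYTE // GIGABYTE

print(values_per_byte, letter_size, gigabytes_per_terabyte)
```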
11. C
C is one of the oldest programming languages around. Despite its age, it continues to be one of the most prevalent, as it powers large parts of operating systems like Microsoft Windows and macOS.
12. CPU
This acronym stands for Central Processing Unit. A CPU is often referred to as the brains of a computer - you will find one in your phone, smartwatch, tablet, etc. Despite being one of many processing systems within a computer, a CPU is vitally important as it controls the ability to perform calculations, take actions and run programs.
13. Cascading
Cascading is a type of software designed for use with Hadoop for the creation of data-driven applications. Cascading software creates an abstraction layer that enables complex data processing workflows and masks the underlying complexity of MapReduce processes.
14. Cleaning data
Cleaning data improves data quality by removing errors, corruptions, duplications, and formatting inconsistencies from datasets.
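A small Python sketch shows what this can look like in practice. The email list and the `clean_emails` helper are hypothetical, and the "contains an @" check is a deliberately crude stand-in for real validation.

```python
def clean_emails(raw_values):
    """Fix formatting, drop obvious errors, and remove duplicates."""
    seen = set()
    cleaned = []
    for value in raw_values:
        email = value.strip().lower()  # fix formatting inconsistencies
        if "@" not in email:           # drop entries that are clearly errors
            continue
        if email in seen:              # remove duplicates
            continue
        seen.add(email)
        cleaned.append(email)
    return cleaned


# Hypothetical raw sign-up data.
raw_emails = [" Ana@Example.com", "ana@example.com ", "not-an-email", "bo@example.com"]
cleaned_emails = clean_emails(raw_emails)
print(cleaned_emails)
```

Four messy entries become two clean ones: the duplicate and the invalid entry are gone, and the formatting is consistent.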
15. Cloud
Cloud technology, or The Cloud as it is often referred to, is a network of servers that users access via the internet and the applications and software that run on those servers. Cloud computing has removed the need for companies to manage physical data servers or run software applications on their own devices - meaning that users can now access files from almost any location or device.
The cloud is made possible through virtualisation - a technology that mimics a physical server in virtual, digital form, also known as a virtual machine.
16. Command
In computing, a command is a direction sent to a computer program ordering it to perform a specific action. Commands can be issued through command-line interfaces, via a network service protocol, or as an event in a graphical user interface.
17. Computer architecture
Computer architecture specifies the rules, standards, and formats of the hardware and software that makes up a computer system or platform. The architecture acts as a blueprint for how a computer system is designed and what other systems it is compatible with.
18. Connected devices
Physical objects that connect with each other and with other systems via the internet. Connected devices are most commonly monitored and controlled remotely by mobile applications, for example over Bluetooth, WiFi, LTE or a wired connection.
19. Data access
Data access is the ability to access, modify, move or copy data on demand and on a self-service basis. The term applies specifically to IT systems, where the data may be sensitive and accessing it may require authentication and authorisation from the organisation that holds it.
There are two forms of data access:
- Random access
- Sequential access
20. Data capture
Data capture refers to collecting information from either paper or electronic documents and converting it into a format that a computer can read. Data capture can be automated to reduce the need for manual data entry and accelerate the process.
21. Data ingestion
Data ingestion is the process of moving data from various sources into a central repository such as a data warehouse where it can be stored, accessed, analysed, and used by an organisation.
22. Data integrity
The practice of ensuring data remains accurate, valid and consistent throughout the entire data life cycle. Data integrity incorporates logical integrity (a process) and physical integrity (a state).
23. Data lake
A data lake is a centralised repository that stores vast amounts of raw data - data that has not been prepared, processed, or manipulated to fit a particular schema. Data lakes house both structured and unstructured data and use an ‘on-read’ schema during data analysis.
24. Data management
Data management is an overarching strategy of data use that guides organisations to collect, store, analyse and use their data securely and cost-effectively via policies and regulations.
25. Data processing
The process of transforming raw data into a format that can be read by a machine or, in other words, turning data into something usable. Once processed, businesses can use data to glean insights and make decisions.
26. Data serialisation
Data serialisation is a data translation process that converts complex or large data structures or object states into formats that can be more easily stored, transferred and distributed. Once the serialised byte sequence has been stored or transferred, it can be used to create an identical clone of the original - a process known as deserialisation.
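Here is a minimal sketch of that round trip in Python, using JSON as the serialisation format (one option among many; Avro is another). The customer profile is invented for the example.

```python
import json

# A hypothetical customer profile held in memory as a Python object.
profile = {"name": "Ana", "tags": ["vip", "newsletter"], "visits": 3}

# Serialisation: turn the object into a byte sequence that can be stored or sent.
serialised = json.dumps(profile).encode("utf-8")

# Deserialisation: rebuild an identical copy from those bytes.
restored = json.loads(serialised.decode("utf-8"))

print(type(serialised).__name__, restored == profile)
```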
27. Data storage
Refers to collecting and recording data to be retained for future use on computers or other devices. In its most common form, data storage occurs in three ways: file storage, block storage, and object storage.
Data storage on endpoint devices is projected to plummet by 2024 (despite the advent of super-fast 5G networks) as organisations move data storage to in-house and cloud data centres - Datanami
28. Data tagging
Data tagging is a type of categorisation process that allows users to better organise types of data (websites, blog posts, photos, etc.) using tags or keywords.
29. Data visualisation
This process sees large amounts of data translated into visual formats such as graphs, pie charts, scatter charts, etc. Visualisations can be better understood by the human brain and accelerate the rate of insight retrieval for organisations.
30. Data warehouse
A centralised repository of information that enterprises can use to support business intelligence (BI) activities such as analytics. Data warehouses typically integrate historical data from various sources.
31. Decision trees
Decision trees are visual representations of processes and options that help machines make complex predictions or decisions when faced with many choices and outcomes. Decision trees are directed acyclic graphs made up of branch nodes, edges, and leaf nodes with all data flowing in one direction.
Source: Edureka!
32. Deep learning
Deep learning is a function of artificial intelligence and machine learning that mimics the processes of the human brain to make decisions, process data, and create patterns. It can be used to process huge amounts of unstructured data that would take human brains years to understand. Deep learning algorithms can recognise objects and speech, translate languages, etc.
33. ETL
An acronym used to describe a process within data integration: Extract, Transform and Load.
34. ELT
An acronym used to describe a process within data integration: Extract, Load, and Transform.
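The difference between the two acronyms is simply the order of the last two steps. The Python sketch below uses made-up rows and hypothetical `extract`, `transform`, `run_etl` and `run_elt` helpers, with a plain list standing in for the destination system.

```python
def extract():
    """Extract: pull raw rows from a (made-up) source system."""
    return [{"name": " ana ", "spend": "10.50"}, {"name": "BO", "spend": "7"}]


def transform(rows):
    """Transform: tidy names and convert spend from text to a number."""
    return [
        {"name": row["name"].strip().title(), "spend": float(row["spend"])}
        for row in rows
    ]


def run_etl():
    """ETL: transform the data first, then load it into the destination."""
    destination = []
    destination.extend(transform(extract()))
    return destination


def run_elt():
    """ELT: load the raw data first, then transform it inside the destination."""
    destination = []
    destination.extend(extract())
    destination[:] = transform(destination)
    return destination


etl_result = run_etl()
elt_result = run_elt()
print(etl_result == elt_result)
```

Both routes end with the same tidy data; what changes is where the transformation work happens.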
35. Encoding
In computing, encoding refers to assigning numerical values to categories. For example, male and female would be encoded to be represented by 1 and 2.
There are two main types of encoding:
36. Fault tolerance
The term fault tolerance describes the ability of a system, for example, a computer or a cloud cluster, to continue operating uninterrupted despite one or more of its components failing.
Fault tolerance is developed to ensure a high level of availability and that no business is impacted by a loss of critical systems or continuity. Fault tolerance is achieved by utilising backup components in hardware, software, and power solutions.
37. Flume
Flume is open-source software that facilitates the collecting, aggregating and moving of huge amounts of unstructured, streaming data such as log data and events. Flume has a simple and flexible architecture, moving data from various servers to a centralised data store.
38. GPS
GPS is an acronym for Global Positioning System, which is a navigation system that uses data from satellites and algorithms to synchronise location, space, and time data. GPS utilises three key segments: satellites, ground control, and user equipment.
39. Granular Computing (GrC)
An emerging concept and technique of information processing within big data, granular computing sees data divided down into information granules or ‘collection of entities’ as it is referred to. The point of this division is to discover whether data is different on a granular level.
40. GraphX
An API from Apache Spark that is used for graphs and graph-parallel computing. GraphX facilitates faster, more flexible data analytics.
41. HCatalog
In its simplest form, HCatalog exists to provide an interface between Apache Hive, Pig and MapReduce. Since all three data processing tools have different systems for processing data, HCatalog ensures consistency. HCatalog supports users reading and writing on the grid in any format for which a SerDe (serialiser-deserialiser) can be written.
42. Hadoop
Hadoop is an open-source software framework of programs and procedures that are commonly used as the backbone for big data development projects. Hadoop is made up of 4 modules, each with its own distinct purpose:
- Distributed-File System - allows data to be easily stored in any format across a large number of storage devices.
- MapReduce - reads and translates data into the right format for analysis (map) and carrying out mathematical calculations (reduce).
- Hadoop Common - provides the baseline tools needed for users' systems, e.g. Windows, to retrieve data from Hadoop.
- YARN - a management module that handles the systems that carry out storage and analysis.
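To give a feel for the map and reduce steps described in the list above, here is a toy word count in plain Python. It is only a sketch of the idea, not Hadoop's actual API: real MapReduce jobs run across many machines, and the `map_phase` and `reduce_phase` names and the sample documents are our own.

```python
from collections import defaultdict


def map_phase(documents):
    """Map: turn each document into (word, 1) pairs ready for counting."""
    for document in documents:
        for word in document.lower().split():
            yield (word, 1)


def reduce_phase(pairs):
    """Reduce: add up the counts for each word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)


# Hypothetical input documents.
documents = ["big data", "Big marketing data", "data"]
word_counts = reduce_phase(map_phase(documents))
print(word_counts)
```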
43. Hardware
Hardware is the physical component of any computer system, for example, the wiring, circuit board, monitor, keyboard, mouse, desktop, etc.
44. High dimensionality
In statistics, dimensionality refers to how many attributes a dataset has. Thus, high dimensionality refers to a dataset with an exceedingly large number of attributes. When high-dimensional data occurs, calculations become extremely difficult because the number of features outweighs the number of observations.
Website analysis (e.g. ranking, advertising and crawling) is a good example of high dimensionality.
45. Hive
Hive is an open-source data warehouse software system that allows developers to carry out advanced work on the Hadoop Distributed File System (HDFS) and MapReduce. Hive makes working with these tools easier by facilitating the use of a simpler Hive Query Language (HQL), thus reducing the need for developers to know or write complex Java code.
46. Information retrieval (IR)
A software program that handles the organisation, storage, and retrieval of information, usually of a text-based format, from large documentation repositories. A simple example of IR is search engine queries that we all carry out on Google.
47. Integration
Integration is the process of combining data from multiple disparate sources to achieve a unified view of the data for easier, more valuable operations or business intelligence.
There are five main forms of data integration:
- Data warehouse
- Uniform access.
48. Internet of things (IoT)
The internet of things (IoT) refers to an ecosystem of physical objects that are connected to the internet and generate, collect, and share data. With advancing technologies enabling smaller and smaller microchips, the IoT has transformed previously ordinary objects into smart devices that can submit insights without the need for human interaction.
49. Java
Java is a high-level programming language that is specifically designed to have as few implementation dependencies as possible. The name also refers to the computing platform on which Java programs run. Java is widely regarded as fast, secure, and reliable.
50. Latency
Data latency refers to the time it takes for a data query to be fully processed by a data warehouse or business intelligence platform. There are three main types of data latency: zero-data latency (real-time), near-time data latency (batch consolidation), and some-time data latency (data is only accessed and updated when needed).
51. Machine learning
Machine learning is a branch of artificial intelligence in which computers automatically assess problems and configure algorithmic models to solve them without the need for human interaction.
52. Mining
Mining, or data mining as it is commonly known, refers to the practice of using computer programs to identify patterns, trends and anomalies within large amounts of data and using these findings to predict future outcomes.
53. NoSQL
NoSQL is also referred to as non-SQL or not-only SQL. It is a database design approach that extends storage and querying capabilities beyond what is possible from the traditional tabular structures found in a regular relational database.
Instead, many NoSQL databases store each record as a JSON-style document that houses the data within one structure. This is a non-relational design that can handle unstructured data as it does not require a fixed schema.
54. Non-relational database
A database system that does not use the tabular system of rows and columns.
55. Neural networks
A set of algorithms that work to recognise relationships between huge sets of data by mimicking the processes of the brain. The word neural refers to neurons in the brain which act as information messengers.
Neural networks automatically adapt to change without the need to redesign their algorithms and thus have been widely taken up in the design of financial trading software.
56. Open-source
Open-source refers to the availability of certain types of code to be used, redistributed and even modified for free by other developers. This decentralised software development model encourages collaboration and peer production.
57. Pattern recognition
One of the cornerstones of computer science, pattern recognition uses algorithms and machine learning to identify patterns in large amounts of data.
58. Pig
Pig is a high-level scripting language that is used to create programs that run on Hadoop.
59. Pixel
Pixels are small pieces of HTML code that are used to track users' behaviours online, for example, when they visit a website or open an email.
60. Programming language
A programming language is a formal language made up of sets of strings that instruct a computer to perform specific tasks. Programmers use languages to develop applications. There are numerous programming languages; two of the most common are Python and Java.
61. Python
Python is a high-level programming language with dynamic semantics used to develop applications at a rapid pace. Python prioritises readability, making it easy to learn and cheaper to use due to a lessened need for program maintenance.
62. Query
In computing, a query is a request for information or a question directed toward a database. Queries are often written in SQL (structured query language), and the queried data may be returned as tables or as data visualisations such as graphs, pictorial representations, etc.
63. R
R is a free software environment for statistical computing and graphics.
64. RAM
An acronym for Random Access Memory, which essentially refers to the short-term memory of a computer. RAM stores all of the information that a computer may need in the present and near future; this is everything currently running on a device, for example any web browser in use or a game that you’re currently playing.
RAM’s fast-access capabilities make it beneficial for short-term storage, unlike a hard drive, which is slower but preferred for long-term storage.
65. Relational database
A relational database exists to house and identify data items that have pre-defined relationships with one another. Relational databases can be used to gain insights into data in relation to other data via sets of tables with columns and rows. In a relational database, each row in the table has a unique ID referred to as a key.
66. SQL
SQL stands for Structured Query Language and is used to communicate with a database. SQL is the standard language used for a relational database.
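For a quick taste of SQL in action, the sketch below uses Python's built-in `sqlite3` module to create a tiny relational table in memory and query it. The `customers` table and its rows are invented for the example.

```python
import sqlite3

# An in-memory database that disappears when the connection closes.
connection = sqlite3.connect(":memory:")

# Each row gets a unique ID (the key), as in any relational table.
connection.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
)
connection.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Ana", "UK"), ("Bo", "IE"), ("Cy", "UK")],
)

# The query: ask the database for the names of all UK customers.
rows = connection.execute(
    "SELECT name FROM customers WHERE country = ? ORDER BY name", ("UK",)
).fetchall()
uk_names = [row[0] for row in rows]
connection.close()

print(uk_names)
```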
67. Scalability
Scalability in databases refers to the ability to accommodate rapidly changing data processing needs. Scalability concerns both rapid increases in data (scaling up) and decreases in demand for data processing (scaling down). Scalability ensures that the rate of processing is consistent despite the volume of data being handled.
68. Schema on-read
A method of data analysis that applies a schema to data sets as they are extracted from a database rather than when they are pulled into that database. A data lake applies an on-read schema, allowing it to house unstructured data.
69. Schema on-write
A method of data analysis that applies a schema to data sets as they are ingested into a database. A data warehouse uses an on-write schema, meaning that data is transformed into a standardised format for storage and is ready for analysis.
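The contrast between schema on-write and schema on-read can be sketched in a few lines of Python. Everything here is a simplified assumption: plain lists stand in for the warehouse and the lake, and the two-field schema (`name`, `age`) and the helper names are made up.

```python
import json


def write_with_schema(store, record):
    """Schema on-write: force the record into the schema before storing it."""
    store.append({"name": str(record["name"]), "age": int(record["age"])})


def read_with_schema(raw_lines):
    """Schema on-read: keep the raw data untouched, apply the schema when reading."""
    for line in raw_lines:
        record = json.loads(line)
        yield {"name": str(record.get("name", "")), "age": int(record.get("age", 0))}


# Warehouse-style storage: data is standardised on the way in.
warehouse = []
write_with_schema(warehouse, {"name": "Ana", "age": "34"})

# Lake-style storage: raw text is stored as-is, extra fields and all.
lake = ['{"name": "Bo", "age": "29", "extra": "kept in the raw data"}']
lake_rows = list(read_with_schema(lake))

print(warehouse, lake_rows)
```

Note that the raw line in the lake still contains its extra field after reading; the schema only shapes the view of the data, not the stored copy.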
70. Semi-structured data
Semi-structured data does not reside in a relational database (rows and columns); however, it still has some form of organisational formatting that enables it to be more easily processed, such as semantic tags.
71. Software
The opposite of hardware, software is a virtual set of instructions, codes, data, or programs used to perform operations via a computer.
72. Spark
Spark is a data processing and analysis framework that can quickly perform processing tasks on very large data sets and distribute those tasks across multiple computers.
Spark’s architecture consists of two main components:
- Drivers - convert the user’s code into tasks to be distributed across worker nodes
- Executors - run on those nodes and carry out the tasks assigned to them
73. Structured data
Data that can be formatted into rows and columns, and whose elements can be mapped into clear, pre-defined fields. Typical examples of structured data are names, addresses, telephone numbers, geolocations, etc.
74. Unstructured data
Unstructured data does not have a pre-defined structure or data model and is not organised in a predefined format. Examples include images, video files, audio files, etc.
75. User Interface (UI)
A user interface or UI is the point of human-computer interaction: the display screens at the front-end of applications that mask the code working behind the scenes. A user interface is designed with usability in mind to ensure that any user can easily understand and navigate it, as this impacts user experience.
76. Variety
Part of the 4Vs of big data, variety refers to the huge range of formats that data can now exist in.
77. Velocity
Part of the 4Vs of big data, velocity refers to the rapid speed at which large amounts of data are generated and processed.
78. Veracity
Part of the 4Vs of big data, veracity refers to the trustworthiness of big data in terms of integrity, accuracy, privacy, etc.
79. Volume
Part of the 4Vs of big data, volume refers to the huge amount of data being generated globally each day.
80. Workflow
A data science workflow defines the phases or steps to be carried out to complete a development project. In data-driven business fields, workflows are also used and referred to in terms of automating processes, marketing or sales campaigns, or internal communications.
Big data is a vast and complex field that is constantly evolving, and for that reason, it’s important to understand the basic terms and the more technical vocabulary so that your marketing can evolve with it.
Now go forth and flaunt your new knowledge to impress your colleagues and improve your content.