# Why Cloud Platforms Matter for Data Analysts
Cloud platforms have fundamentally changed how data analysts work. Instead of managing on-premises servers, analysts today use cloud infrastructure to store petabytes of data, run distributed queries in seconds, build scalable pipelines, and share interactive dashboards with global teams — all without touching hardware. Understanding the major cloud providers and their data services has become as essential as knowing SQL.
This guide covers the three dominant platforms — AWS, Google Cloud, and Azure — along with the core services analysts use daily, cost management principles, and how to choose the right tool for each job.
## The Three Major Cloud Providers

| Provider | Market Position | Strengths for Data Analytics | Key Analytics Services |
|---|---|---|---|
| Amazon Web Services (AWS) | Largest by market share (~33%) | Broadest service catalog, mature ecosystem, extensive third-party integrations | Redshift, Athena, Glue, S3, QuickSight, EMR |
| Google Cloud Platform (GCP) | Third (~11%), strong analytics focus | BigQuery's serverless SQL, tight integration with Google products, ML leadership | BigQuery, Dataflow, Looker, Pub/Sub, Vertex AI |
| Microsoft Azure | Second (~22%), enterprise focus | Deep Microsoft/Office integration, strong hybrid cloud, Power BI dominance | Synapse Analytics, Azure Data Factory, Power BI, Azure Databricks, ADLS |
Most organizations use more than one provider. An analyst might query data in BigQuery, orchestrate pipelines with AWS Glue, and publish dashboards via Power BI. Understanding each platform's strengths allows you to select the right tool rather than forcing every workload into one ecosystem.
## Core Cloud Storage for Analytics

### Object Storage

Object storage is the foundation of cloud analytics. AWS S3, Google Cloud Storage (GCS), and Azure Blob Storage all provide durable, effectively unlimited storage at low cost (roughly $0.02 per GB/month for standard tiers). Raw data — CSV, Parquet, JSON, Avro — lives here before being queried or processed. Object storage underpins data lake architectures, where schema is applied at query time ("schema on read") rather than at ingestion.
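How objects are laid out in the lake matters as much as the format. Below is a minimal sketch (table, partition columns, and file names are illustrative) of the hive-style `key=value` path convention that Athena, BigQuery external tables, and Spark all recognize, which lets engines prune partitions from the path alone:

```python
from datetime import date

def object_key(table: str, day: date, filename: str) -> str:
    """Build a hive-style partitioned object key (year=/month=/day=)
    so query engines can skip irrelevant prefixes entirely."""
    return (
        f"{table}/year={day.year}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )

key = object_key("sales", date(2024, 3, 15), "part-0001.parquet")
print(key)  # sales/year=2024/month=03/day=15/part-0001.parquet
```

A query filtered to `year=2024 AND month=3` then only lists and reads objects under that prefix, which is exactly the partition pruning the Athena example later in this guide relies on.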
### File Formats That Matter

| Format | Type | Best For | Compression |
|---|---|---|---|
| CSV | Row-oriented, text | Simple exports, interoperability | Poor |
| JSON / JSONL | Row-oriented, text | Nested/semi-structured data, APIs | Moderate |
| Parquet | Columnar, binary | Analytical queries, large datasets | Excellent |
| ORC | Columnar, binary | Hive / Spark workloads | Excellent |
| Avro | Row-oriented, binary | Streaming, schema evolution | Good |
| Delta Lake / Iceberg | Columnar + metadata layer | ACID transactions on data lakes | Excellent |
Parquet is the default choice for analytical workloads — columnar storage means queries that touch only a few columns scan far less data, reducing both cost and latency. Always partition Parquet files by common filter columns (date, region, category) to enable partition pruning.
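A back-of-the-envelope model makes the columnar advantage concrete. The sketch below assumes roughly equal column widths (real tables rarely have this, so treat it as an upper-level intuition, not a billing formula):

```python
def bytes_scanned(total_bytes: int, cols_read: int, total_cols: int,
                  columnar: bool) -> int:
    """Rough scan estimate: a row-oriented format reads every byte,
    a columnar format reads only the requested columns (equal column
    widths assumed -- a simplification)."""
    if not columnar:
        return total_bytes
    return total_bytes * cols_read // total_cols

table_bytes = 500 * 10**9  # a 500 GB table with 20 columns
csv_scan = bytes_scanned(table_bytes, 3, 20, columnar=False)
parquet_scan = bytes_scanned(table_bytes, 3, 20, columnar=True)
print(f"CSV scan:     {csv_scan / 1e9:.0f} GB")      # 500 GB
print(f"Parquet scan: {parquet_scan / 1e9:.0f} GB")  # 75 GB
```

On a pay-per-byte engine, touching 3 of 20 columns in Parquet scans 85% less data than the same query over CSV, before compression savings are even counted.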
## Cloud Data Warehouses

### Amazon Redshift
Redshift is AWS's managed columnar data warehouse. It runs on provisioned clusters (you choose node type and count) or Redshift Serverless (auto-scaling, pay per query). Redshift Spectrum allows querying S3 data directly without loading it. The COPY command efficiently bulk-loads data from S3. Redshift excels at complex multi-table joins on structured data at the terabyte scale.
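The COPY path is typically scripted rather than typed by hand. A sketch of composing the statement an orchestration job would submit through a Postgres-compatible driver such as psycopg2 (the table, bucket, and IAM role ARN here are placeholders):

```python
def redshift_copy_sql(table: str, s3_prefix: str, iam_role: str) -> str:
    """Compose a Redshift COPY statement that bulk-loads Parquet
    files from an S3 prefix (all names are illustrative)."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS PARQUET;"
    )

sql = redshift_copy_sql(
    "analytics.orders",
    "s3://my-data-lake/orders/",
    "arn:aws:iam::123456789012:role/redshift-loader",
)
print(sql)
```

COPY parallelizes the load across the cluster's slices, which is why it is preferred over row-by-row INSERTs for any non-trivial volume.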
### Google BigQuery
BigQuery is GCP's fully serverless data warehouse — there are no clusters to manage. Storage and compute are billed separately. Queries are priced at $5 per TB scanned (on-demand), making it critical to use partitioned and clustered tables. BigQuery ML allows training machine learning models with SQL. Its streaming insert API enables near-real-time analytics. BigQuery is often the fastest path from raw data to SQL-accessible analytics.
```sql
-- BigQuery: partition by date, cluster by user_id to minimize scan costs
CREATE TABLE analytics.events
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id
AS SELECT * FROM raw.events_staging;

-- Query only the last 7 days (partition pruning kicks in)
SELECT
    user_id,
    COUNT(*) AS event_count
FROM analytics.events
WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 100;
```
### Azure Synapse Analytics
Synapse combines a data warehouse (formerly SQL Data Warehouse), Spark pools, and data integration into one service. It integrates natively with Azure Data Lake Storage Gen2 and Power BI. Dedicated SQL pools offer predictable performance; serverless SQL pools let you query data lake files without any provisioning. Synapse is the natural choice for organizations already invested in the Microsoft stack.
### Snowflake (Multi-Cloud)
Snowflake runs on AWS, GCP, and Azure and separates storage, compute, and cloud services into three distinct layers. Multiple virtual warehouses can query the same data concurrently without contention. Snowflake's Time Travel feature lets you query data as it existed at any point in the last 90 days — invaluable for debugging pipeline errors. Its data sharing feature allows sharing live data with external partners without copying it.
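Time Travel is driven by an `AT (TIMESTAMP => ...)` clause on the table reference. A small sketch of composing such a query from Python (table name and timestamp are invented for illustration):

```python
from datetime import datetime, timezone

def time_travel_query(table: str, as_of: datetime) -> str:
    """Build a Snowflake query that reads a table as it existed
    at a past instant, via the AT (TIMESTAMP => ...) clause."""
    ts = as_of.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"SELECT * FROM {table} "
        f"AT (TIMESTAMP => '{ts}'::timestamp_ntz);"
    )

sql = time_travel_query(
    "analytics.orders",
    datetime(2024, 3, 14, 9, 30, tzinfo=timezone.utc),
)
print(sql)
```

Running the same aggregate at two timestamps and diffing the results is a quick way to pinpoint when a pipeline bug first corrupted a table.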
## Data Pipeline and ETL Services

| Service | Provider | Type | Best For |
|---|---|---|---|
| AWS Glue | AWS | Serverless ETL + data catalog | S3-to-Redshift pipelines, schema discovery |
| Google Dataflow | GCP | Managed Apache Beam | Streaming and batch pipelines, unified model |
| Azure Data Factory | Azure | Visual ETL / ELT orchestration | Connecting 90+ data sources, low-code pipelines |
| AWS Lambda | AWS | Serverless compute | Event-driven micro-transformations, API triggers |
| Google Cloud Functions | GCP | Serverless compute | Lightweight event-driven processing |
| Databricks (multi-cloud) | AWS/GCP/Azure | Managed Apache Spark + MLflow | Large-scale transformations, ML pipelines |
## Querying Data Lakes with Serverless SQL
A key skill for cloud analysts is querying data sitting in object storage without loading it into a warehouse first. AWS Athena (powered by Presto/Trino), BigQuery external tables, and Azure Synapse serverless SQL all enable this pattern.
```sql
-- AWS Athena: query Parquet files in S3 directly
CREATE EXTERNAL TABLE sales_raw (
    order_id STRING,
    customer_id STRING,
    amount DOUBLE,
    order_date DATE
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET
LOCATION 's3://my-data-lake/sales/';

-- Load partitions (run after new data arrives)
MSCK REPAIR TABLE sales_raw;

-- Query with partition filter — only scans relevant S3 prefixes
SELECT
    customer_id,
    SUM(amount) AS total_revenue
FROM sales_raw
WHERE year = 2024 AND month = 3
GROUP BY customer_id
ORDER BY total_revenue DESC;
```
## Cloud BI and Visualization Tools

| Tool | Provider | Strengths | Typical Use Case |
|---|---|---|---|
| Amazon QuickSight | AWS | Native AWS integration, SPICE in-memory engine, per-session pricing | Embedded analytics, AWS-native dashboards |
| Looker / Looker Studio | GCP | LookML semantic layer, governed metrics, free Looker Studio tier | Self-service BI with controlled metric definitions |
| Power BI | Azure / Microsoft | Excel familiarity, DirectQuery, enterprise governance | Microsoft-stack organizations, executive reporting |
| Tableau (multi-cloud) | Salesforce | Best-in-class visualization, Hyper engine | Complex visual analytics, data storytelling |
## Cloud Cost Management for Analysts

Cloud costs can escalate quickly without discipline. Analysts directly control costs through how they write queries and structure data. Key practices:

- Use columnar formats (Parquet typically reduces bytes scanned by 60–90% versus CSV).
- Partition tables by the columns most commonly used in filters.
- Cluster or sort tables on secondary filter columns.
- Avoid SELECT * on wide tables.
- Set query cost limits in BigQuery or workload management rules in Redshift.
```python
# Python: estimate BigQuery query cost before running it
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT user_id, SUM(revenue) AS total
FROM `project.dataset.transactions`
WHERE DATE(created_at) >= '2024-01-01'
GROUP BY user_id
"""

# A dry run validates the query and reports bytes scanned without executing it
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(query, job_config=job_config)

bytes_processed = job.total_bytes_processed
cost_usd = (bytes_processed / 1e12) * 5  # on-demand rate: $5 per TB
print(f"Bytes to process: {bytes_processed / 1e9:.2f} GB")
print(f"Estimated cost: ${cost_usd:.4f}")
```
## Infrastructure as Code for Analysts
Modern analytics teams define cloud resources in code rather than clicking through consoles. Terraform is the dominant tool for provisioning infrastructure (storage buckets, data warehouse clusters, IAM roles) in a repeatable, version-controlled way. dbt (data build tool) applies software engineering practices — version control, testing, documentation, modular transformations — to SQL-based data modeling inside cloud warehouses. Together, Terraform and dbt form the backbone of a production-grade analytics engineering workflow.
## Security and Access Control
Each provider has its own identity and access management system: AWS IAM, Google Cloud IAM, and Azure RBAC/Active Directory. The principle of least privilege applies: analysts should have read access to the tables they need and nothing more. Service accounts (not personal credentials) should be used for automated pipelines. All three providers support column-level security (restricting access to specific columns containing PII) and row-level security (filtering rows based on the requesting user's attributes). Data should always be encrypted at rest and in transit — all major cloud services enable this by default.
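Row-level security is easiest to see as a filter the engine injects on the requesting user's behalf. A toy in-memory sketch of that idea (users, regions, and rows are all invented), which BigQuery row access policies and Redshift RLS policies implement inside the engine:

```python
# Illustrative mapping of users to the region they may see
USER_REGIONS = {"ana": "EMEA", "ben": "APAC"}

rows = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "APAC", "amount": 75.5},
    {"order_id": 3, "region": "EMEA", "amount": 20.0},
]

def rows_for(user: str) -> list[dict]:
    """Return only the rows the user's region attribute permits —
    the filter is applied server-side, not trusted to the client."""
    region = USER_REGIONS[user]
    return [r for r in rows if r["region"] == region]

print(rows_for("ana"))  # only the two EMEA rows
```

The important property is that the policy lives with the table, so every query path (BI tool, notebook, API) gets the same filtered view.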
## Choosing the Right Cloud Service

| Task | Recommended Approach |
|---|---|
| Store raw data files | S3 / GCS / Blob Storage in Parquet format, partitioned by date |
| Ad-hoc SQL on data lake | AWS Athena, BigQuery external tables, or Synapse serverless SQL |
| Structured data warehouse | Redshift, BigQuery, Synapse dedicated pool, or Snowflake |
| Streaming ingestion | AWS Kinesis, GCP Pub/Sub, Azure Event Hubs |
| Batch ETL pipelines | AWS Glue, GCP Dataflow, Azure Data Factory, or dbt |
| Large-scale Spark processing | Databricks, AWS EMR, GCP Dataproc |
| BI dashboards | Power BI (Microsoft shops), Looker/Looker Studio (GCP), QuickSight (AWS) |
| ML model training | BigQuery ML (SQL-based), Vertex AI, SageMaker, Azure ML |
## Summary
Cloud platforms have made scalable, cost-effective analytics accessible to organizations of all sizes. AWS offers the broadest catalog and most mature ecosystem; GCP leads on serverless analytics and ML with BigQuery and Vertex AI; Azure dominates in enterprise and Microsoft-integrated environments. Every analyst benefits from understanding object storage, columnar file formats, cloud data warehouses, and cost management principles — these fundamentals apply regardless of which provider your organization uses. As cloud services continue to converge, the most valuable skill is knowing which service solves which problem, and being able to build pipelines that move data reliably from source to insight.