# Why Cloud Platforms Matter for Data Analysts
Cloud platforms have fundamentally changed how data analysts work. Instead of managing on-premises servers, analysts today use cloud infrastructure to store petabytes of data, run distributed queries in seconds, build scalable pipelines, and share interactive dashboards with global teams — all without touching hardware. Understanding the major cloud providers and their data services has become as essential as knowing SQL.
This guide covers the three dominant platforms — AWS, Google Cloud, and Azure — along with the core services analysts use daily, cost management principles, and how to choose the right tool for each job.
## The Three Major Cloud Providers

| Provider | Market Position | Strengths for Data Analytics | Key Analytics Services |
|---|---|---|---|
| Amazon Web Services (AWS) | Largest by market share (~33%) | Broadest service catalog, mature ecosystem, extensive third-party integrations | Redshift, Athena, Glue, S3, QuickSight, EMR |
| Google Cloud Platform (GCP) | Third (~11%), strong analytics focus | BigQuery's serverless SQL, tight integration with Google products, ML leadership | BigQuery, Dataflow, Looker, Pub/Sub, Vertex AI |
| Microsoft Azure | Second (~22%), enterprise focus | Deep Microsoft/Office integration, strong hybrid cloud, Power BI dominance | Synapse Analytics, Azure Data Factory, Power BI, Azure Databricks, ADLS |
Most organizations use more than one provider. An analyst might query data in BigQuery, orchestrate pipelines with AWS Glue, and publish dashboards via Power BI. Understanding each platform's strengths allows you to select the right tool rather than forcing every workload into one ecosystem.
## Core Cloud Storage for Analytics

### Object Storage

Object storage is the foundation of cloud analytics. AWS S3, Google Cloud Storage (GCS), and Azure Blob Storage all provide durable, effectively unlimited storage at low cost (roughly $0.02 per GB/month for standard tiers). Raw data — CSV, Parquet, JSON, Avro — lives here before being queried or processed. Object storage underpins data lake architectures, where schema is applied at query time ("schema on read") rather than at ingestion.
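How objects are laid out in the lake matters as much as the format. Below is a minimal sketch (table, partition columns, and file names are illustrative) of the hive-style `key=value` path convention that Athena, BigQuery external tables, and Spark all recognize, which lets engines prune partitions from the path alone:

```python
from datetime import date

def object_key(table: str, day: date, filename: str) -> str:
    """Build a hive-style partitioned object key (year=/month=/day=)
    so query engines can skip irrelevant prefixes entirely."""
    return (
        f"{table}/year={day.year}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )

key = object_key("sales", date(2024, 3, 15), "part-0001.parquet")
print(key)  # sales/year=2024/month=03/day=15/part-0001.parquet
```

A query filtered to `year=2024 AND month=3` then only lists and reads objects under that prefix, which is exactly the partition pruning the Athena example later in this guide relies on.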
### File Formats That Matter

| Format | Type | Best For | Compression |
|---|---|---|---|
| CSV | Row-oriented, text | Simple exports, interoperability | Poor |
| JSON / JSONL | Row-oriented, text | Nested/semi-structured data, APIs | Moderate |
| Parquet | Columnar, binary | Analytical queries, large datasets | Excellent |
| ORC | Columnar, binary | Hive / Spark workloads | Excellent |
| Avro | Row-oriented, binary | Streaming, schema evolution | Good |
| Delta Lake / Iceberg | Columnar + metadata layer | ACID transactions on data lakes | Excellent |
Parquet is the default choice for analytical workloads — columnar storage means queries that touch only a few columns scan far less data, reducing both cost and latency. Always partition Parquet files by common filter columns (date, region, category) to enable partition pruning.
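A back-of-the-envelope model makes the columnar advantage concrete. The sketch below assumes roughly equal column widths (real tables rarely have this, so treat it as an upper-level intuition, not a billing formula):

```python
def bytes_scanned(total_bytes: int, cols_read: int, total_cols: int,
                  columnar: bool) -> int:
    """Rough scan estimate: a row-oriented format reads every byte,
    a columnar format reads only the requested columns (equal column
    widths assumed -- a simplification)."""
    if not columnar:
        return total_bytes
    return total_bytes * cols_read // total_cols

table_bytes = 500 * 10**9  # a 500 GB table with 20 columns
csv_scan = bytes_scanned(table_bytes, 3, 20, columnar=False)
parquet_scan = bytes_scanned(table_bytes, 3, 20, columnar=True)
print(f"CSV scan:     {csv_scan / 1e9:.0f} GB")      # 500 GB
print(f"Parquet scan: {parquet_scan / 1e9:.0f} GB")  # 75 GB
```

On a pay-per-byte engine, touching 3 of 20 columns in Parquet scans 85% less data than the same query over CSV, before compression savings are even counted.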
## Cloud Data Warehouses

### Amazon Redshift
Redshift is AWS's managed columnar data warehouse. It runs on provisioned clusters (you choose node type and count) or Redshift Serverless (auto-scaling, pay per query). Redshift Spectrum allows querying S3 data directly without loading it. The COPY command efficiently bulk-loads data from S3. Redshift excels at complex multi-table joins on structured data at the terabyte scale.
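The COPY path is typically scripted rather than typed by hand. A sketch of composing the statement an orchestration job would submit through a Postgres-compatible driver such as psycopg2 (the table, bucket, and IAM role ARN here are placeholders):

```python
def redshift_copy_sql(table: str, s3_prefix: str, iam_role: str) -> str:
    """Compose a Redshift COPY statement that bulk-loads Parquet
    files from an S3 prefix (all names are illustrative)."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS PARQUET;"
    )

sql = redshift_copy_sql(
    "analytics.orders",
    "s3://my-data-lake/orders/",
    "arn:aws:iam::123456789012:role/redshift-loader",
)
print(sql)
```

COPY parallelizes the load across the cluster's slices, which is why it is preferred over row-by-row INSERTs for any non-trivial volume.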
### Google BigQuery
BigQuery is GCP's fully serverless data warehouse — there are no clusters to manage. Storage and compute are billed separately. Queries are priced at $5 per TB scanned (on-demand), making it critical to use partitioned and clustered tables. BigQuery ML allows training machine learning models with SQL. Its streaming insert API enables near-real-time analytics. BigQuery is often the fastest path from raw data to SQL-accessible analytics.
```sql
-- BigQuery: partition by date, cluster by user_id to minimize scan costs
CREATE TABLE analytics.events
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id
AS SELECT * FROM raw.events_staging;

-- Query only the last 7 days (partition pruning kicks in)
SELECT
    user_id,
    COUNT(*) AS event_count
FROM analytics.events
WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 100;
```
### Azure Synapse Analytics
Synapse combines a data warehouse (formerly SQL Data Warehouse), Spark pools, and data integration into one service. It integrates natively with Azure Data Lake Storage Gen2 and Power BI. Dedicated SQL pools offer predictable performance; serverless SQL pools let you query data lake files without any provisioning. Synapse is the natural choice for organizations already invested in the Microsoft stack.
### Snowflake (Multi-Cloud)
Snowflake runs on AWS, GCP, and Azure and separates storage, compute, and cloud services into three distinct layers. Multiple virtual warehouses can query the same data concurrently without contention. Snowflake's Time Travel feature lets you query data as it existed at any point in the last 90 days — invaluable for debugging pipeline errors. Its data sharing feature allows sharing live data with external partners without copying it.
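Time Travel is driven by an `AT (TIMESTAMP => ...)` clause on the table reference. A small sketch of composing such a query from Python (table name and timestamp are invented for illustration):

```python
from datetime import datetime, timezone

def time_travel_query(table: str, as_of: datetime) -> str:
    """Build a Snowflake query that reads a table as it existed
    at a past instant, via the AT (TIMESTAMP => ...) clause."""
    ts = as_of.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"SELECT * FROM {table} "
        f"AT (TIMESTAMP => '{ts}'::timestamp_ntz);"
    )

sql = time_travel_query(
    "analytics.orders",
    datetime(2024, 3, 14, 9, 30, tzinfo=timezone.utc),
)
print(sql)
```

Running the same aggregate at two timestamps and diffing the results is a quick way to pinpoint when a pipeline bug first corrupted a table.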
## Data Pipeline and ETL Services

| Service | Provider | Type | Best For |
|---|---|---|---|
| AWS Glue | AWS | Serverless ETL + data catalog | S3-to-Redshift pipelines, schema discovery |
| Google Dataflow | GCP | Managed Apache Beam | Streaming and batch pipelines, unified model |
| Azure Data Factory | Azure | Visual ETL / ELT orchestration | Connecting 90+ data sources, low-code pipelines |
| AWS Lambda | AWS | Serverless compute | Event-driven micro-transformations, API triggers |
| Google Cloud Functions | GCP | Serverless compute | Lightweight event-driven processing |
| Databricks (multi-cloud) | AWS/GCP/Azure | Managed Apache Spark + MLflow | Large-scale transformations, ML pipelines |
## Querying Data Lakes with Serverless SQL
A key skill for cloud analysts is querying data sitting in object storage without loading it into a warehouse first. AWS Athena (powered by Presto/Trino), BigQuery external tables, and Azure Synapse serverless SQL all enable this pattern.
```sql
-- AWS Athena: query Parquet files in S3 directly
CREATE EXTERNAL TABLE sales_raw (
    order_id STRING,
    customer_id STRING,
    amount DOUBLE,
    order_date DATE
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET
LOCATION 's3://my-data-lake/sales/';

-- Load partitions (run after new data arrives)
MSCK REPAIR TABLE sales_raw;

-- Query with partition filter — only scans relevant S3 prefixes
SELECT
    customer_id,
    SUM(amount) AS total_revenue
FROM sales_raw
WHERE year = 2024 AND month = 3
GROUP BY customer_id
ORDER BY total_revenue DESC;
```
## Cloud BI and Visualization Tools

| Tool | Provider | Strengths | Typical Use Case |
|---|---|---|---|
| Amazon QuickSight | AWS | Native AWS integration, SPICE in-memory engine, per-session pricing | Embedded analytics, AWS-native dashboards |
| Looker / Looker Studio | GCP | LookML semantic layer, governed metrics, free Looker Studio tier | Self-service BI with controlled metric definitions |
| Power BI | Azure / Microsoft | Excel familiarity, DirectQuery, enterprise governance | Microsoft-stack organizations, executive reporting |
| Tableau (multi-cloud) | Salesforce | Best-in-class visualization, Hyper engine | Complex visual analytics, data storytelling |
## Cloud Cost Management for Analysts

Cloud costs can escalate quickly without discipline. Analysts directly control costs through how they write queries and structure data. Key practices:

- Use columnar formats (Parquet typically reduces bytes scanned by 60–90% versus CSV).
- Partition tables by the columns most commonly used in filters.
- Cluster or sort tables on secondary filter columns.
- Avoid SELECT * on wide tables.
- Set query cost limits in BigQuery or workload management rules in Redshift.
```python
# Python: estimate BigQuery query cost before running it
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT user_id, SUM(revenue) AS total
FROM `project.dataset.transactions`
WHERE DATE(created_at) >= '2024-01-01'
GROUP BY user_id
"""

# A dry run validates the query and reports bytes scanned without executing it
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(query, job_config=job_config)

bytes_processed = job.total_bytes_processed
cost_usd = (bytes_processed / 1e12) * 5  # on-demand rate: $5 per TB
print(f"Bytes to process: {bytes_processed / 1e9:.2f} GB")
print(f"Estimated cost: ${cost_usd:.4f}")
```
## Infrastructure as Code for Analysts
Modern analytics teams define cloud resources in code rather than clicking through consoles. Terraform is the dominant tool for provisioning infrastructure (storage buckets, data warehouse clusters, IAM roles) in a repeatable, version-controlled way. dbt (data build tool) applies software engineering practices — version control, testing, documentation, modular transformations — to SQL-based data modeling inside cloud warehouses. Together, Terraform and dbt form the backbone of a production-grade analytics engineering workflow.
## Security and Access Control
Each provider has its own identity and access management system: AWS IAM, Google Cloud IAM, and Azure RBAC/Active Directory. The principle of least privilege applies: analysts should have read access to the tables they need and nothing more. Service accounts (not personal credentials) should be used for automated pipelines. All three providers support column-level security (restricting access to specific columns containing PII) and row-level security (filtering rows based on the requesting user's attributes). Data should always be encrypted at rest and in transit — all major cloud services enable this by default.
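Row-level security is easiest to see as a filter the engine injects on the requesting user's behalf. A toy in-memory sketch of that idea (users, regions, and rows are all invented), which BigQuery row access policies and Redshift RLS policies implement inside the engine:

```python
# Illustrative mapping of users to the region they may see
USER_REGIONS = {"ana": "EMEA", "ben": "APAC"}

rows = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "APAC", "amount": 75.5},
    {"order_id": 3, "region": "EMEA", "amount": 20.0},
]

def rows_for(user: str) -> list[dict]:
    """Return only the rows the user's region attribute permits —
    the filter is applied server-side, not trusted to the client."""
    region = USER_REGIONS[user]
    return [r for r in rows if r["region"] == region]

print(rows_for("ana"))  # only the two EMEA rows
```

The important property is that the policy lives with the table, so every query path (BI tool, notebook, API) gets the same filtered view.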
## Choosing the Right Cloud Service

| Task | Recommended Approach |
|---|---|
| Store raw data files | S3 / GCS / Blob Storage in Parquet format, partitioned by date |
| Ad-hoc SQL on data lake | AWS Athena, BigQuery external tables, or Synapse serverless SQL |
| Structured data warehouse | Redshift, BigQuery, Synapse dedicated pool, or Snowflake |
| Streaming ingestion | AWS Kinesis, GCP Pub/Sub, Azure Event Hubs |
| Batch ETL pipelines | AWS Glue, GCP Dataflow, Azure Data Factory, or dbt |
| Large-scale Spark processing | Databricks, AWS EMR, GCP Dataproc |
| BI dashboards | Power BI (Microsoft shops), Looker/Looker Studio (GCP), QuickSight (AWS) |
| ML model training | BigQuery ML (SQL-based), Vertex AI, SageMaker, Azure ML |
## Summary
Cloud platforms have made scalable, cost-effective analytics accessible to organizations of all sizes. AWS offers the broadest catalog and most mature ecosystem; GCP leads on serverless analytics and ML with BigQuery and Vertex AI; Azure dominates in enterprise and Microsoft-integrated environments. Every analyst benefits from understanding object storage, columnar file formats, cloud data warehouses, and cost management principles — these fundamentals apply regardless of which provider your organization uses. As cloud services continue to converge, the most valuable skill is knowing which service solves which problem, and being able to build pipelines that move data reliably from source to insight.