Previewing Data Structures
Inspect schemas and sample data before downloading full datasets
Verify data structure, column types, and content before committing to a full dataset download.
When to use this guide
Use this guide when you need to:
- Verify required columns exist
- Check data types before processing
- Estimate data quality from samples
- Avoid downloading incompatible formats
Prerequisites
- Dataset ID or download URL
- Understanding of your data requirements (required columns, types)
Quick schema preview
Ask Claude to preview
"Show me the schema for dataset [dataset-id]"
Claude fetches the distributions, selects a downloadable format, and previews the schema automatically.
What you'll see:
- Column names
- Inferred data types
- Sample values
- Row count estimate
Direct schema preview
# First, get download URL
distributions = get_dataset_distributions(dataset_id="bev-stat-wien-2024")
csv_url = next(d['downloadURL'] for d in distributions
               if d.get('format') == 'CSV')
# Preview schema
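schema = preview_schema(url=csv_url, format="csv")
Parameters (as used in this guide; see the API Reference for the complete list):
- url (string): download URL of the file to preview
- format (string): file format, for example "csv"; may be omitted to rely on auto-detection
- max_bytes (integer): upper limit on the bytes fetched for the preview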
Returns:
{
  "url": "https://data.wien.gv.at/.../data.csv",
  "format": "csv",
  "partial_fetch": true,
  "bytes_fetched": 65536,
  "columns": [
    {
      "name": "jahr",
      "type": "integer",
      "sample_values": [2022, 2023, 2024]
    },
    {
      "name": "bezirk",
      "type": "string",
      "sample_values": ["Innere Stadt", "Leopoldstadt"]
    }
  ]
}
Error handling:
NetworkError:
{"error": "NetworkError", "message": "Failed to fetch URL"}
Solution: The URL may be stale; fetch fresh distributions and retry.
FormatError:
{"error": "FormatError", "message": "Could not detect delimiter"}
Solution: Specify the format explicitly: format="csv"
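The `next(...)` expression in the snippet above raises StopIteration when no CSV distribution exists, and a failed preview returns an error object rather than a schema. The following is a minimal sketch of guarding both cases. It assumes distributions are a list of dicts with `format` and `downloadURL` keys and that errors carry an `error` key, as in the examples above; `pick_download_url` and `is_error` are hypothetical helper names, and the records are hand-written, not real catalogue output.

```python
from typing import List, Optional, Tuple


def pick_download_url(distributions: List[dict],
                      preferred: Tuple[str, ...] = ("CSV",)) -> Optional[str]:
    """Return the first download URL whose format matches, else None."""
    for wanted in preferred:
        for dist in distributions:
            # Compare case-insensitively, since catalogues may write "csv" or "CSV"
            fmt = str(dist.get("format", "")).upper()
            if fmt == wanted.upper() and dist.get("downloadURL"):
                return dist["downloadURL"]
    return None


def is_error(response: dict) -> bool:
    """True if a response looks like the error objects shown above."""
    return isinstance(response, dict) and "error" in response


# Hand-written example records (assumed shape, not real output)
distributions = [
    {"format": "PDF", "downloadURL": "https://example.org/report.pdf"},
    {"format": "csv", "downloadURL": "https://example.org/data.csv"},
]
csv_url = pick_download_url(distributions)
if csv_url is None:
    print("No CSV distribution available")
```

Checking `is_error(schema)` before reading `schema['columns']` keeps a NetworkError or FormatError from surfacing later as a KeyError.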
Validating required columns
Ask Claude to verify columns
"Does dataset [id] have columns X, Y, Z?"
Claude fetches the schema and checks for the required columns.
Programmatic column validation
# Get schema
schema = preview_schema(url=csv_url)
# Define requirements
required_columns = ["jahr", "bezirk", "faelle"]
actual_columns = [c['name'] for c in schema['columns']]
# Check presence
missing = set(required_columns) - set(actual_columns)
if missing:
    print(f"Missing columns: {missing}")
else:
    print("All required columns present")
# Verify types
jahr_col = next(c for c in schema['columns'] if c['name'] == 'jahr')
if jahr_col['type'] != 'integer':
    print("Warning: Jahr column not integer type")
Error handling:
Column not found:
try:
    jahr_col = next(c for c in schema['columns'] if c['name'] == 'jahr')
except StopIteration:
    print("Jahr column not found in schema")
Previewing data samples
Ask for sample rows
"Show me sample rows from dataset [id]"
Claude fetches the first 10-20 rows to show the data structure.
Direct data preview
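data = preview_data(url=csv_url, format="csv", max_rows=10)
Parameters (as used in this guide; see the API Reference for the complete list):
- url (string): download URL of the file to preview
- format (string): file format, for example "csv"
- max_rows (integer): maximum number of sample rows to return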
Returns:
{
  "url": "...",
  "format": "csv",
  "rows": [
    {"jahr": 2024, "bezirk": "Wien", "faelle": 150},
    {"jahr": 2024, "bezirk": "Graz", "faelle": 120}
  ],
  "row_count": 10,
  "estimated_total_rows": 1000
}
Error handling:
ParseError:
{"error": "ParseError", "message": "Invalid CSV at row 5"}
Solution: The file may be corrupted or malformed; inspect the rows around the reported position.
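When a ParseError points at a specific row, one way to narrow the problem down is to check a manually downloaded sample for rows whose field count differs from the header. The sketch below uses only Python's standard csv module and is independent of preview_data; `find_ragged_rows` is a hypothetical helper name and the sample text is made up for illustration.

```python
import csv
import io
from typing import List, Tuple


def find_ragged_rows(text: str, delimiter: str = ",") -> List[Tuple[int, int]]:
    """Return (row_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []  # empty input, nothing to check
    expected = len(header)
    ragged = []
    # Row 1 is the header, so data rows are numbered from 2
    for row_number, row in enumerate(reader, start=2):
        # Blank lines parse as empty lists and are ignored
        if row and len(row) != expected:
            ragged.append((row_number, len(row)))
    return ragged


# Made-up sample: row 3 is missing a field, row 4 has one too many
sample = "jahr,bezirk,faelle\n2024,Wien,150\n2024,Graz\n2023,Wien,140,extra\n"
problems = find_ragged_rows(sample)
for row_number, field_count in problems:
    print(f"Row {row_number}: {field_count} fields")
```

Row numbers here count parsed records, so they can differ from physical line numbers if quoted fields contain line breaks.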
Checking data quality from samples
Quality assessment questions
Ask Claude to check data quality:
- "Are there null values in the dataset?"
- "What's the date range of this data?"
- "How many unique regions are in the sample?"
Claude analyzes preview data to answer.
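The three questions above map onto simple per-column statistics: null rate, numeric range, and distinct count. The sketch below computes them for a list of row dicts shaped like the `rows` field of a preview_data response. `profile_rows` is a hypothetical helper, the rows are hand-written, and treating empty strings as null is an assumption.

```python
from typing import Any, Dict, List


def profile_rows(rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Summarise each column of a list of row dicts."""
    columns = sorted({key for row in rows for key in row})
    profile = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and "" as missing (assumption about how nulls appear)
        present = [v for v in values if v is not None and v != ""]
        # Only numeric values take part in min/max, so mixed types cannot raise
        numeric = [v for v in present
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        profile[col] = {
            "null_rate": 1 - len(present) / len(rows),
            "distinct": len({repr(v) for v in present}),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
        }
    return profile


# Hand-written sample rows (not real data)
rows = [
    {"jahr": 2023, "bezirk": "Innere Stadt", "faelle": 150},
    {"jahr": 2024, "bezirk": "Leopoldstadt", "faelle": None},
    {"jahr": 2024, "bezirk": "", "faelle": 120},
]
profile = profile_rows(rows)
print(profile["jahr"])
```

These figures describe the sample only; a clean sample does not guarantee a clean full dataset.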
Programmatic quality checks
preview = preview_data(url=csv_url, max_rows=50)
# Check for null values
for row in preview['rows']:
    for key, value in row.items():
        if value is None or value == "":
            print(f"Null value in column: {key}")
# Check date range (skip rows where jahr is missing or empty)
dates = [row['jahr'] for row in preview['rows'] if row.get('jahr') not in (None, "")]
if dates:
    print(f"Date range: {min(dates)} to {max(dates)}")
# Check unique values
bezirke = {row['bezirk'] for row in preview['rows'] if 'bezirk' in row}
print(f"Unique districts in sample: {len(bezirke)}")
Common issues:
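Sparse data:
# Compare empty cells (None or "") against all cells in the sample
cells = [v for row in preview['rows'] for v in row.values()]
null_count = sum(1 for v in cells if v is None or v == "")
if cells and null_count > len(cells) * 0.5:
    print("Warning: >50% null values in sample")
Inconsistent types: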
# Check if numeric column has non-numeric values
faelle = [row.get('faelle') for row in preview['rows']]
non_numeric = [c for c in faelle if not isinstance(c, (int, float))]
if non_numeric:
    print(f"Warning: Non-numeric values in faelle: {non_numeric}")
Troubleshooting
Preview fails on valid URL
Symptom: URL works in browser but preview fails
Cause: Server doesn't support HTTP Range requests
Solutions:
- No action is strictly required: preview falls back to a full download (may be slow)
- Use a smaller max_bytes value
- Check whether a different format is available
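To check the cause before retrying, you can look at whether the server advertises byte-range support through the standard `Accept-Ranges` response header. The sketch below uses Python's urllib; `advertises_byte_ranges` and `check_url` are hypothetical helpers and not part of the preview tools. A missing header is not conclusive, since some servers honour Range requests without advertising them.

```python
import urllib.request
from typing import Mapping


def advertises_byte_ranges(headers: Mapping[str, str]) -> bool:
    """True if the headers contain 'Accept-Ranges: bytes' (case-insensitive)."""
    for name, value in headers.items():
        if name.lower() == "accept-ranges":
            return value.strip().lower() == "bytes"
    return False


def check_url(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and report whether byte ranges are advertised."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return advertises_byte_ranges(dict(response.headers.items()))


# Offline illustration with hand-written header sets
print(advertises_byte_ranges({"Accept-Ranges": "bytes"}))
print(advertises_byte_ranges({"Accept-Ranges": "none"}))
```

Calling `check_url(csv_url)` performs a real network request, so it is left out of the offline illustration.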
Type inference incorrect
Symptom: Column shows "string" but contains numbers
Cause: Sample rows have inconsistent types or headers
Solutions:
- Increase max_bytes to sample more rows
- Check if first rows are headers or metadata
- Use preview_data to see actual values
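To see why a single stray value can turn a numeric column into "string", consider a simple inference rule that only reports a numeric type when every sampled value parses. This is an illustrative sketch, not the preview tool's actual algorithm; `infer_type` is a hypothetical function and the type names simply mirror those in the schema example.

```python
from typing import Any, Iterable


def infer_type(values: Iterable[Any]) -> str:
    """Guess 'integer', 'float', or 'string' for raw cell values."""
    # Ignore missing and blank cells so they do not affect the guess
    seen = [str(v).strip() for v in values
            if v is not None and str(v).strip() != ""]
    if not seen:
        return "string"

    def all_parse(parser) -> bool:
        # True only if every sampled value is accepted by the parser
        for text in seen:
            try:
                parser(text)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"


print(infer_type(["2022", "2023", "2024"]))
print(infer_type(["Faelle (Anzahl)", "150", "120"]))
```

In the second call a metadata row sits above the data, so the whole column falls back to "string" even though the remaining values are numeric.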
Cannot detect format
Symptom: FormatError on auto-detection
Cause: Unusual file format or delimiter
Solutions:
- Specify format explicitly: format="csv"
- Check file extension matches actual content
- Download sample manually to inspect
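Once you have a manually downloaded sample, Python's standard `csv.Sniffer` can suggest which delimiter the file uses. The sketch below is a local check that is independent of the preview tools; `detect_delimiter` is a hypothetical helper and the semicolon-separated sample is made up.

```python
import csv


def detect_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter of a CSV text sample.

    Raises csv.Error if no candidate delimiter fits the sample.
    """
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# Made-up sample using semicolons as separators
sample = (
    "jahr;bezirk;faelle\n"
    "2024;Innere Stadt;150\n"
    "2024;Leopoldstadt;120\n"
)
delimiter = detect_delimiter(sample)
print(repr(delimiter))
```

Sniffing is heuristic, so it works best on a sample of several complete rows; a result other than a comma suggests the file is delimited text in a non-default dialect.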
Partial fetch warning
Symptom: Response shows partial_fetch: true
Cause: File larger than max_bytes limit
Solutions:
- This is normal - preview uses partial fetch by design
- Increase max_bytes if you need more sample data
- Schema should still be accurate from partial data
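One reason a schema can stay accurate under a partial fetch is that a byte limit usually cuts the final row in half, and a parser can discard that fragment and work only with complete lines. The sketch below illustrates the idea for newline-delimited text; it is an assumption about how such a preview could work, not the tool's implementation, and `complete_lines` is a hypothetical helper.

```python
from typing import List


def complete_lines(chunk: bytes, encoding: str = "utf-8") -> List[str]:
    """Decode a truncated byte chunk and keep only fully received lines."""
    cut = chunk.rfind(b"\n")
    if cut == -1:
        return []  # not even one complete line arrived
    # Everything after the last newline is a partial row and is dropped
    text = chunk[: cut + 1].decode(encoding, errors="replace")
    return text.splitlines()


# Made-up chunk whose last row was cut off by the byte limit
chunk = b"jahr,bezirk,faelle\n2024,Innere Stadt,150\n2024,Leopol"
lines = complete_lines(chunk)
print(lines)
```

Cutting at the last newline byte is also safe for UTF-8, because that byte never occurs inside a multi-byte character.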
Next steps
- Quality Metrics Guide - Assess data quality
- Searching Guide - Find datasets to preview
- API Reference - Complete tool documentation