Previewing Data Structures

Verify a dataset's structure, column types, and content before committing to a full download.

When to use this guide

Use this guide when you need to:

Verify required columns exist
Check data types before processing
Estimate data quality from samples
Avoid downloading incompatible formats

Prerequisites

Dataset ID or download URL
Understanding of your data requirements (required columns, types)

Quick schema preview

Ask Claude to preview

"Show me the schema for dataset [dataset-id]"

Claude fetches the dataset's distributions, selects a downloadable format, and previews the schema automatically.

What you'll see:

Column names
Inferred data types
Sample values
Row count estimate
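
The format-selection step described above can be sketched roughly as follows. This is a minimal illustration, not the actual selection logic: the helper name pick_download_url and the preference order are hypothetical, and the only assumption carried over from this guide is that each distribution is a dict with 'format' and 'downloadURL' keys.

```python
from typing import Optional

# Hypothetical preference order; the real selection rules are not documented here.
PREFERRED_FORMATS = ("CSV", "JSON")


def pick_download_url(distributions: list[dict],
                      preferred: tuple[str, ...] = PREFERRED_FORMATS) -> Optional[str]:
    """Return the URL of the first distribution matching a preferred format."""
    for fmt in preferred:
        for dist in distributions:
            # Compare case-insensitively, since catalogs may label formats as "csv" or "CSV".
            if str(dist.get("format", "")).upper() == fmt and dist.get("downloadURL"):
                return dist["downloadURL"]
    return None


# Made-up distribution records, for illustration only.
example_distributions = [
    {"format": "PDF", "downloadURL": "https://example.org/report.pdf"},
    {"format": "csv", "downloadURL": "https://example.org/data.csv"},
]
print(pick_download_url(example_distributions))
```

Returning None instead of raising lets the caller decide whether a missing previewable format is an error or simply a reason to skip the dataset.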

Direct schema preview

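# First, get the download URL of the CSV distribution
distributions = get_dataset_distributions(dataset_id="bev-stat-wien-2024")
csv_url = next((d['downloadURL'] for d in distributions
                if d.get('format') == 'CSV'), None)
if csv_url is None:
    # Without a default, next() would raise StopIteration here
    raise ValueError("No CSV distribution found for this dataset")

# Preview schema
schema = preview_schema(url=csv_url, format="csv")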

Returns:

{
  "url": "https://data.wien.gv.at/.../data.csv",
  "format": "csv",
  "partial_fetch": true,
  "bytes_fetched": 65536,
  "columns": [
    {
      "name": "jahr",
      "type": "integer",
      "sample_values": [2022, 2023, 2024]
    },
    {
      "name": "bezirk",
      "type": "string",
      "sample_values": ["Innere Stadt", "Leopoldstadt"]
    }
  ]
}
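
A response of this shape can be condensed into a name-to-type mapping for quick inspection. The sketch below assumes only the fields shown in the example response; the helper name summarize_schema is hypothetical, and the sample dict is copied from the response above with the URL left elided as in the source.

```python
def summarize_schema(schema: dict) -> dict:
    """Map each column name to its inferred type."""
    return {col["name"]: col["type"] for col in schema.get("columns", [])}


# Sample response copied from the example above.
sample_schema = {
    "url": "https://data.wien.gv.at/.../data.csv",
    "format": "csv",
    "partial_fetch": True,
    "bytes_fetched": 65536,
    "columns": [
        {"name": "jahr", "type": "integer", "sample_values": [2022, 2023, 2024]},
        {"name": "bezirk", "type": "string",
         "sample_values": ["Innere Stadt", "Leopoldstadt"]},
    ],
}

column_types = summarize_schema(sample_schema)
print(column_types)

# When partial_fetch is true, types were inferred from the fetched bytes only,
# so later rows in the full file could still contain other value types.
if sample_schema.get("partial_fetch"):
    print(f"Types inferred from the first {sample_schema['bytes_fetched']} bytes")
```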

Error handling:

NetworkError:

{"error": "NetworkError", "message": "Failed to fetch URL"}

Solution: The URL may be stale. Fetch fresh distributions and retry with the new URL.

FormatError:

{"error": "FormatError", "message": "Could not detect delimiter"}

Solution: Specify the format explicitly, for example format="csv".
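
The two error payloads above can be handled in one place. The following sketch assumes that errors come back as a dict with "error" and "message" keys, as in the examples shown; whether the tool returns such a dict or raises instead is not stated here, so treat that as an assumption. PreviewError, raise_for_error, and suggest_fix are hypothetical names introduced for illustration.

```python
class PreviewError(Exception):
    """Raised when a preview response carries an error payload."""

    def __init__(self, kind: str, message: str):
        super().__init__(f"{kind}: {message}")
        self.kind = kind


# Next steps taken from the solutions listed above.
NEXT_STEPS = {
    "NetworkError": "Fetch fresh distributions and retry; the URL may be stale.",
    "FormatError": 'Specify the format explicitly, for example format="csv".',
}


def raise_for_error(response: dict) -> dict:
    """Return the response unchanged, or raise if it is an error payload."""
    if "error" in response:
        raise PreviewError(response["error"], response.get("message", ""))
    return response


def suggest_fix(kind: str) -> str:
    """Look up the documented next step for an error kind."""
    return NEXT_STEPS.get(kind, "No documented fix for this error.")


# Error payload copied from the FormatError example above.
try:
    raise_for_error({"error": "FormatError", "message": "Could not detect delimiter"})
except PreviewError as exc:
    print(exc, "->", suggest_fix(exc.kind))
```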

Validating required columns

Ask Claude to verify columns

"Does dataset [id] have columns X, Y, Z?"

Claude fetches the schema and checks it for the required columns.

Programmatic column validation

# Get schema
schema = preview_schema(url=csv_url)

# Define requirements
required_columns = ["jahr", "bezirk", "faelle"]
actual_columns = [c['name'] for c in schema['columns']]

# Check presence
missing = set(required_columns) - set(actual_columns)
if missing:
    print(f"Missing columns: {missing}")
else:
    print("All required columns present")

# Verify types
jahr_col = next(c for c in schema['columns'] if c['name'] == 'jahr')
if jahr_col['type'] != 'integer':
    print("Warning: Jahr column not integer type")

Error handling:

Column not found:

try:
    jahr_col = next(c for c in schema['columns'] if c['name'] == 'jahr')
except StopIteration:
    print("Jahr column not found in schema")
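
The presence and type checks above can be combined into one reusable function that reports every problem at once instead of stopping at the first. This is a sketch under the assumption that the schema has the "columns" list shown earlier; validate_schema is a hypothetical helper, and the expected type of faelle is assumed for illustration.

```python
def validate_schema(schema: dict, expected: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the schema fits."""
    actual = {col["name"]: col.get("type") for col in schema.get("columns", [])}
    problems = []
    for name, wanted_type in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != wanted_type:
            problems.append(f"{name}: expected {wanted_type}, got {actual[name]}")
    return problems


# Columns taken from the schema example earlier in this guide.
example_schema = {
    "columns": [
        {"name": "jahr", "type": "integer"},
        {"name": "bezirk", "type": "string"},
    ]
}
# The type expected for 'faelle' is an assumption for this example.
requirements = {"jahr": "integer", "bezirk": "string", "faelle": "integer"}

problems = validate_schema(example_schema, requirements)
print(problems or "Schema matches requirements")
```

Because lookups go through a dict rather than next(), a missing column is reported as a problem and no StopIteration handling is needed.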

Previewing data samples

Ask for sample rows

"Show me sample rows from dataset [id]"

Claude fetches the first 10-20 rows to show the data structure.

Direct data preview

data = preview_data(url=csv_url, format="csv", max_rows=10)

Returns:

{
  "url": "...",
  "format": "csv",
  "rows": [
    {"jahr": 2024, "bezirk": "Wien", "faelle": 150},
    {"jahr": 2024, "bezirk": "Graz", "faelle": 120}
  ],
  "row_count": 10,
  "estimated_total_rows": 1000
}
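
The row_count and estimated_total_rows fields indicate how much of the dataset a preview covers, which matters when judging quality from a sample. A minimal sketch, assuming both fields are present as in the response above (sample_fraction is a hypothetical helper):

```python
from typing import Optional


def sample_fraction(preview: dict) -> Optional[float]:
    """Return the share of the dataset covered by the preview, or None if unknown."""
    shown = preview.get("row_count")
    total = preview.get("estimated_total_rows")
    if not shown or not total:
        return None
    # The total is only an estimate, so cap the fraction at 100%.
    return min(shown / total, 1.0)


# Figures copied from the example response above.
example_preview = {"row_count": 10, "estimated_total_rows": 1000}
fraction = sample_fraction(example_preview)
print(f"Preview covers about {fraction:.1%} of the estimated rows")
```

A small fraction means that conclusions drawn from the sample, such as null rates or value ranges, may not hold for the full file.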

Error handling:

ParseError:

{"error": "ParseError", "message": "Invalid CSV at row 5"}

Solution: The file may be corrupted or malformed. Check the reported row, or try another distribution of the dataset if one is available.

Checking data quality from samples

Quality assessment questions

Ask Claude to check data quality:

"Are there null values in the dataset?"
"What's the date range of this data?"
"How many unique regions are in the sample?"

Claude analyzes the preview data to answer these questions.

Programmatic quality checks

preview = preview_data(url=csv_url, max_rows=50)

# Check for null values
for row in preview['rows']:
    for key, value in row.items():
        if value is None or value == "":
            print(f"Null value in column: {key}")

# Check year range (skip rows where 'jahr' is missing or null)
years = [row['jahr'] for row in preview['rows'] if row.get('jahr') is not None]
if years:
    print(f"Year range: {min(years)} to {max(years)}")

# Check unique values
bezirke = {row['bezirk'] for row in preview['rows'] if 'bezirk' in row}
print(f"Unique districts in sample: {len(bezirke)}")

Common issues:

Sparse data:

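# Compare null or empty cells against the total number of cells, not rows
cells = [v for row in preview['rows'] for v in row.values()]
null_count = sum(1 for v in cells if v is None or v == "")
if cells and null_count > len(cells) * 0.5:
    print("Warning: >50% null values in sample")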

Inconsistent types:

# Check if numeric column has non-numeric values (nulls are covered by the null check)
faelle = [row.get('faelle') for row in preview['rows']]
non_numeric = [v for v in faelle
               if v is not None and not isinstance(v, (int, float))]
if non_numeric:
    print(f"Warning: Non-numeric values in faelle: {non_numeric}")
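
The individual checks above can be folded into a single per-column report. The sketch below is one possible way to do this, assuming preview rows are dicts keyed by column name as in the examples; column_report is a hypothetical helper and the sample rows are made up for illustration.

```python
from collections import Counter


def column_report(rows: list[dict]) -> dict[str, dict]:
    """Summarize null rate and observed Python value types per column."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as null, matching the null check above.
        present = [v for v in values if v is not None and v != ""]
        report[col] = {
            "null_rate": (len(values) - len(present)) / len(values),
            "types": dict(Counter(type(v).__name__ for v in present)),
        }
    return report


# Made-up sample rows, for illustration only.
sample_rows = [
    {"jahr": 2024, "bezirk": "Innere Stadt", "faelle": 150},
    {"jahr": 2024, "bezirk": "Leopoldstadt", "faelle": "n/a"},
    {"jahr": 2023, "bezirk": "", "faelle": None},
    {"jahr": 2023, "bezirk": "Landstraße", "faelle": 90},
]

report = column_report(sample_rows)
for name, stats in report.items():
    print(name, stats)
```

A column whose "types" entry lists more than one type, such as faelle here, is a candidate for the inconsistent-types warning, and the null rate is reported per column rather than for the sample as a whole.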