# Preview Examples

Data preview and schema introspection patterns: learn how to inspect dataset contents before downloading.
## Schema Introspection

Check the structure of a dataset without downloading the data:

```python
# Preview CSV schema
preview_schema(
    url="https://www.data.gv.at/katalog/dataset/example.csv",
    format="CSV"
)
```

Returns:

- Column names
- Inferred data types (string, integer, float, boolean, date)
- Type inference is based on 10 sample rows

Supported formats: CSV, JSON
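The examples in this guide index the result as `schema['columns']` with `name` and `type` keys; the concrete return shape shown here is an assumption inferred from that usage, not a documented contract:

```python
# Hypothetical preview_schema result for a three-column CSV
schema = {
    "columns": [
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
        {"name": "value", "type": "float"},
    ]
}

# Iterate over columns the same way the validation examples do
for column in schema["columns"]:
    print(f"  - {column['name']}: {column['type']}")
```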
Comprehensive schema analysis with validation:

```python
# Get schema first
schema = preview_schema(
    url="https://www.data.gv.at/katalog/dataset/example.csv",
    format="CSV"
)

# Validate structure
print(f"Found {len(schema['columns'])} columns")
for column in schema['columns']:
    print(f"  - {column['name']}: {column['type']}")

# Check for expected columns
expected_cols = ["id", "name", "value"]
actual_cols = [c['name'] for c in schema['columns']]
if all(col in actual_cols for col in expected_cols):
    print("Success: Schema validated - proceed with data preview")
else:
    missing = set(expected_cols) - set(actual_cols)
    print(f"Error: Missing columns: {missing}")
```

Advanced validation:

- Verify column presence
- Check data types match expectations
- Identify potential data quality issues
## Data Preview

View sample rows from a dataset:

```python
# Preview first 20 rows
preview_data(
    url="https://www.data.gv.at/katalog/dataset/example.csv",
    max_rows=20,
    format="CSV"
)
```

Configuration:

- `max_rows` - Number of rows to preview (default: 100, max: 1000)
- `format` - File format (CSV or JSON, auto-detected from URL if not specified)

Returns:

- Sample data rows
- Column names and types
- Row count in preview
- Estimated total rows (if truncated)
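As with the schema tool, the concrete result shape below is an assumption inferred from how later examples use `data`, `row_count`, and the optional `estimated_total_rows` key:

```python
# Hypothetical preview_data result for a truncated CSV
preview = {
    "data": [
        {"id": "1", "name": "Vienna", "value": "100"},
        {"id": "2", "name": "Graz", "value": "200"},
    ],
    "row_count": 2,
    "estimated_total_rows": 5000,  # present only if the preview was truncated
}

print(f"Preview contains {preview['row_count']} rows")
```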
Complete data inspection workflow:

```python
# Get detailed preview
preview = preview_data(
    url="https://www.data.gv.at/katalog/dataset/example.csv",
    max_rows=100,
    format="CSV"
)

# Analyze preview results
print(f"Preview contains {preview['row_count']} rows")
if 'estimated_total_rows' in preview:
    print(f"Estimated total: {preview['estimated_total_rows']} rows")
    print("File is truncated - showing a sample")

# Examine data quality
for i, row in enumerate(preview['data'][:10], 1):
    print(f"Row {i}: {row}")

# Check for null values
for row in preview['data']:
    null_fields = [k for k, v in row.items() if v is None or v == '']
    if null_fields:
        print(f"Warning: Null values in {null_fields}")
```

Data quality checks:

- Null value detection
- Format consistency
- Data range validation
- Encoding verification
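Encoding verification is the one check the example above doesn't show. A simple heuristic is to scan for the Unicode replacement character that appears when bytes were decoded with the wrong codec (the sample row below is fabricated for illustration):

```python
REPLACEMENT_CHAR = "\ufffd"  # produced when decoding with the wrong codec

row = {"Bezirk": "D\ufffdbling", "Einwohner": "75000"}  # garbled "Döbling"

# Collect fields containing replacement characters
bad_fields = [
    k for k, v in row.items()
    if isinstance(v, str) and REPLACEMENT_CHAR in v
]
if bad_fields:
    print(f"Warning: Possible encoding problem in {bad_fields}")
```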
## CSV Handling

### Different Delimiters

The preview tool automatically detects CSV delimiters:

```python
# Comma-separated (auto-detected)
preview_data(
    url="https://example.com/data.csv",
    format="CSV"
)

# Semicolon-separated (auto-detected, common in European datasets)
preview_data(
    url="https://example.com/european-data.csv",
    format="CSV"
)

# Tab-separated (auto-detected)
preview_data(
    url="https://example.com/data.tsv",
    format="CSV"
)
```

Uses Python's `csv.Sniffer` for automatic delimiter detection.
Verify delimiter detection and handle edge cases:

```python
# Preview with delimiter detection
preview = preview_data(
    url="https://example.com/data.csv",
    format="CSV"
)

# Verify the structure looks correct
if len(preview['data']) > 0:
    first_row = preview['data'][0]
    num_columns = len(first_row.keys())
    print(f"Detected {num_columns} columns")

    # Check if the column count looks reasonable
    if num_columns < 2:
        print("Warning: Only one column detected")
        print("  Delimiter might not be detected correctly")
    else:
        print("Delimiter detection successful")

# For problematic files, inspect the raw data
# (requires a direct HTTP download for manual parsing)
```

Common delimiter issues:

- Mixed delimiters in one file
- Delimiters inside quoted values
- Non-standard delimiters (pipe, etc.)
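When detection looks wrong, the same `csv.Sniffer` the tool relies on can be run by hand against a raw sample of the file (the sample text here is made up for illustration):

```python
import csv

sample = "Jahr;Bezirk;Einwohner\n2023;Innere Stadt;16000\n2023;Leopoldstadt;105000\n"

# Sniff the delimiter from the raw text, restricted to likely candidates
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t|")
print(f"Detected delimiter: {dialect.delimiter!r}")

# Parse with the detected dialect
rows = list(csv.reader(sample.splitlines(), dialect))
print(rows[0])
```

Restricting `delimiters` to a known candidate set makes the sniffer much more reliable than letting it guess freely.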
### Large CSV Files

Preview handles large files efficiently:

```python
# Default: 64KB preview (enough for ~1000 rows)
preview_data(
    url="https://example.com/large-dataset.csv",
    max_rows=100,
    format="CSV"
)
```

Performance characteristics:

- Default chunk size: 64KB
- Maximum chunk size: 512KB
- Only downloads what's needed for the preview
- Estimated total rows calculated from the sample
Optimize the preview for different file sizes:

```python
# Small preview for a quick check (20 rows)
quick_preview = preview_data(
    url="https://example.com/large-dataset.csv",
    max_rows=20,
    format="CSV"
)

# Medium preview for validation (100 rows)
medium_preview = preview_data(
    url="https://example.com/large-dataset.csv",
    max_rows=100,
    format="CSV"
)

# Maximum preview for complete analysis (1000 rows)
full_preview = preview_data(
    url="https://example.com/large-dataset.csv",
    max_rows=1000,
    format="CSV"
)

# Use the estimated total to calculate sampling coverage
if 'estimated_total_rows' in quick_preview:
    total = quick_preview['estimated_total_rows']
    sample_pct = (20 / total) * 100
    print(f"Previewing {sample_pct:.2f}% of data")
```

Optimization strategy:

- Use 20 rows for a quick structure check
- Use 100 rows for data quality validation
- Use 1000 rows for statistical sampling
- Check `estimated_total_rows` to understand coverage
## JSON Handling

### Flat JSON Arrays

Preview JSON arrays directly:

```python
preview_data(
    url="https://example.com/data.json",
    format="JSON"
)
```

Example input:

```json
[
    {"id": 1, "name": "Vienna", "population": 1900000},
    {"id": 2, "name": "Graz", "population": 290000}
]
```

Analyze JSON structure and data:

```python
# Get preview
preview = preview_data(
    url="https://example.com/data.json",
    format="JSON"
)

# Analyze array structure
if preview['data']:
    first_obj = preview['data'][0]
    fields = first_obj.keys()
    print(f"JSON objects have {len(fields)} fields:")
    for field in fields:
        print(f"  - {field}")

    # Check field consistency across objects
    inconsistent = []
    for i, obj in enumerate(preview['data']):
        if set(obj.keys()) != set(fields):
            inconsistent.append(i)

    if inconsistent:
        print(f"Warning: Objects with different fields: {inconsistent}")
    else:
        print("Success: Consistent structure across all objects")
```

### Nested JSON Objects
The tool automatically detects nested data arrays:

```python
preview_data(
    url="https://example.com/api-response.json",
    format="JSON"
)
```

Example input with nested data:

```json
{
    "status": "success",
    "data": [
        {"id": 1, "value": 100},
        {"id": 2, "value": 200}
    ]
}
```

Common nested keys automatically detected: `data`, `results`, `items`, `records`
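A minimal sketch of that unwrapping, assuming the tool simply probes the common envelope keys in order (`extract_records` is a hypothetical helper, not part of the tool's API):

```python
COMMON_NESTED_KEYS = ("data", "results", "items", "records")

def extract_records(payload):
    """Return the list of records, unwrapping a common envelope key
    when the top level is an object instead of an array."""
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        for key in COMMON_NESTED_KEYS:
            if isinstance(payload.get(key), list):
                return payload[key]
    return []

envelope = {"status": "success", "data": [{"id": 1, "value": 100}]}
print(extract_records(envelope))  # → [{'id': 1, 'value': 100}]
```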
### Truncated JSON Recovery

If the JSON is truncated, the tool attempts recovery:

```python
# Handles partially downloaded JSON
preview_data(
    url="https://example.com/large-data.json",
    max_rows=50,
    format="JSON"
)
```

Recovery strategy:

- Find the last complete object in the truncated data
- Parse the successfully completed objects
- Return partial but valid results
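The strategy above can be sketched roughly as follows, assuming a truncated top-level array of flat objects (the real implementation may be more robust; a bare `rfind` is not enough for nested objects):

```python
import json

def recover_truncated_array(text):
    """Parse as many complete objects as possible from a truncated
    JSON array by cutting at the last complete object boundary."""
    end = text.rfind("}")  # last complete flat object
    if end == -1:
        return []
    return json.loads(text[:end + 1] + "]")

truncated = '[{"id": 1, "value": 100}, {"id": 2, "value": 200}, {"id": 3, "va'
print(recover_truncated_array(truncated))
```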
Understand and verify recovery results:

```python
# Preview potentially truncated JSON
preview = preview_data(
    url="https://example.com/large-data.json",
    max_rows=100,
    format="JSON"
)

# Check if truncation occurred
if 'estimated_total_rows' in preview:
    requested = 100
    received = preview['row_count']
    if received < requested:
        print("Warning: Truncation detected:")
        print(f"  Requested: {requested} rows")
        print(f"  Received: {received} rows")
        print(f"  Estimated total: {preview['estimated_total_rows']}")

        # Calculate completeness
        completeness = (received / requested) * 100
        print(f"  Preview completeness: {completeness:.1f}%")
else:
    print("Success: Full preview retrieved")

# Verify recovered data is valid
if preview['data']:
    for i, obj in enumerate(preview['data']):
        if not isinstance(obj, dict):
            print(f"Warning: Object {i} is not a valid dict")
```

## Complete Preview Workflow
Fast dataset preview for exploration:

### Find Dataset

```python
# Search for datasets
results = search_datasets(
    query="Bevölkerung Wien Bezirk",
    formats=["CSV"],
    limit=5
)

# Pick a dataset
dataset_id = results["results"][0]["id"]
```

### Get Download URL

```python
# Get distributions
distributions = get_dataset_distributions(dataset_id=dataset_id)

# Find a CSV distribution
csv_dist = next(
    (d for d in distributions if d.get("format", {}).get("id") == "CSV"),
    None
)
download_url = csv_dist.get("access_url") if csv_dist else None
```

### Preview Data

```python
# Quick preview
preview = preview_data(
    url=download_url,
    max_rows=20,
    format="CSV"
)

# Display a sample
for row in preview["data"][:5]:
    print(row)
```

Complete validation workflow for production use:
### Search and Select

```python
results = search_datasets(
    query="Bevölkerung Wien Bezirk",
    formats=["CSV"],
    boost_quality=True,
    limit=5
)
dataset_id = results["results"][0]["id"]

dataset = get_dataset(dataset_id=dataset_id)
print(f"Selected: {dataset['title']['de']}")
print(f"Modified: {dataset['modified']}")
```

### Validate Schema

```python
distributions = get_dataset_distributions(dataset_id=dataset_id)
csv_dist = next(
    (d for d in distributions if d.get("format", {}).get("id") == "CSV"),
    None
)
download_url = csv_dist.get("access_url")

# Check the schema first
schema = preview_schema(url=download_url, format="CSV")
print(f"Schema has {len(schema['columns'])} columns:")
for col in schema['columns']:
    print(f"  - {col['name']}: {col['type']}")

# Validate expected columns
expected = ["Jahr", "Bezirk", "Einwohner"]
actual = [c['name'] for c in schema['columns']]
schema_valid = all(col in actual for col in expected)
```

### Preview and Verify

```python
if schema_valid:
    # Get a data sample
    preview = preview_data(
        url=download_url,
        max_rows=100,
        format="CSV"
    )
    print(f"Preview: {preview['row_count']} rows")

    # Data quality checks
    null_count = 0
    for row in preview['data']:
        for value in row.values():
            if value is None or value == '':
                null_count += 1

    null_pct = (null_count / (preview['row_count'] * len(schema['columns']))) * 100
    print(f"Null values: {null_pct:.1f}%")

    if null_pct < 5:
        print("Success: Data quality acceptable - proceed with download")
    else:
        print("Warning: High null percentage - review data quality")
```

## Error Handling
Handle common preview errors:

```python
# Invalid URL
try:
    preview_data(url="not-a-valid-url", format="CSV")
except ToolError as e:
    print(f"Error: {e}")

# Unsupported format
try:
    preview_data(
        url="https://example.com/data.pdf",
        format="PDF"
    )
except ToolError as e:
    print(f"Error: {e}")  # "Unsupported format: PDF"

# Network issues
try:
    preview_data(
        url="https://example.com/nonexistent.csv",
        format="CSV"
    )
except ToolError as e:
    print(f"Error: {e}")  # Connection error with guidance
```

Comprehensive error handling and recovery:
```python
def safe_preview(url, max_rows=20, format=None):
    """Preview with complete error handling."""
    try:
        # Attempt preview
        preview = preview_data(
            url=url,
            max_rows=max_rows,
            format=format
        )

        # Verify we got data
        if not preview.get('data'):
            return {
                'success': False,
                'error': 'No data returned',
                'suggestion': 'URL may be empty or an invalid format'
            }

        # Check data quality
        if preview['row_count'] == 0:
            return {
                'success': False,
                'error': 'Empty dataset',
                'suggestion': 'File contains no data rows'
            }

        return {
            'success': True,
            'preview': preview,
            'rows': preview['row_count']
        }

    except ToolError as e:
        error_msg = str(e).lower()

        # Categorize errors
        if 'unsupported format' in error_msg:
            return {
                'success': False,
                'error': 'Unsupported format',
                'suggestion': 'Only CSV and JSON are supported'
            }
        elif 'invalid url' in error_msg:
            return {
                'success': False,
                'error': 'Invalid URL',
                'suggestion': 'URL must start with http:// or https://'
            }
        elif 'connection' in error_msg or 'network' in error_msg:
            return {
                'success': False,
                'error': 'Network error',
                'suggestion': 'Check URL accessibility and network connection'
            }
        else:
            return {
                'success': False,
                'error': str(e),
                'suggestion': 'See error message for details'
            }

# Usage
result = safe_preview(url="https://example.com/data.csv")
if result['success']:
    print(f"Success: Preview successful: {result['rows']} rows")
else:
    print(f"Error: Preview failed: {result['error']}")
    print(f"  Suggestion: {result['suggestion']}")
```

## Format Auto-detection
Let the tool detect the format from the URL extension:

```python
# Format inferred from .csv extension
preview_data(url="https://example.com/data.csv")

# Format inferred from .json extension
preview_data(url="https://example.com/data.json")
```

Override auto-detection when needed:

```python
# File has a .txt extension but is actually CSV
preview_data(
    url="https://example.com/data.txt",
    format="CSV"
)

# API endpoint returns CSV without an extension
preview_data(
    url="https://api.example.com/export?format=csv",
    format="CSV"
)

# Verify auto-detection worked
url = "https://example.com/data.unknown"
try:
    # Try without a format
    preview = preview_data(url=url)
except ToolError:
    # Auto-detection failed; try explicit formats
    for fmt in ["CSV", "JSON"]:
        try:
            preview = preview_data(url=url, format=fmt)
            print(f"Success: Detected as {fmt}")
            break
        except ToolError:
            continue
```

## Best Practices

### Preview Before Download
Always preview the schema and sample data before downloading large files:

```python
# Check the schema first
schema = preview_schema(url=download_url, format="CSV")

# Verify structure
if "expected_column" in [c["name"] for c in schema["columns"]]:
    # Structure looks good; get sample data
    preview_data(url=download_url, max_rows=20)
```

Complete validation before committing to a download:
```python
def validate_before_download(url, expected_columns, format="CSV"):
    """Comprehensive pre-download validation."""
    # Step 1: Schema validation
    schema = preview_schema(url=url, format=format)
    actual_cols = [c['name'] for c in schema['columns']]
    missing = set(expected_columns) - set(actual_cols)
    if missing:
        return {
            'valid': False,
            'reason': f'Missing columns: {missing}'
        }

    # Step 2: Data type validation
    for col in schema['columns']:
        if col['name'] in expected_columns:
            # Check the type is appropriate
            if col['type'] == 'string' and col['name'].endswith('_id'):
                print(f"Warning: {col['name']} is string, expected number?")

    # Step 3: Data quality check
    preview = preview_data(url=url, max_rows=100, format=format)
    if preview['row_count'] == 0:
        return {
            'valid': False,
            'reason': 'No data rows'
        }

    # Check the null percentage
    total_cells = preview['row_count'] * len(schema['columns'])
    null_cells = sum(
        1 for row in preview['data']
        for value in row.values()
        if value is None or value == ''
    )
    null_pct = (null_cells / total_cells) * 100

    return {
        'valid': True,
        'schema': schema,
        'sample_rows': preview['row_count'],
        'null_percentage': null_pct,
        'download_recommended': null_pct < 10
    }

# Usage
result = validate_before_download(
    url=download_url,
    expected_columns=["id", "name", "value"]
)
if result['valid'] and result['download_recommended']:
    print("Success: Validation passed - safe to download")
```

### Use Appropriate Sample Size
Balance preview detail with performance:

```python
# Quick peek (20 rows) - for initial exploration
preview_data(url=download_url, max_rows=20)

# Detailed inspection (100 rows) - for validation
preview_data(url=download_url, max_rows=100)

# Maximum sample (1000 rows) - for statistical analysis
preview_data(url=download_url, max_rows=1000)
```

### Validate Data Types
Check that inferred types match expectations:

```python
schema = preview_schema(url=download_url, format="CSV")
for column in schema["columns"]:
    if column["name"] == "population":
        assert column["type"] == "integer", "Population should be integer"
    if column["name"] == "date":
        assert column["type"] == "date", "Date should be date type"
```

### Handle Multi-language Data
Austrian datasets often contain German text:

```python
# Preview to check language and encoding
preview = preview_data(url=download_url, max_rows=10, format="CSV")

# Verify special characters render correctly
for row in preview["data"]:
    # Check for umlauts: ä, ö, ü, ß
    for value in row.values():
        if isinstance(value, str) and any(c in value for c in 'äöüßÄÖÜ'):
            print(f"Success: German characters detected: {value}")
```

## Performance Tips
Minimize downloads:

- Use `preview_schema` for structure-only checks (faster)
- Use a small `max_rows` for quick verification
- Preview only when necessary

Estimated download sizes:

- CSV: ~500 bytes per row (varies by columns)
- JSON: ~200 bytes per object (varies by structure)
- Default 64KB chunk: ~128 CSV rows or ~320 JSON objects
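The chunk figures can be sanity-checked with a quick calculation using the per-row averages quoted above:

```python
CHUNK_BYTES = 64 * 1024        # default preview chunk size
CSV_BYTES_PER_ROW = 500        # rough average, varies by columns
JSON_BYTES_PER_OBJECT = 200    # rough average, varies by structure

csv_rows = CHUNK_BYTES // CSV_BYTES_PER_ROW
json_objects = CHUNK_BYTES // JSON_BYTES_PER_OBJECT
print(f"~{csv_rows} CSV rows or ~{json_objects} JSON objects per 64KB chunk")
```

The exact quotients (131 and 327) land slightly above the conservatively rounded ~128 and ~320 figures in the list.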
## Next Steps
- Workflows - Complete end-to-end scenarios
- Search Examples - Finding datasets to preview
- API Reference - Full preview tool documentation