Assessing Dataset Quality
Evaluate dataset quality and discover related datasets
Determine dataset reliability through quality scores and discover related datasets for comparison.
When to use this guide
Use this guide when you need to:
- Verify data quality before using a dataset in research or production
- Understand what DQV quality scores mean
- Find similar or related datasets
- Compare quality across dataset options
Prerequisites
- Dataset ID from search results
- Understanding of your quality requirements (research vs exploration)
Understanding quality scores
Quality scores (0-100) measure metadata completeness, timeliness, standards compliance, and accessibility.
Score ranges:
- 85-100: Excellent (research, citations)
- 70-84: Good (production applications)
- 50-69: Acceptable (exploratory analysis)
- <50: Poor (verify carefully)
Score components:
- Completeness (40 points): Title, description, license, contact, keywords, themes, spatial, temporal coverage
- Timeliness (20 points): Recent updates, modification date
- Compliance (20 points): DCAT-AP standard compliance
- Accessibility (20 points): Valid download links, open formats
Quick quality check
Ask Claude for quality assessment
"What's the quality score for dataset [dataset-id]?"
Claude runs analyze_dataset_quality and explains results.
What you'll learn:
- Overall quality score (0-100)
- What metadata is present or missing
- Whether quality meets your needs
Direct quality analysis
quality = analyze_dataset_quality(dataset_id="bev-stat-wien-2024")
Parameters:
- dataset_id: ID of the dataset to analyze, as returned in search results
Returns:
{
"dataset_id": "bev-stat-wien-2024",
"metadata": {
"has_title": true,
"has_description": true,
"has_license": true,
"has_contact": false,
"has_keywords": true,
"has_themes": true,
"has_spatial": true,
"has_temporal": false,
"completeness_score": 75
},
"metrics": {
"overall_score": 82,
"completeness": 75,
"timeliness": 12,
"compliance": 18,
"accessibility": 18
},
"degraded": false
}
Quality decision matrix:
| Score | Use for | Verification |
|---|---|---|
| 85-100 | Research, citations | Proceed with confidence |
| 70-84 | Production apps | Verify critical fields present |
| 50-69 | Exploration | Check schema, preview data |
| <50 | Caution | Manual verification required |
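The decision matrix above can be expressed as a small lookup. This is a minimal sketch, not part of the tool API: the function name `quality_decision` is hypothetical, and it assumes the score is the `overall_score` value from the quality response.

```python
def quality_decision(score: float) -> tuple[str, str]:
    """Map an overall quality score (0-100) to (recommended use, verification step)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 85:
        return ("Research, citations", "Proceed with confidence")
    if score >= 70:
        return ("Production apps", "Verify critical fields present")
    if score >= 50:
        return ("Exploration", "Check schema, preview data")
    return ("Caution", "Manual verification required")


# Example: the overall_score of 82 from the sample response above
use, verification = quality_decision(82)
print(f"{use}: {verification}")
```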
Error handling:
Degraded service:
if quality.get('degraded'):
    print("Quality service degraded - using cached metrics")
    # Verify critical fields manually
Dataset not found:
{"error": "NotFoundError", "message": "Dataset not found"}
Solution: The dataset ID may be incorrect or stale; confirm it against current search results.
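If your client hands back the error payload as a plain dictionary rather than raising an exception (an assumption; your client may behave differently), a small guard can turn it into a Python exception before the rest of your code touches the response. The helper name `check_response` is hypothetical.

```python
def check_response(response: dict) -> dict:
    """Return the response unchanged, or raise if it is an error payload.

    Assumes errors arrive as {"error": <type>, "message": <text>},
    matching the "Dataset not found" example above.
    """
    if "error" in response:
        detail = response.get("message", "no message provided")
        if response["error"] == "NotFoundError":
            raise LookupError(f"NotFoundError: {detail}")
        raise RuntimeError(f"{response['error']}: {detail}")
    return response


try:
    check_response({"error": "NotFoundError", "message": "Dataset not found"})
except LookupError as exc:
    print(f"Check the dataset ID: {exc}")
```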
Quality-aware search
Search with quality preference
"Find high-quality datasets about health"
Claude enables boost_quality automatically when quality is mentioned.
Quality boost in search
search_datasets(
    query="gesundheit",
    boost_quality=True
)
What happens:
- Datasets with score >80 get 2x relevance boost
- Datasets with score 60-80 get 1.5x relevance boost
- Results sorted by boosted relevance
Note: Quality boost is only active when a query is provided (not on facet-only searches).
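To make the boost rules concrete, here is a rough sketch of how such a re-ranking could work. It is an illustration only, not the search service's implementation: the `relevance` and `quality_score` field names are hypothetical, and datasets scoring below 60 are assumed to keep their original relevance.

```python
def boost_factor(quality_score: float) -> float:
    """Relevance multiplier for a given quality score, per the rules above."""
    if quality_score > 80:
        return 2.0
    if quality_score >= 60:
        return 1.5
    return 1.0  # assumption: no boost below 60


def rank_with_quality_boost(results: list[dict]) -> list[dict]:
    """Sort results by relevance multiplied by the quality boost, highest first."""
    return sorted(
        results,
        key=lambda r: r["relevance"] * boost_factor(r["quality_score"]),
        reverse=True,
    )


candidates = [
    {"id": "a", "relevance": 0.9, "quality_score": 55},
    {"id": "b", "relevance": 0.6, "quality_score": 90},
    {"id": "c", "relevance": 0.7, "quality_score": 70},
]
print([r["id"] for r in rank_with_quality_boost(candidates)])
```

In this example the most relevant result ("a") drops to last place because the two higher-quality datasets receive a boost.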
Error handling:
try:
    results = search_datasets(query="gesundheit", boost_quality=True)
except ToolError as e:
    print(f"Search with quality boost failed: {e}")
    # Fall back to a search without the quality boost
    results = search_datasets(query="gesundheit")
Finding related datasets
Ask for related datasets
"Find datasets related to [dataset-id]"
Claude finds datasets that share themes or keywords with the reference dataset, or that come from the same publisher.
Direct related dataset search
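related = find_related_datasets(
    dataset_id="bev-stat-wien-2024",
    min_score=20
)
Parameters:
- dataset_id: ID of the reference dataset
- min_score: Minimum similarity score a dataset must reach to be returned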
Scoring algorithm:
- Theme match: 30 points per shared theme (max 60)
- Keyword match: 10 points per shared keyword (max 30)
- Same publisher: 15 points bonus
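The scoring rules above can be sketched as a plain sum of the three components. This is an assumption for illustration: the guide does not state how the service combines, normalizes, or caps the components, so real `similarity_score` values may differ. The dictionary keys (`themes`, `keywords`, `publisher`) and the sample data are hypothetical.

```python
def similarity_score(reference: dict, candidate: dict) -> int:
    """Estimate similarity from shared themes, shared keywords, and publisher."""
    shared_themes = set(reference["themes"]) & set(candidate["themes"])
    shared_keywords = set(reference["keywords"]) & set(candidate["keywords"])

    score = min(30 * len(shared_themes), 60)      # 30 per shared theme, max 60
    score += min(10 * len(shared_keywords), 30)   # 10 per shared keyword, max 30
    if reference["publisher"] == candidate["publisher"]:
        score += 15                               # same-publisher bonus
    return score


reference = {
    "themes": ["SOCI", "REGI"],
    "keywords": ["population", "census", "districts"],
    "publisher": "Example Statistics Office",
}
candidate = {
    "themes": ["SOCI"],
    "keywords": ["population", "housing"],
    "publisher": "Example Statistics Office",
}
print(similarity_score(reference, candidate))
```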
Returns:
{
"reference_dataset_id": "bev-stat-wien-2024",
"related_datasets": [
{
"id": "wien-einwohner-bezirk",
"title": "Einwohnerinnen und Einwohner Wien nach Bezirken",
"similarity_score": 75,
"match_reasons": {
"theme_matches": ["SOCI", "REGI"],
"keyword_matches": ["bevölkerung", "wien"],
"same_publisher": true
}
}
]
}
Error handling:
try:
    related = find_related_datasets(dataset_id="bev-stat-wien-2024")
except ToolError as e:
    print(f"Related search failed: {e}")
Empty results:
if not related['related_datasets']:
    print("No related datasets found")
    # Try lowering min_score threshold
    related = find_related_datasets(
        dataset_id="bev-stat-wien-2024",
        min_score=10
    )
Interpreting metadata completeness
Understanding what's missing
Ask Claude: "What metadata is missing from dataset [id]?"
Claude explains which fields are absent and why they matter.
Programmatic completeness check
quality = analyze_dataset_quality(dataset_id="bev-stat-wien-2024")
metadata = quality['metadata']
# Critical fields for research
critical_fields = ['has_title', 'has_description', 'has_license', 'has_contact']
missing_critical = [f for f in critical_fields if not metadata.get(f)]
if missing_critical:
    print(f"Missing critical fields: {missing_critical}")
    print("Not recommended for research use")
# Check completeness score
if metadata['completeness_score'] < 70:
    print("Warning: Completeness below 70%")
    print("Missing fields may limit usability")
Field importance:
| Field | Importance | Impact if missing |
|---|---|---|
| Title | Critical | Can't identify dataset |
| Description | Critical | Can't assess relevance |
| License | Critical | Can't determine usage rights |
| Contact | High | Can't report issues |
| Keywords | Medium | Reduced discoverability |
| Themes | Medium | Can't filter by topic |
| Temporal | Low | Unknown time period |
| Spatial | Low | Unknown geographic coverage |
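One way to act on the table above is to group missing fields by importance, so the most serious gaps surface first. This is a sketch under the assumption that the `metadata` block uses the `has_*` flags shown in the sample response; the `FIELD_IMPORTANCE` mapping and the `missing_by_importance` helper are hypothetical names.

```python
# Importance levels taken from the field importance table
FIELD_IMPORTANCE = {
    "has_title": "Critical",
    "has_description": "Critical",
    "has_license": "Critical",
    "has_contact": "High",
    "has_keywords": "Medium",
    "has_themes": "Medium",
    "has_temporal": "Low",
    "has_spatial": "Low",
}


def missing_by_importance(metadata: dict) -> dict[str, list[str]]:
    """Group absent metadata fields by their importance level."""
    report = {"Critical": [], "High": [], "Medium": [], "Low": []}
    for flag, importance in FIELD_IMPORTANCE.items():
        if not metadata.get(flag):
            report[importance].append(flag.removeprefix("has_"))
    return report


# Metadata flags from the sample response earlier in this guide
sample_metadata = {
    "has_title": True, "has_description": True, "has_license": True,
    "has_contact": False, "has_keywords": True, "has_themes": True,
    "has_spatial": True, "has_temporal": False,
}
for level, fields in missing_by_importance(sample_metadata).items():
    if fields:
        print(f"{level}: missing {', '.join(fields)}")
```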
Troubleshooting
All datasets have low quality scores
Symptom: No results with score >70
Cause: Domain has sparse metadata
Solutions:
- Lower threshold to 60 for acceptable quality
- Check what metadata is missing (use analyze_dataset_quality)
- Verify that critical fields (license, description) are present even if the score is low
Quality analysis returns degraded
Symptom: degraded: true in response
Cause: Quality service temporarily unavailable
Solutions:
- Response includes cached metrics (may be stale)
- Verify critical fields manually from dataset metadata
- Retry after a few minutes if current metrics are needed
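The retry advice can be wrapped in a small helper. This is a minimal sketch: `fetch_fresh_quality` is a hypothetical name, the two-minute delay and three attempts are arbitrary defaults, and the quality call is passed in as a callable so the helper makes no assumptions about how your client invokes the tool.

```python
import time
from typing import Callable


def fetch_fresh_quality(
    fetch: Callable[[], dict],
    attempts: int = 3,
    delay_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the response is not degraded or attempts run out.

    Returns the last response, which may still be degraded (cached metrics).
    """
    result = fetch()
    for _ in range(attempts - 1):
        if not result.get("degraded"):
            break
        sleep(delay_seconds)  # wait before asking again
        result = fetch()
    return result
```

A call might look like `fetch_fresh_quality(lambda: analyze_dataset_quality(dataset_id="bev-stat-wien-2024"))`. If the result is still degraded afterwards, fall back to checking critical fields manually.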
Find related returns no results
Symptom: Empty related_datasets array
Cause: min_score is too high, or the dataset has little in common with others
Solutions:
- Lower min_score parameter (try 10-15)
- Check that the reference dataset has themes and keywords
- Search by same publisher or themes instead
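Lowering `min_score` step by step can be automated. The sketch below is illustrative: `find_related_with_fallback` is a hypothetical helper, the threshold sequence is an arbitrary choice, and the related-dataset tool is passed in as a callable that accepts `dataset_id` and `min_score` and returns a response with a `related_datasets` list.

```python
from typing import Callable, Sequence


def find_related_with_fallback(
    find_related: Callable[..., dict],
    dataset_id: str,
    thresholds: Sequence[int] = (20, 15, 10),
) -> dict:
    """Try progressively lower min_score values until something is returned."""
    response = {"reference_dataset_id": dataset_id, "related_datasets": []}
    for min_score in thresholds:
        response = find_related(dataset_id=dataset_id, min_score=min_score)
        if response["related_datasets"]:
            break  # stop at the strictest threshold that yields results
    return response
```

Stopping at the first non-empty response keeps the strictest threshold that still returns matches, so weaker matches are only admitted when nothing better exists.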
Quality score doesn't match expectations
Symptom: High-quality dataset shows low score
Cause: Missing optional metadata fields
Solutions:
- Review metadata breakdown in quality analysis
- Check whether critical fields are present (title, description, license)
- Consider metadata richness vs actual data quality
- Contact publisher to improve metadata
Next steps
- Searching Guide - Find datasets with quality filtering
- Data Preview Guide - Verify data structure
- API Reference - Complete tool documentation