
Assessing Dataset Quality

Evaluate dataset quality and discover related datasets

Determine dataset reliability through quality scores and discover related datasets for comparison.

When to use this guide

Use this guide when you need to:

  • Verify data quality before using a dataset in research or production
  • Understand what DQV quality scores mean
  • Find similar or related datasets
  • Compare quality across dataset options

Prerequisites

  • Dataset ID from search results
  • Understanding of your quality requirements (research vs exploration)

Understanding quality scores

Quality scores (0-100) measure metadata completeness, update frequency, and standards compliance.

Score ranges:

  • 85-100: Excellent (research, citations)
  • 70-84: Good (production applications)
  • 50-69: Acceptable (exploratory analysis)
  • <50: Poor (verify carefully)

Score components:

  • Completeness (40 points): Title, description, license, contact, keywords, themes, spatial, temporal coverage
  • Timeliness (20 points): Recent updates, modification date
  • Compliance (20 points): DCAT-AP standard compliance
  • Accessibility (20 points): Valid download links, open formats

Quick quality check

Ask Claude for quality assessment

"What's the quality score for dataset [dataset-id]?"

Claude runs analyze_dataset_quality and explains results.

What you'll learn:

  • Overall quality score (0-100)
  • What metadata is present or missing
  • Whether quality meets your needs

Direct quality analysis

quality = analyze_dataset_quality(dataset_id="bev-stat-wien-2024")

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| `dataset_id` | string | Dataset ID to analyze (e.g. from search results) |

Returns:

{
  "dataset_id": "bev-stat-wien-2024",
  "metadata": {
    "has_title": true,
    "has_description": true,
    "has_license": true,
    "has_contact": false,
    "has_keywords": true,
    "has_themes": true,
    "has_spatial": true,
    "has_temporal": false,
    "completeness_score": 75
  },
  "metrics": {
    "overall_score": 82,
    "completeness": 75,
    "timeliness": 12,
    "compliance": 18,
    "accessibility": 18
  },
  "degraded": false
}

Quality decision matrix:

| Score | Use for | Verification |
| --- | --- | --- |
| 85-100 | Research, citations | Proceed with confidence |
| 70-84 | Production apps | Verify critical fields present |
| 50-69 | Exploration | Check schema, preview data |
| <50 | Caution | Manual verification required |
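The decision matrix can be expressed as a small helper. This is an illustrative sketch (the tier labels come from the matrix above; the function name is ours, not part of the server's API):

```python
def recommend_use(score: int) -> tuple[str, str]:
    """Map an overall quality score (0-100) to the decision matrix above."""
    if score >= 85:
        return ("research, citations", "proceed with confidence")
    if score >= 70:
        return ("production apps", "verify critical fields present")
    if score >= 50:
        return ("exploration", "check schema, preview data")
    return ("caution", "manual verification required")

use_case, verification = recommend_use(82)
print(f"Use for: {use_case}; verification: {verification}")
```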

Error handling:

Degraded service:

if quality.get('degraded'):
    print("Quality service degraded - using cached metrics")
    # Verify critical fields manually

Dataset not found:

{"error": "NotFoundError", "message": "Dataset not found"}

Solution: The dataset ID may be incorrect or stale; re-run the search to obtain a current ID.

Search with quality preference

"Find high-quality datasets about health"

Claude enables boost_quality automatically when quality is mentioned.

search_datasets(
    query="gesundheit",
    boost_quality=True
)

What happens:

  • Datasets with score >80 get 2x relevance boost
  • Datasets with score 60-80 get 1.5x relevance boost
  • Results sorted by boosted relevance

Note: The quality boost is only active when a query is provided (not on facet-only searches)
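The boost rules above can be sketched as a re-ranking step. This is illustrative only (the multipliers come from the list above; the exact server-side formula and field names are assumptions):

```python
def boosted_relevance(base_relevance: float, quality_score: int) -> float:
    """Apply the quality boost described above (sketch; server logic may differ)."""
    if quality_score > 80:
        return base_relevance * 2.0   # high quality: 2x boost
    if quality_score >= 60:
        return base_relevance * 1.5   # mid quality: 1.5x boost
    return base_relevance             # no boost below 60

# Re-rank hypothetical results by boosted relevance
results = [
    {"id": "a", "relevance": 0.9, "quality": 55},
    {"id": "b", "relevance": 0.7, "quality": 85},
]
results.sort(key=lambda r: boosted_relevance(r["relevance"], r["quality"]),
             reverse=True)
```

With these multipliers, dataset `b` (0.7 × 2.0 = 1.4) outranks dataset `a` (0.9, no boost) despite a lower base relevance.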

Error handling:

try:
    results = search_datasets(query="gesundheit", boost_quality=True)
except ToolError as e:
    print(f"Search with quality boost failed: {e}")
    # Try without boost
    results = search_datasets(query="gesundheit")

Find related datasets

"Find datasets related to [dataset-id]"

Claude finds datasets with similar themes, keywords, or same publisher.

related = find_related_datasets(
    dataset_id="bev-stat-wien-2024",
    min_score=20
)

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| `dataset_id` | string | Reference dataset ID |
| `min_score` | number | Minimum similarity score for results |

Scoring algorithm:

  • Theme match: 30 points per shared theme (max 60)
  • Keyword match: 10 points per shared keyword (max 30)
  • Same publisher: 15 points bonus
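The scoring rules above can be sketched as follows. This is a minimal sketch of the stated algorithm; the dictionary field names are illustrative, not the server's internal representation:

```python
def similarity_score(ref: dict, candidate: dict) -> int:
    """Score a candidate against a reference dataset per the rules above."""
    shared_themes = set(ref["themes"]) & set(candidate["themes"])
    shared_keywords = set(ref["keywords"]) & set(candidate["keywords"])
    score = min(30 * len(shared_themes), 60)      # 30 per theme, capped at 60
    score += min(10 * len(shared_keywords), 30)   # 10 per keyword, capped at 30
    if ref["publisher"] == candidate["publisher"]:
        score += 15                               # same-publisher bonus
    return score
```

Two shared themes, one shared keyword, and the same publisher would score 60 + 10 + 15 = 85 under this sketch.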

Returns:

{
  "reference_dataset_id": "bev-stat-wien-2024",
  "related_datasets": [
    {
      "id": "wien-einwohner-bezirk",
      "title": "Einwohnerinnen und Einwohner Wien nach Bezirken",
      "similarity_score": 75,
      "match_reasons": {
        "theme_matches": ["SOCI", "REGI"],
        "keyword_matches": ["bevölkerung", "wien"],
        "same_publisher": true
      }
    }
  ]
}

Error handling:

try:
    related = find_related_datasets(dataset_id="bev-stat-wien-2024")
except ToolError as e:
    print(f"Related search failed: {e}")

Empty results:

if len(related['related_datasets']) == 0:
    print("No related datasets found")
    # Try lowering min_score threshold
    related = find_related_datasets(
        dataset_id="bev-stat-wien-2024",
        min_score=10
    )

Interpreting metadata completeness

Understanding what's missing

Ask Claude: "What metadata is missing from dataset [id]?"

Claude explains which fields are absent and why they matter.

Programmatic completeness check

quality = analyze_dataset_quality(dataset_id="bev-stat-wien-2024")
metadata = quality['metadata']

# Critical fields for research
critical_fields = ['has_title', 'has_description', 'has_license', 'has_contact']
missing_critical = [f for f in critical_fields if not metadata.get(f)]

if missing_critical:
    print(f"Missing critical fields: {missing_critical}")
    print("Not recommended for research use")

# Check completeness score
if metadata['completeness_score'] < 70:
    print("Warning: Completeness below 70%")
    print("Missing fields may limit usability")

Field importance:

| Field | Importance | Impact if missing |
| --- | --- | --- |
| Title | Critical | Can't identify dataset |
| Description | Critical | Can't assess relevance |
| License | Critical | Can't determine usage rights |
| Contact | High | Can't report issues |
| Keywords | Medium | Reduced discoverability |
| Themes | Medium | Can't filter by topic |
| Temporal | Low | Unknown time period |
| Spatial | Low | Unknown geographic coverage |
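These importance tiers can drive a simple triage of the `has_*` flags in the quality response. A sketch (the tier mapping follows the table above; the helper itself is ours):

```python
# Importance tiers from the table above, keyed by the has_* flags
# returned in the quality response's metadata section.
IMPORTANCE = {
    "has_title": "critical",
    "has_description": "critical",
    "has_license": "critical",
    "has_contact": "high",
    "has_keywords": "medium",
    "has_themes": "medium",
    "has_temporal": "low",
    "has_spatial": "low",
}

def missing_by_importance(metadata: dict) -> dict:
    """Group absent metadata fields by their importance tier."""
    missing: dict = {}
    for field, tier in IMPORTANCE.items():
        if not metadata.get(field):
            missing.setdefault(tier, []).append(field)
    return missing
```

Applied to the earlier example response (missing contact and temporal coverage), this yields one high-importance and one low-importance gap but no critical ones.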

Troubleshooting

All datasets have low quality scores

Symptom: No results with score >70

Cause: Domain has sparse metadata

Solutions:

  1. Lower your acceptance threshold to 60 (acceptable quality)
  2. Check which metadata is missing (use analyze_dataset_quality)
  3. Verify that critical fields (license, description) are present even if the score is low

Quality analysis returns degraded

Symptom: degraded: true in response

Cause: Quality service temporarily unavailable

Solutions:

  1. Use the cached metrics included in the response (they may be stale)
  2. Verify critical fields manually from the dataset metadata
  3. Retry after a few minutes if current metrics are needed
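The retry approach can be sketched as below. The `fetch` callable stands in for `analyze_dataset_quality` so the sketch stays self-contained; the 3-attempt / 60-second policy is an illustrative choice, not part of the server's API:

```python
import time

def quality_with_retry(fetch, dataset_id: str,
                       attempts: int = 3, delay: float = 60.0) -> dict:
    """Call a quality tool until it returns non-degraded metrics.

    `fetch` would be analyze_dataset_quality in practice. Falls back to
    the last (possibly cached) response if all attempts are degraded.
    """
    quality: dict = {}
    for attempt in range(attempts):
        quality = fetch(dataset_id=dataset_id)
        if not quality.get("degraded"):
            return quality
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before retrying
    return quality  # metrics may be cached/stale
```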

No related datasets found

Symptom: Empty related_datasets array

Cause: min_score too high or dataset very unique

Solutions:

  1. Lower min_score parameter (try 10-15)
  2. Check reference dataset has themes and keywords
  3. Search by same publisher or themes instead

Quality score doesn't match expectations

Symptom: High-quality dataset shows low score

Cause: Missing optional metadata fields

Solutions:

  1. Review metadata breakdown in quality analysis
  2. Check if critical fields present (title, description, license)
  3. Consider metadata richness vs actual data quality
  4. Contact publisher to improve metadata
