MediaWiki to Avro Converter

Transform MediaWiki tables into Apache Avro schemas with automatic type detection

About MediaWiki to Avro Converter

Convert MediaWiki tables to Apache Avro schema format with automatic type detection and data serialization. Perfect for big data applications using Hadoop, Kafka, and other Apache ecosystem tools.

Key Features

  • Automatic Type Detection: Intelligently detects int, long, double, boolean, and string types
  • Nullable Fields: All fields support null values for flexibility
  • Custom Schema Names: Configure schema name and namespace
  • Sample Data Generation: Optionally include JSON data for testing
  • Field Name Sanitization: Converts headers to valid Avro field names
  • Documentation: Preserves original header names in field docs
  • File Download: Save as .avsc schema file
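
The field name sanitization step can be sketched in a few lines. The tool's exact rules are not published, so this is an illustrative heuristic (the `sanitize_field_name` helper is hypothetical) that produces names satisfying Avro's requirement that field names match `[A-Za-z_][A-Za-z0-9_]*`:

```python
import re

def sanitize_field_name(header: str) -> str:
    """Turn a table header into a valid Avro field name (illustrative heuristic)."""
    # Replace runs of characters that are invalid in Avro names with underscores.
    name = re.sub(r"[^A-Za-z0-9_]+", "_", header.strip()).strip("_").lower()
    # Avro field names must not start with a digit (and must not be empty).
    if not name or name[0].isdigit():
        name = "_" + name
    return name

print(sanitize_field_name("Name"))     # name
print(sanitize_field_name("User ID"))  # user_id
```

The original header survives unchanged in the field's "doc" attribute, so no information is lost by the renaming.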

How to Use

  1. Input MediaWiki Table: Paste your MediaWiki table or upload a .wiki file
  2. Configure Schema: Set schema name and namespace
  3. Choose Options: Toggle sample data inclusion
  4. Review Output: The Avro schema generates automatically
  5. Copy or Download: Use the Copy or Download button to save your schema

Type Detection

  • int: Positive integers that fit in a 32-bit int
  • long: Negative integers, or integers too large for a 32-bit int
  • double: Decimal numbers
  • boolean: true/false values
  • string: All other text data
  • null: Empty cells are treated as null
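
The rules above can be expressed as a small function. This is a sketch that mirrors the documented behavior, not the converter's actual code:

```python
def detect_avro_type(value: str) -> str:
    """Map one cell value to an Avro type, following the rules listed above."""
    v = value.strip()
    if v == "":
        return "null"                      # empty cell
    if v.lower() in ("true", "false"):
        return "boolean"
    try:
        n = int(v)
    except ValueError:
        pass
    else:
        # Positive integers within 32-bit range -> int; negative or larger -> long.
        return "int" if 0 <= n <= 2**31 - 1 else "long"
    try:
        float(v)
        return "double"                    # decimal number
    except ValueError:
        return "string"                    # everything else

print(detect_avro_type("28"))    # int
print(detect_avro_type("-5"))    # long
print(detect_avro_type("3.14"))  # double
```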

Example Conversion

MediaWiki Input:

{| class="wikitable" border="1"
! Name !! Age !! City !! Active
|-
| John Doe || 28 || New York || true
|-
| Jane Smith || 34 || London || false
|}

Avro Schema Output:

{
  "type": "record",
  "name": "TableData",
  "namespace": "com.example",
  "doc": "Generated from MediaWiki table",
  "fields": [
    {
      "name": "name",
      "type": ["null", "string"],
      "doc": "Name"
    },
    {
      "name": "age",
      "type": ["null", "int"],
      "doc": "Age"
    },
    {
      "name": "city",
      "type": ["null", "string"],
      "doc": "City"
    },
    {
      "name": "active",
      "type": ["null", "boolean"],
      "doc": "Active"
    }
  ]
}
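
The conversion above can be reproduced with a short standalone script. This is a minimal sketch under simplifying assumptions (one header row, no cell markup), not the tool's implementation; the `detect` and `wikitable_to_avro` helpers are hypothetical:

```python
import json
import re

# The example table from above.
WIKI = """{| class="wikitable" border="1"
! Name !! Age !! City !! Active
|-
| John Doe || 28 || New York || true
|-
| Jane Smith || 34 || London || false
|}"""

def detect(v: str) -> str:
    """Apply the type-detection rules to one non-empty cell."""
    if v.lower() in ("true", "false"):
        return "boolean"
    try:
        n = int(v)
        return "int" if 0 <= n <= 2**31 - 1 else "long"
    except ValueError:
        pass
    try:
        float(v)
        return "double"
    except ValueError:
        return "string"

def wikitable_to_avro(wiki: str, name: str = "TableData",
                      namespace: str = "com.example") -> dict:
    """Parse a simple wikitable and build an Avro record schema dict."""
    headers, rows = [], []
    for line in wiki.splitlines():
        line = line.strip()
        if line.startswith("!"):                      # header row: ! A !! B !! C
            headers = [h.strip() for h in line.lstrip("! ").split("!!")]
        elif line.startswith("|") and line not in ("|-", "|}"):
            rows.append([c.strip() for c in line.lstrip("| ").split("||")])
    fields = []
    for i, header in enumerate(headers):
        values = [r[i] for r in rows if i < len(r) and r[i] != ""]
        ftype = detect(values[0]) if values else "string"
        fields.append({
            "name": re.sub(r"[^a-z0-9_]+", "_", header.lower()).strip("_"),
            "type": ["null", ftype],       # every field is nullable
            "doc": header,                 # original header preserved
        })
    return {"type": "record", "name": name, "namespace": namespace,
            "doc": "Generated from MediaWiki table", "fields": fields}

print(json.dumps(wikitable_to_avro(WIKI), indent=2))
```

Running the script prints a schema equivalent to the output shown above, with the four fields typed string, int, string, and boolean.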

Common Use Cases

  • Hadoop Integration: Define schemas for Hadoop data processing
  • Kafka Streaming: Create schemas for Kafka message serialization
  • Data Lakes: Structure data for Apache Parquet and ORC formats
  • ETL Pipelines: Define data contracts for ETL processes
  • Wiki to Data: Convert Wikipedia tables to big data formats

About Apache Avro

Apache Avro is a data serialization system that provides rich data structures, a compact binary format, and schema evolution capabilities. It's widely used in big data ecosystems for efficient data storage and transmission.

Frequently Asked Questions (FAQ)

  • Is the generated schema ready to use with Schema Registry or Kafka immediately?

    Yes, the schema is valid Avro JSON, but you may want to adjust the "name", "namespace", and documentation fields to match your organization's conventions before registering it with a schema registry or using it in Kafka.

  • How are nullable fields represented in the schema?

    Each field uses a union type of ["null", "type"], which is Avro's standard way to represent optional fields. Empty cells in your table are mapped to null in the sample data.
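
The mapping from cells to sample-data values under a ["null", type] union can be sketched as follows; the `cell_to_sample` helper is hypothetical, not the tool's code:

```python
import json

def cell_to_sample(value: str, avro_type: str):
    """Coerce one cell to a sample-data value for a ["null", avro_type] union.
    Empty cells become None, which serializes as JSON null."""
    if value.strip() == "":
        return None
    if avro_type in ("int", "long"):
        return int(value)
    if avro_type == "double":
        return float(value)
    if avro_type == "boolean":
        return value.strip().lower() == "true"
    return value

record = {"age": cell_to_sample("", "int"), "name": cell_to_sample("Jane", "string")}
print(json.dumps(record))  # {"age": null, "name": "Jane"}
```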

  • What if the type detection doesn't match my expected data types?

    The converter uses simple heuristics based on the cell values. If you need stricter or different types (for example, using "string" instead of "int"), you can edit the generated schema JSON manually before deploying it.

  • Can I disable the sample data generation?

    Yes. Use the "Include Sample Data" option in the tool. When disabled, only the Avro schema is generated without the accompanying JSON sample records.

  • Is any of my MediaWiki or Avro data stored or transmitted?

    No. All schema generation and type detection happen entirely in your browser. Your MediaWiki input, generated schema, and sample data are not sent to any external servers.

Privacy & Security

All conversions happen locally in your browser. Your MediaWiki data is never uploaded to any server, ensuring complete privacy and security.