BigQuery use_avro_logical_types ignored in Python script


I'm trying to load AVRO files to BigQuery using a Python script. This process itself succeeds, but I'm having some trouble getting BigQuery to use AVRO's logical data types during table creation.

Using these logical types is documented by Google here and was added to the google-cloud-python libraries here.

I'm not a coder by profession, but I'd expect the snippet below to be correct... The use_avro_logical_types property, however, seems to be ignored, and timestamps are loaded as INT instead of TIMESTAMP.

...
with open(full_name, 'rb') as source_file:
    var_job_config = google.cloud.bigquery.job.LoadJobConfig()
    var_job_config.source_format = 'AVRO'
    var_job_config.use_avro_logical_types = True
    job = client.load_table_from_file(
        source_file, table_ref, job_config=var_job_config)
    job.result()  # Waits for job to complete
...

The AVRO schema is as follows:

{
    "type": "record",
    "name": "table_test",
    "fields": [{
        "name": "id_",
        "type": {
            "type": "bytes",
            "logicalType": "decimal",
            "precision": 29,
            "scale": 0
        }
    },
    {
        "name": "datetime_",
        "type": ["null",
        {
            "type": "long",
            "logicalType": "timestamp-micros"
        }]
    },
    {
        "name": "integer_",
        "type": ["null",
        {
            "type": "bytes",
            "logicalType": "decimal",
            "precision": 29,
            "scale": 0
        }]
    },
    {
        "name": "varchar_",
        "type": ["null",
        {
            "type": "string",
            "logicalType": "varchar",
            "maxLength": 60
        }]
    },
    {
        "name": "capture_time",
        "type": {
            "type": "long",
            "logicalType": "timestamp-millis"
        }
    },
    {
        "name": "op_type",
        "type": "int"
    },
    {
        "name": "seq_no",
        "type": {
            "type": "string",
            "logicalType": "varchar",
            "maxLength": 16
        }
    }]
}

Can anyone elaborate on this issue? Thanks!

Answer

The issue you're encountering, where AVRO logical types such as timestamp-micros and decimal are loaded as INT rather than their respective types (e.g., TIMESTAMP or NUMERIC), usually comes down to one of two things: the use_avro_logical_types property not actually taking effect, or the version of the google-cloud-bigquery library you are using not yet supporting it.

Key Points to Consider:

  1. Ensure You Have the Correct Version of google-cloud-bigquery:

    As you mentioned, the ability to handle AVRO logical types was introduced in PR 6827 for the google-cloud-python library. Make sure you are using a version of the library that includes this change (at least version 2.8.0). You can check your installed version by running:

    pip show google-cloud-bigquery
    

    If you're not on the correct version, you can update it with:

    pip install --upgrade google-cloud-bigquery
    
  2. Check the Correct Use of use_avro_logical_types:

    The use_avro_logical_types option maps certain AVRO logical types (e.g., timestamp-millis, timestamp-micros, decimal) to their corresponding BigQuery types. You are setting it correctly in your script, but it's worth confirming that the flag actually ends up in the job configuration that is sent to BigQuery (see the sketch right after this list).

  3. Logical Types Mapping:

    Make sure the logical types in the AVRO schema are correctly mapped to BigQuery types. For example:

    • timestamp-micros and timestamp-millis should map to the TIMESTAMP type in BigQuery.
    • decimal should map to the NUMERIC type.

    If these mappings are not applied, BigQuery falls back to the underlying AVRO primitive types instead, which is why the long-backed timestamps show up as INT.

  4. Potential AVRO Schema Changes:

    In your AVRO schema, the nullable fields are declared as unions (e.g., "type": ["null", {...}]). BigQuery converts a ["null", <type>] union into a NULLABLE column of that type, and the logical type inside the union is still honored when the flag is set, so the unions are unlikely to be the cause here; you can confirm the NULLABLE mode with the schema check shown further below.

  5. Debugging the Loaded Schema:

    After running the load job, check the schema of the loaded table to see how the data types are being interpreted. You can do this by querying the table schema in BigQuery:

    table = client.get_table(table_ref)
    print("Table Schema: ", table.schema)
    

    This will give you the exact data types BigQuery has assigned to your columns, which can help identify mismatches.
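
Before running a full load, a quick sanity check along these lines can confirm both the installed library version and that the flag actually ends up in the job configuration sent to the API. This is a minimal sketch and only assumes that google-cloud-bigquery is installed:

import google.cloud.bigquery

# Confirm the installed library version supports AVRO logical types
print("google-cloud-bigquery version:", google.cloud.bigquery.__version__)

# Build the load configuration the same way as in the question
job_config = google.cloud.bigquery.job.LoadJobConfig()
job_config.source_format = 'AVRO'
job_config.use_avro_logical_types = True

# The API representation should contain "useAvroLogicalTypes": True under "load";
# if it is missing here, the flag never reaches BigQuery at all
print(job_config.to_api_repr())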

Example:

Here’s an updated version of your code snippet with an added step to check the schema after the load job. This can help identify if the logical types are being mapped correctly.

import google.cloud.bigquery

# Create a BigQuery client
client = google.cloud.bigquery.Client()

# Full table ID (replace with your actual project, dataset and table)
table_ref = 'your_project.your_dataset.your_table'

# Path to the AVRO file to load (replace with your own path)
full_name = '/path/to/your_file.avro'

# Load the AVRO file
with open(full_name, 'rb') as source_file:
    job_config = google.cloud.bigquery.job.LoadJobConfig()
    job_config.source_format = 'AVRO'
    job_config.use_avro_logical_types = True

    job = client.load_table_from_file(
        source_file, table_ref, job_config=job_config
    )

    job.result()  # Waits for the job to complete

# Check the schema of the loaded table to ensure correct types
table = client.get_table(table_ref)
print("Table Schema: ", table.schema)

Possible Issue with Your AVRO Schema:

One thing that might cause confusion is how the AVRO schema is interpreted. Note that varchar is not one of the logical types BigQuery recognizes, so those fields load as STRING simply because their underlying AVRO type is string, while timestamp-micros should map to TIMESTAMP when the flag is set.

Here’s a quick recap of some AVRO logical types and their expected BigQuery types:

  • timestamp-millis → TIMESTAMP
  • timestamp-micros → TIMESTAMP
  • decimal → NUMERIC
  • varchar → STRING (via the underlying string type)

Ensure that your AVRO schema's logical types are mapped as intended.

A Note on Your Schema:

Comparing your AVRO schema against the mapping above, the logical types are already declared correctly (decimal with precision and scale on bytes, timestamp-micros and timestamp-millis on long), so the schema itself should not need any changes. The issue is more likely in the load configuration or the library version than in the schema.

Final Considerations:

  1. Ensure that the library you're using is recent enough (>= 2.8.0) to handle AVRO logical types correctly.
  2. Check if the schema of your table in BigQuery matches the expected types (e.g., TIMESTAMP for timestamp types, NUMERIC for decimal types).
  3. If the issue persists, there may be specific constraints or limitations with how the schema is being loaded, and contacting Google Cloud support could provide more insight.

Let me know if these suggestions help, or if you encounter further issues!