I'm trying to load AVRO files to BigQuery using a Python script. This process itself succeeds, but I'm having some trouble getting BigQuery to use AVRO's logical data types during table creation.
Using these logical types is documented by Google here and was added to the google-cloud-python libraries here.
I'm not a coder by profession, but I'd expect the snippet below to be correct... The use_avro_logical_types property, however, seems to be ignored and timestamps are loaded as INTEGER instead of TIMESTAMP.
...
with open(full_name, 'rb') as source_file:
    var_job_config = google.cloud.bigquery.job.LoadJobConfig()
    var_job_config.source_format = 'AVRO'
    var_job_config.use_avro_logical_types = True
    job = client.load_table_from_file(
        source_file, table_ref, job_config=var_job_config)
    job.result()  # Waits for the job to complete
...
The AVRO schema is as follows:
{
  "type": "record",
  "name": "table_test",
  "fields": [{
      "name": "id_",
      "type": {
        "type": "bytes",
        "logicalType": "decimal",
        "precision": 29,
        "scale": 0
      }
    },
    {
      "name": "datetime_",
      "type": ["null",
        {
          "type": "long",
          "logicalType": "timestamp-micros"
        }]
    },
    {
      "name": "integer_",
      "type": ["null",
        {
          "type": "bytes",
          "logicalType": "decimal",
          "precision": 29,
          "scale": 0
        }]
    },
    {
      "name": "varchar_",
      "type": ["null",
        {
          "type": "string",
          "logicalType": "varchar",
          "maxLength": 60
        }]
    },
    {
      "name": "capture_time",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      }
    },
    {
      "name": "op_type",
      "type": "int"
    },
    {
      "name": "seq_no",
      "type": {
        "type": "string",
        "logicalType": "varchar",
        "maxLength": 16
      }
    }]
}
Can anyone elaborate on this issue? Thanks!
Answer
The issue you're encountering, where logical AVRO types such as timestamp-micros and decimal are loaded as INTEGER in BigQuery rather than their respective types (e.g. TIMESTAMP or NUMERIC), could be due to the use_avro_logical_types property not taking effect, or to certain logical types not being supported as expected in the version of the google-cloud-bigquery library you are using.
Key Points to Consider:
- Ensure you have a version of google-cloud-bigquery that supports logical types: As you mentioned, the ability to handle AVRO logical types was introduced in PR 6827 for the google-cloud-python library, so make sure the version you have installed includes that change. You can check your installed version by running:
pip show google-cloud-bigquery
If you're not on a recent enough version, you can update it with:
pip install --upgrade google-cloud-bigquery
- Check the correct use of use_avro_logical_types: The use_avro_logical_types option is designed to map certain logical types from AVRO (e.g. timestamp-millis, timestamp-micros, decimal) into their corresponding BigQuery types. You are setting it correctly in your script, but it's worth double-checking that it's actually applied to the job configuration that gets submitted (see the short sketch after this list).
- Logical types mapping: Make sure the logical types in the AVRO schema are correctly mapped to BigQuery types. For example, timestamp-micros and timestamp-millis should map to the TIMESTAMP type in BigQuery, and decimal should map to the NUMERIC type. If these mappings are not applied, BigQuery falls back to the underlying AVRO primitive types, which is why your timestamp columns show up as INTEGER.
- Potential AVRO schema changes: In your AVRO schema, the nullable union types (e.g. "type": ["null", {...}]) can sometimes cause issues if not handled properly, as BigQuery may not interpret them in the way you'd expect. You could try modifying the schema slightly to confirm that the nullable fields are processed correctly.
- Debugging the loaded schema: After running the load job, check the schema of the loaded table to see how the data types were interpreted. You can do this by retrieving the table from BigQuery:
table = client.get_table(table_ref)
print("Table Schema: ", table.schema)
This will give you the exact data types BigQuery has assigned to your columns, which can help identify mismatches.
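One way to double-check the second point above is to pass the options directly to the LoadJobConfig constructor and print the property back just before submitting the job. This is only a minimal sketch; it assumes client, table_ref and full_name are defined as in your script, and that your library version accepts properties as constructor keyword arguments:

import google.cloud.bigquery

# Properties can also be passed as constructor keyword arguments
job_config = google.cloud.bigquery.job.LoadJobConfig(
    source_format='AVRO',
    use_avro_logical_types=True,
)
print(job_config.use_avro_logical_types)  # should print True

with open(full_name, 'rb') as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)
    job.result()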
Example:
Here’s an updated version of your code snippet with an added step to check the schema after the load job. This can help identify if the logical types are being mapped correctly.
import google.cloud.bigquery

# Create a BigQuery client
client = google.cloud.bigquery.Client()

# Table reference (replace with your actual dataset and table)
table_ref = client.dataset('your_dataset').table('your_table')

# Path to the AVRO file to load
full_name = '/path/to/your/file.avro'

# Load the AVRO file
with open(full_name, 'rb') as source_file:
    job_config = google.cloud.bigquery.job.LoadJobConfig()
    job_config.source_format = 'AVRO'
    job_config.use_avro_logical_types = True
    job = client.load_table_from_file(
        source_file, table_ref, job_config=job_config
    )
    job.result()  # Waits for the job to complete

# Check the schema of the loaded table to ensure correct types
table = client.get_table(table_ref)
print("Table Schema: ", table.schema)
Possible Issue with Your AVRO Schema:
One thing that might cause issues is how the AVRO schema is being processed. Ensure that your schema defines the correct logical types. For example, the varchar
logical type should be mapped to STRING
in BigQuery, and timestamp-micros
should map to TIMESTAMP
.
Here’s a quick recap of some AVRO logical types and their expected BigQuery types:
- timestamp-millis → TIMESTAMP
- timestamp-micros → TIMESTAMP
- decimal → NUMERIC
- varchar → STRING
Ensure that your AVRO schema's logical types are mapped as intended.
Example Schema Fix:
Here’s a corrected AVRO schema snippet with a simplified structure, ensuring that logical types are used properly:
{
"type": "record",
"name": "table_test",
"fields": [
{
"name": "id_",
"type": {
"type": "bytes",
"logicalType": "decimal",
"precision": 29,
"scale": 0
}
},
{
"name": "datetime_",
"type": [
"null",
{
"type": "long",
"logicalType": "timestamp-micros"
}
]
},
{
"name": "integer_",
"type": [
"null",
{
"type": "bytes",
"logicalType": "decimal",
"precision": 29,
"scale": 0
}
]
},
{
"name": "varchar_",
"type": [
"null",
{
"type": "string",
"logicalType": "varchar",
"maxLength": 60
}
]
},
{
"name": "capture_time",
"type": {
"type": "long",
"logicalType": "timestamp-millis"
}
},
{
"name": "op_type",
"type": "int"
},
{
"name": "seq_no",
"type": {
"type": "string",
"logicalType": "varchar",
"maxLength": 16
}
}
]
}
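If you want to rule out the file itself, you could also generate a tiny AVRO file with this schema and load it with the same job configuration. The sketch below uses the fastavro package (an assumption on my part, not something from your setup), shortens the schema to two fields for illustration, and reuses the client, table_ref and job_config from the example above:

import datetime
import decimal
import io

from fastavro import parse_schema, writer

# Shortened version of the schema above, for illustration only
schema = {
    "type": "record",
    "name": "table_test",
    "fields": [
        {"name": "id_", "type": {"type": "bytes", "logicalType": "decimal",
                                 "precision": 29, "scale": 0}},
        {"name": "capture_time", "type": {"type": "long",
                                          "logicalType": "timestamp-millis"}},
    ],
}

# One hypothetical record using Python types that fastavro maps to the logical types
records = [
    {"id_": decimal.Decimal("1"),
     "capture_time": datetime.datetime.now(datetime.timezone.utc)},
]

# Write the AVRO data to an in-memory file and load it with the same job configuration
buf = io.BytesIO()
writer(buf, parse_schema(schema), records)
buf.seek(0)

job = client.load_table_from_file(buf, table_ref, job_config=job_config)
job.result()
print(client.get_table(table_ref).schema)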
Final Considerations:
- Ensure that the library you're using is recent enough to handle AVRO logical types correctly.
- Check whether the schema of your table in BigQuery matches the expected types (e.g. TIMESTAMP for timestamp types, NUMERIC for decimal types); the query sketch below is one way to do that.
- If the issue persists, there may be specific constraints or limitations in how the schema is being loaded, and contacting Google Cloud support could provide more insight.
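For the second point, a quick way to inspect the resulting column types from SQL is to query the dataset's INFORMATION_SCHEMA.COLUMNS view. A minimal sketch, using the placeholder dataset and table names from the earlier example:

query = """
    SELECT column_name, data_type
    FROM `your_dataset.INFORMATION_SCHEMA.COLUMNS`
    WHERE table_name = 'your_table'
"""
for row in client.query(query).result():
    print(row.column_name, row.data_type)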
Let me know if these suggestions help, or if you encounter further issues!