I'm trying to load data into a Neo4j database from an XML file using py2neo.
The Python script below works, but it's too slow, since I add all the nodes first and then the relationships in a second pass, with two exception handlers. On top of that, the XML file is around 200 MB.
I'm wondering if there is a faster way to perform this task?
XML file:
<Persons>
  <person>
    <id>XA123</id>
    <first_name>Adam</first_name>
    <last_name>John</last_name>
    <phone>01-12322222</phone>
  </person>
  <person>
    <id>XA7777</id>
    <first_name>Anna</first_name>
    <last_name>Watson</last_name>
    <relationship>
      <type>Friends</type>
      <to>XA123</to>
    </relationship>
  </person>
</Persons>
python script:
#!/usr/bin/python3
from xml.dom import minidom
from py2neo import Graph, Node, Relationship, authenticate

graph = Graph("http://localhost:7474/db/data/")
authenticate("localhost:7474", "neo4j", "admin")

xml_file = open("data.xml")
xml_doc = minidom.parse(xml_file)
persons = xml_doc.getElementsByTagName('person')

# Adding Nodes
for person in persons:
    ID_ = person.getElementsByTagName('id')[0].firstChild.data
    fName = person.getElementsByTagName('first_name')[0].firstChild.data
    lName = person.getElementsByTagName('last_name')[0].firstChild.data
    # not every person has a phone number
    try:
        phone = person.getElementsByTagName('phone')[0].firstChild.data
    except IndexError:
        phone = "None"
    label = "Person"
    node = Node(label, ID=ID_, LastName=fName, FirstName=lName, Phone=phone)
    graph.create(node)

# Adding Relationships
for person in persons:
    ID_ = person.getElementsByTagName('id')[0].firstChild.data
    label = "Person"
    node1 = graph.find_one(label, property_key="ID", property_value=ID_)
    # relationships
    try:
        has_relations = person.getElementsByTagName('relationship')
        for relation in has_relations:
            node2 = graph.find_one(label,
                                   property_key="ID",
                                   property_value=relation.getElementsByTagName('to')[0].firstChild.data)
            relationship = Relationship(node1,
                                        relation.getElementsByTagName('type')[0].firstChild.data,
                                        node2)
            graph.create(relationship)
    except IndexError:
        continue
Answer
To speed up importing data from your XML file into a Neo4j database, there are several optimizations you can make. The main bottlenecks in your script are:
- Many round trips to the database: you create the nodes first, then make a separate pass to create the relationships, and every graph.create() call is committed individually. Each interaction is slow on its own, and they add up over a 200 MB file.
- Repeated searches for existing nodes: graph.find_one() is slow when it is called individually, inside a loop, for every person and every relationship target.

Here is how to address these issues:
1. Batching operations: instead of performing an individual graph.create() call for each node and relationship, batch your operations inside transactions. This significantly improves performance, because Neo4j handles bulk writes far more efficiently than many small commits.
2. Using py2neo's NodeMatcher for lookups: instead of repeatedly calling graph.find_one(), use NodeMatcher, its replacement in py2neo v4+. Lookups by property are fast once that property is indexed (see the tips at the end).
3. Transaction handling: by grouping node and relationship creation into transactions, you reduce the overhead caused by many separate commits to the database.
Optimized Python Script:
#!/usr/bin/python3
from xml.dom import minidom
from py2neo import Graph, Node, Relationship, NodeMatcher

# NodeMatcher requires py2neo v4+, where authenticate() no longer exists;
# credentials are passed to Graph() instead.
graph = Graph("http://localhost:7474", auth=("neo4j", "admin"))

xml_doc = minidom.parse("data.xml")
persons = xml_doc.getElementsByTagName('person')

# Use NodeMatcher for efficient lookup
matcher = NodeMatcher(graph)

# Collect nodes and relationships and write them in batches
batch_size = 1000  # Adjust based on your system's memory and performance
nodes = []
relationships = []

# Add Nodes in Batches
for person in persons:
    ID_ = person.getElementsByTagName('id')[0].firstChild.data
    fName = person.getElementsByTagName('first_name')[0].firstChild.data
    lName = person.getElementsByTagName('last_name')[0].firstChild.data
    # Not every person has a phone number
    try:
        phone = person.getElementsByTagName('phone')[0].firstChild.data
    except IndexError:
        phone = "None"
    # Create the node and store it in the list
    node = Node("Person", ID=ID_, FirstName=fName, LastName=lName, Phone=phone)
    nodes.append(node)
    if len(nodes) >= batch_size:
        # Write this batch of nodes in a single transaction
        tx = graph.begin()
        for n in nodes:
            tx.create(n)
        tx.commit()
        nodes = []  # Clear the node batch after inserting

# Insert any remaining nodes
if nodes:
    tx = graph.begin()
    for n in nodes:
        tx.create(n)
    tx.commit()

# Add Relationships in Batches
for person in persons:
    ID_ = person.getElementsByTagName('id')[0].firstChild.data
    node1 = matcher.match("Person", ID=ID_).first()  # Faster lookup
    # Process relationships
    try:
        has_relations = person.getElementsByTagName('relationship')
        for relation in has_relations:
            to_id = relation.getElementsByTagName('to')[0].firstChild.data
            node2 = matcher.match("Person", ID=to_id).first()
            if node2:  # Check that the target node exists
                rel_type = relation.getElementsByTagName('type')[0].firstChild.data
                relationships.append(Relationship(node1, rel_type, node2))
            if len(relationships) >= batch_size:
                # Write this batch of relationships in a single transaction
                tx = graph.begin()
                for r in relationships:
                    tx.create(r)
                tx.commit()
                relationships = []  # Clear the relationship batch after inserting
    except IndexError:
        continue

# Insert any remaining relationships
if relationships:
    tx = graph.begin()
    for r in relationships:
        tx.create(r)
    tx.commit()
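Once the import has finished, a quick sanity check (a small optional snippet, assuming the same graph object as above) is to count what actually ended up in the database:

# Verify how many Person nodes and relationships were created
print("Persons:", graph.evaluate("MATCH (p:Person) RETURN count(p)"))
print("Relationships:", graph.evaluate("MATCH (:Person)-[r]->(:Person) RETURN count(r)"))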
Key Optimizations:
- Batching node creation: nodes are now created in batches (with a batch size of 1000, which you can adjust based on your system's performance), instead of opening a new transaction for every single node.
- Batching relationship creation: relationships are likewise collected in a list and written to the database in batches to minimize transaction overhead (a Cypher-based variant of the same idea is sketched below).
- Efficient node lookup: NodeMatcher is used to find nodes by ID instead of graph.find_one(); combined with an index on ID, this stays fast even on a large dataset.
- Transaction management: graph.begin() followed by tx.commit() writes each batch of nodes and relationships in bulk, significantly speeding up the process.
- Handling missing targets: the script checks whether the target node exists before creating a relationship, preventing errors on references to missing IDs.
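If the per-object tx.create() calls are still a bottleneck, the batching idea can be taken one step further by sending each batch to the server as a single parameterized Cypher statement. The sketch below illustrates that technique rather than being part of the script above: it assumes the same graph object, that each batch has been prepared as a list of plain dictionaries instead of Node objects, and it hard-codes the Friends relationship type from the sample XML, since a relationship type cannot be passed as a Cypher parameter.

# Create a whole batch of Person nodes with one Cypher statement
def create_person_batch(graph, rows):
    # rows: list of dicts, e.g.
    # [{"ID": "XA123", "FirstName": "Adam", "LastName": "John", "Phone": "01-12322222"}, ...]
    graph.run("UNWIND $rows AS row CREATE (p:Person) SET p = row", rows=rows)

# Create a whole batch of Friends relationships, matching both endpoints by ID
def create_friends_batch(graph, pairs):
    # pairs: list of dicts, e.g. [{"src": "XA7777", "dst": "XA123"}, ...]
    graph.run(
        "UNWIND $pairs AS pair "
        "MATCH (a:Person {ID: pair.src}), (b:Person {ID: pair.dst}) "
        "CREATE (a)-[:Friends]->(b)",
        pairs=pairs)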
Additional Tips:
- Indexing: since the script looks up nodes by the ID property, create an index on :Person(ID) before the import so that each lookup does not have to scan every Person node (a py2neo sketch is shown after this list):
  CREATE INDEX ON :Person(ID);
- Parallel processing: if your hardware supports it and your dataset is large enough, you can parallelize the batch processing and insert nodes and relationships concurrently, for example with concurrent.futures or multiprocessing.
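For completeness, the index can also be created from Python before the import runs. This is a minimal sketch assuming the same graph object as in the script above and the Neo4j 3.x index syntax (Neo4j 4+ uses CREATE INDEX FOR (p:Person) ON (p.ID) instead):

# Create the index on :Person(ID) once, before loading the data
graph.run("CREATE INDEX ON :Person(ID)")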
This approach should significantly improve the speed of importing data into Neo4j from your XML file. Let me know how it works or if you need further assistance!