Loading data to neo4j from XML using py2neo

ghz · yesterday · 1 view

I'm trying to load data into a Neo4j database from an XML file using py2neo.

This Python script works, but it's too slow: I add all the nodes first and then the relationships, using two exception handlers. On top of that, the XML file is around 200 MB.

Is there a faster way to perform this task?

XML file:

<Persons>
    <person>
        <id>XA123</id>
        <first_name>Adam</first_name>
        <last_name>John</last_name>
        <phone>01-12322222</phone>
    </person>
    <person>
        <id>XA7777</id>
        <first_name>Anna</first_name>
        <last_name>Watson</last_name>
        <relationship>
            <type>Friends</type>
            <to>XA123</to>
        </relationship>
    </person>
</Persons>

python script:

#!/usr/bin/python3

from xml.dom import minidom
from py2neo import Graph, Node, Relationship, authenticate


graph = Graph("http://localhost:7474/db/data/")
authenticate("localhost:7474", "neo4j", "admin")

xml_file = open("data.xml")
xml_doc = minidom.parse(xml_file)
persons = xml_doc.getElementsByTagName('person')

# Adding Nodes
for person in persons:
    ID_ = person.getElementsByTagName('id')[0].firstChild.data
    fName = person.getElementsByTagName('first_name')[0].firstChild.data
    lName = person.getElementsByTagName('last_name')[0].firstChild.data

    # not every person has phone number
    try:
        phone = person.getElementsByTagName('phone')[0].firstChild.data
    except IndexError:
        phone = "None"

    label = "Person"
    node = Node(label, ID=ID_, FirstName=fName, LastName=lName, Phone=phone)
    graph.create(node)


# Adding Relationships
for person in persons:
    ID_ = person.getElementsByTagName('id')[0].firstChild.data

    label = "Person"
    node1 = graph.find_one(label, property_key="ID", property_value=ID_)

    # relationships
    try:
        has_relations = person.getElementsByTagName('relationship')
        for relation in has_relations:
            node2 = graph.find_one(label,
                                   property_key="ID",
                                   property_value=relation.getElementsByTagName('to')[0].firstChild.data)

            relationship = Relationship(node1,
                                        relation.getElementsByTagName('type')[0].firstChild.data, node2)
            graph.create(relationship)
    except IndexError:
        continue

Answer

To speed up the process of importing data from your XML file into a Neo4j database, there are several optimizations you can consider. The main bottlenecks in your script are:

  1. Multiple Cycles of Database Interaction: You’re creating nodes first, then making a separate pass to create relationships. Each database interaction is slow, especially when done in separate operations.
  2. Repeated Search for Existing Nodes: The graph.find_one() method is slow when querying nodes individually in a loop.

Here’s how we can address these issues:

1. Batching Operations:

Instead of performing individual graph.create() calls for each node and relationship, you can batch your operations using transactional batches. This significantly improves performance, as Neo4j is optimized for batch inserts.
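The batching pattern itself is independent of Neo4j: accumulate items in a list and flush once the list reaches a threshold. A minimal sketch of such a chunking helper (the name `batched` is ours, not a py2neo API):

```python
def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch

# Each yielded batch would then be written inside one py2neo transaction:
#     tx = graph.begin()
#     for node in batch:
#         tx.create(node)
#     tx.commit()
```

The script below inlines this pattern rather than using a helper, but the flush-on-threshold logic is the same.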

2. Using py2neo's NodeMatcher for Fast Lookups:

Instead of repeatedly calling graph.find_one(), you can use NodeMatcher (available in py2neo v4 and later, where it replaces find_one()). It is optimized for looking up nodes by properties, particularly when you're dealing with large datasets.

3. Transaction Handling:

By batching nodes and relationships in transactions, you reduce the overhead caused by multiple commits to the database.

Optimized Python Script:

#!/usr/bin/python3

from xml.dom import minidom
from py2neo import Graph, Node, Relationship, NodeMatcher

# py2neo v4+ removed authenticate(); pass credentials to Graph directly
graph = Graph("bolt://localhost:7687", auth=("neo4j", "admin"))

xml_doc = minidom.parse("data.xml")  # minidom accepts a filename directly
persons = xml_doc.getElementsByTagName('person')

# Use NodeMatcher for efficient lookup
matcher = NodeMatcher(graph)

# Use a batch to collect nodes and relationships for bulk creation
batch_size = 1000  # Adjust based on your system's memory and performance
nodes = []
relationships = []

# Add Nodes in Batch
for person in persons:
    ID_ = person.getElementsByTagName('id')[0].firstChild.data
    fName = person.getElementsByTagName('first_name')[0].firstChild.data
    lName = person.getElementsByTagName('last_name')[0].firstChild.data

    try:
        phone = person.getElementsByTagName('phone')[0].firstChild.data
    except IndexError:
        phone = "None"

    # Create the node and store it in the list
    node = Node("Person", ID=ID_, FirstName=fName, LastName=lName, Phone=phone)
    nodes.append(node)

    if len(nodes) >= batch_size:
        # Write the whole batch in a single transaction
        tx = graph.begin()
        for n in nodes:
            tx.create(n)
        tx.commit()
        nodes = []  # Clear nodes batch after inserting

# Insert any remaining nodes in the batch
if nodes:
    tx = graph.begin()
    for n in nodes:
        tx.create(n)
    tx.commit()

# Add Relationships in Batch
for person in persons:
    ID_ = person.getElementsByTagName('id')[0].firstChild.data
    node1 = matcher.match("Person", ID=ID_).first()  # Faster lookup

    # Process relationships
    try:
        has_relations = person.getElementsByTagName('relationship')
        for relation in has_relations:
            to_id = relation.getElementsByTagName('to')[0].firstChild.data
            node2 = matcher.match("Person", ID=to_id).first()

            if node1 and node2:  # Skip if either endpoint is missing
                relationship = Relationship(node1,
                                            relation.getElementsByTagName('type')[0].firstChild.data,
                                            node2)
                relationships.append(relationship)

                if len(relationships) >= batch_size:
                    # Write the whole batch in a single transaction
                    tx = graph.begin()
                    for r in relationships:
                        tx.create(r)
                    tx.commit()
                    relationships = []  # Clear relationships batch after inserting
    except IndexError:
        continue

# Insert any remaining relationships in the batch
if relationships:
    tx = graph.begin()
    for r in relationships:
        tx.create(r)
    tx.commit()
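If you are comfortable with raw Cypher, a further speed-up is to skip the object layer entirely and send each batch as parameters to a single UNWIND statement, so the server creates the whole batch in one round trip. A sketch (the `person_rows` helper is ours; `graph.run` is py2neo's standard query method, and MERGE on ID also deduplicates on re-runs):

```python
from xml.dom import minidom

def person_rows(persons):
    """Flatten minidom <person> elements into plain dicts for Cypher parameters."""
    rows = []
    for person in persons:
        def text(tag):
            found = person.getElementsByTagName(tag)
            return found[0].firstChild.data if found else None
        rows.append({
            "id": text("id"),
            "first_name": text("first_name"),
            "last_name": text("last_name"),
            "phone": text("phone"),
        })
    return rows

CREATE_PERSONS = """
UNWIND $rows AS row
MERGE (p:Person {ID: row.id})
SET p.FirstName = row.first_name,
    p.LastName  = row.last_name,
    p.Phone     = row.phone
"""

# One round trip per batch instead of one per node:
# graph.run(CREATE_PERSONS, rows=person_rows(persons))
```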

Key Optimizations:

  1. Batching Node Creations: Nodes are now created in batches (with a batch size of 1000, which you can adjust based on your system's performance). This avoids repeatedly opening a new transaction for every node.

  2. Batching Relationship Creations: Similarly, relationships are collected in a list and written to the database in batches to minimize transaction overhead.

  3. Efficient Node Lookup: We use NodeMatcher to quickly find nodes by ID rather than searching for nodes repeatedly with graph.find_one(). This is faster, especially when you have a large dataset.

  4. Transaction Management: Using graph.begin() to create a transaction ensures that the nodes and relationships are written in bulk, significantly speeding up the process.

  5. Handling Missing Relationships: The script checks if a target node exists before attempting to create a relationship, preventing errors and ensuring more efficient processing.

Additional Tips:

  • Indexing: If you frequently look up nodes by a property such as ID, create an index on it; without an index, every lookup scans all Person nodes. On Neo4j 3.x the syntax is:

    CREATE INDEX ON :Person(ID);

    On Neo4j 4 and later, the equivalent is CREATE INDEX FOR (p:Person) ON (p.ID).
  • Parallel Processing: If your hardware supports it and your dataset is large enough, consider parallelizing the batch processing to insert nodes and relationships concurrently. You can use libraries like concurrent.futures or multiprocessing to run parallel tasks.
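A sketch of chunk-level parallelism with concurrent.futures (the `write_batch` argument stands in for whatever per-batch insert function you use; note that concurrent writes touching the same nodes can contend for locks, so it is safest to parallelize node creation only):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_insert(batches, write_batch, max_workers=4):
    """Apply write_batch to each batch concurrently; returns per-batch results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(write_batch, batches))
```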

This approach should significantly improve the speed of importing data into Neo4j from your XML file. Let me know how it works or if you need further assistance!