
    How-to Handle Large XML Files in PHP

    Sooner or later, when you are processing, importing and ingesting data from large suppliers such as Amazon, iTunes Music Store and Virgin Megastore, to name but three, you are likely to hit a scenario where you need to handle, process, import and output XML files that are several hundred megabytes or even gigabytes in size. Over the past year, doing just that, I have come up with some tips and tools of my own for actually processing the suckers, and here I try to document some of them for the benefit of others.

    Splitting the XML into Smaller Chunk Files

    After preparing the XML you are ready to start processing it. PHP has horrendous memory usage skills, so it is hardly the best idea to read the whole file into memory and parse it using the built-in tree-based XML parsers. To avoid having to switch to Perl's stream-based parsers, I devised an ingenious means of splitting the XML file into variable-sized chunks, which makes it much easier both to process the XML with PHP and to stop and restart the process.

    The process of chunking the file essentially takes advantage of the line breaks in the XML and reads through it line by line, performing certain actions as it reaches certain parts of the XML. For one, it increments a counter at the end of each record, so that when the counter hits a designated number it can close the current chunk file and begin a new one. The end result is a decent number of bite-sized XML files that are well-formed and readable by any parser.

    To start with just initialise all the required variables:

    PHP:
    <?php

    // initialize vars
    $begin     = time(); // script start time
    $start     = time(); // last gate time
    $interval  = time(); // current gate time
    $minutes   = 0;      // whole minutes elapsed so far
    $filenum   = 1;      // start chunk file number at 1
    $recordnum = 1;      // start at record 1

    // file settings
    $basefilename     = "";   // the base file name for the chunks
    $xmlfile          = "";   // the XML file name to be processed
    $xmldatadelimiter = "";   // core data delimiter: name of the root element that wraps all records
    $xmlitemdelimiter = "";   // record delimiter: name of the element that wraps a single record
    $chunksize        = 1000; // number of records in each chunk file (example value, adjust to suit)
    $xmlstring  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    $xmlstring .= "<$xmldatadelimiter>\n"; // XML chunk file header

    // dirs and files
    $dir = "/path/to/directory"; // path to where splits will be stored
    $exportfile = "$dir/splits/$basefilename-$filenum.xml";
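
    To make the two delimiter settings concrete, here is a minimal sketch that assumes a hypothetical feed whose root element is <items> and whose records are <item> elements. The element names, the sample data and the chunk size are illustrative assumptions, not values from any real supplier feed.

```php
<?php
// Hypothetical feed layout, held in a string purely for illustration.
$sample = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>
<city>New York</city>
</item>
</items>
XML;

$xmldatadelimiter = "items"; // core data delimiter: the root element around all records
$xmlitemdelimiter = "item";  // record delimiter: the element around one record
$chunksize        = 1000;    // records per chunk file (arbitrary example value)

// Header written at the top of every chunk file after the first one.
$xmlstring  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
$xmlstring .= "<$xmldatadelimiter>\n";

// Closing tags the splitter looks for while reading line by line.
$recordend = "</$xmlitemdelimiter>";
$dataend   = "</$xmldatadelimiter>";
```

    Both settings hold bare element names without angle brackets, because the script builds the opening and closing tags around them itself.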

    Once this is done you are ready to start processing the initial XML file:

    PHP:
    // start processing
    echo "Processing ($dir/$xmlfile)\n";

    $handle = @fopen("$dir/$xmlfile", "r");

    if ($handle) {

    while (!feof($handle)) {

    // read up to the next line break (at most 4095 bytes per call)
    $buffer = fgets($handle, 4096);

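    This essentially loops through the file line by line, which means the XML file needs a line break at the end of each element, or at the very least after the closing tag of each record. Note that fgets() with a length of 4096 returns at most 4095 bytes per call, so a file without line breaks arrives in arbitrary pieces: closing tags can be cut in half, or several records can share one buffer, and the record count goes wrong. Drop the length argument and fgets() reads up to the next line break, at which point a file with no line breaks really will be read into memory whole by PHP!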
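
    If a feed arrives without line breaks, one option is a streaming pre-pass that adds them. The following is a minimal sketch under stated assumptions: the function name add_line_breaks() and its parameters are hypothetical, it works on already opened stream handles, and it inserts a newline after every closing record tag while reading fixed-size blocks, so memory use stays flat. Where a line break already follows the tag, it simply produces an extra blank line.

```php
<?php
/**
 * Copy $in to $out, adding a newline after every occurrence of $closetag.
 * Hypothetical helper: reads fixed-size blocks rather than whole lines.
 */
function add_line_breaks($in, $out, string $closetag, int $blocksize = 8192): void
{
    $carry = "";
    // Longest piece of a tag that can be left dangling at the end of a block.
    $keep = strlen($closetag) - 1;

    while (($block = fread($in, $blocksize)) !== false && $block !== "") {
        // Prepend the held-back tail so tags split across blocks are matched.
        $data = str_replace($closetag, $closetag . "\n", $carry . $block);

        // Write everything except the last $keep bytes, which may be
        // the start of a tag that completes in the next block.
        $cut = max(0, strlen($data) - $keep);
        fwrite($out, (string) substr($data, 0, $cut));
        $carry = (string) substr($data, $cut);
    }

    // Flush whatever is still held back at end of file.
    fwrite($out, $carry);
}
```

    Holding back one byte fewer than the tag length is enough here, because a closing tag contains only one "<" and so cannot overlap with itself.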

    Now that you have the current line held in $buffer, you can process its contents and act in any or all of three possible ways:

    1. increment the record number for the item
    2. write the line to the chunk file
    3. increment the chunk file number, close the XML and start on a new export file

    And so each of these actions can be achieved as follows:

    PHP:
    // if the closing tag of a record is reached,
    // increment the record number iterator
    // (stripos() stands in for eregi(), which was removed in PHP 7)
    if (stripos($buffer, "</$xmlitemdelimiter>") !== false) {
        $recordnum++;
    }

    PHP:
    // write line to chunk file
    // (message type 3 makes error_log() append to the given file)
    error_log($buffer, 3, $exportfile);

    PHP:
    // if chunk limit reached then
    // close the file with well-formed XML
    if ($recordnum > $chunksize) {

        // write the closing tag of the root element
        error_log("</$xmldatadelimiter>\n", 3, $exportfile);

        // reset the record counter for the new chunk file
        // (to 1, matching its initial value, so every chunk holds $chunksize records)
        $recordnum = 1;

        // and increment the file number to start a new chunk file
        $filenum++;

        // update export file name
        $exportfile = "$dir/splits/$basefilename-$filenum.xml";

        // echo status report to STDOUT
        echo "Starting segment $filenum after " . ($chunksize * ($filenum - 1)) . " records.\n";

        // write new chunk XML file header
        error_log($xmlstring, 3, $exportfile);

    }

    After this, add some time logging and script-killing code to (1) keep track of how long the split is taking and (2) kill the script in the event of it running out of control.

    PHP:
    // put in a catch so that the script doesn't run riot
    // and will die after X number of chunk files
    if ($filenum > 5000) {
        die();
    }

    // report progress roughly once a minute
    if (($interval - $start) > 60) {
        $minutes = (int) floor((time() - $begin) / 60);
        echo $minutes . " minutes so far.\n";
        $start = time();
    } else {
        $interval = time();
    }

    Then just close the file handle for the original XML file, and then as part of the initial check to see if the XML file exists, add an error message to the else clause:

    PHP:
    }
    fclose($handle);
    } else {
    echo "Unable to open file! ($dir/$xmlfile)\n";
    }

    To finish up the process just echo out some time based stats:

    PHP:
    $procend = time();

    echo "\n####\n";
    echo "Split Complete (" . floor(($procend - $begin) / 60) . " Minutes)\n";

    ?>
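
    Pulled together, the fragments above amount to something like the following. This is a minimal sketch rather than the script exactly as listed: the function name split_xml() and its parameters are hypothetical, it writes with fopen() and fwrite() instead of error_log(), it leaves out the timing and kill-switch code, and it assumes the closing tag of each record sits on a line of its own.

```php
<?php
/**
 * Split $xmlfile into chunk files of $chunksize records each.
 * Returns the number of chunk files written. Hypothetical helper.
 */
function split_xml(string $xmlfile, string $outdir, string $base,
                   string $root, string $item, int $chunksize): int
{
    // Header for every chunk after the first; the first chunk
    // inherits the header lines of the original file.
    $header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<$root>\n";

    $in = fopen($xmlfile, "r");
    if ($in === false) {
        throw new RuntimeException("Unable to open $xmlfile");
    }

    $filenum = 1;
    $records = 0;
    $out = fopen("$outdir/$base-$filenum.xml", "w");
    if ($out === false) {
        throw new RuntimeException("Unable to write to $outdir");
    }

    while (($line = fgets($in)) !== false) {
        fwrite($out, $line);

        // Only the closing tag of a record advances the counter.
        if (stripos($line, "</$item>") === false) {
            continue;
        }
        $records++;

        if ($records >= $chunksize) {
            // Close this chunk as well-formed XML and open the next one.
            fwrite($out, "</$root>\n");
            fclose($out);
            $filenum++;
            $records = 0;
            $out = fopen("$outdir/$base-$filenum.xml", "w");
            fwrite($out, $header);
        }
    }

    // The last chunk is closed by the root closing tag copied from the original.
    fclose($out);
    fclose($in);

    return $filenum;
}
```

    One quirk carries over from the line-by-line approach: if the number of records is an exact multiple of the chunk size, the final chunk file contains only an empty root element.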

    Don't Eat With Your Mouth Full!

    And so in turn, once you have these files, you can use a PHP script to read the contents of the directory containing the split files and then read them into whatever parser you choose, be it something like SimpleXML or a custom PHP XML handler.
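
    As a rough illustration of that last step, the sketch below walks a directory of chunk files and hands each record to a callback. The function name process_chunks(), its parameters and the callback are hypothetical, and it assumes the SimpleXML extension is available and that the chunk files end in .xml.

```php
<?php
/**
 * Parse every chunk file in $splitdir in turn and pass each record to $callback.
 * Only one chunk is held in memory at a time. Returns the number of records seen.
 */
function process_chunks(string $splitdir, string $item, callable $callback): int
{
    $files = glob("$splitdir/*.xml") ?: [];
    natsort($files); // natural order, so chunk-2 comes before chunk-10

    $count = 0;
    foreach ($files as $file) {
        $xml = simplexml_load_file($file);
        if ($xml === false) {
            echo "Skipping unparsable chunk $file\n";
            continue;
        }

        // Each child element named $item is one record.
        foreach ($xml->{$item} as $record) {
            $callback($record);
            $count++;
        }

        unset($xml); // release the tree before loading the next chunk
    }

    return $count;
}
```

    The callback is where a database insert or any other per-record work would go, and the chunk files can be deleted once they have been processed.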

    Find out more over at the PHP documentation site


    8 Responses

    1. Travis Ballard:

      i just wanted to thank you for this. it helped me quite a bit with a project i just finished. at first i tried just using simplexml to read a 400mb xml file and well, php wasn’t having that. now it may create 20k files but it deletes them when it’s done and everything is working great. thanks!

    2. Jim Chan:

      Hey, thanks for sharing the great piece of code. I just have a question and hope you could please address it. I think I have misconfigured the file.

      Here is an example for items: which one is the core data delimiter and which one the record delimiter?

      New York
      Yes

      New York
      Yes

      Thanks again!
      -Jim

    3. webdesign:

      Great tutorial .. you missed a lot of syntax, but when I finally got it running, no more big file errors :D

    4. cranbow:

      Thanks for the tips. This helped me a lot. I cleaned up the code a bit for my purposes, but it worked very well.

    5. Vincent:

      No problem, glad the notion and technique is of use as much as the code. No doubt I would refactor the code if I needed it again :) Best, Vincent -

    6. Christian Weiske:

      You are circumventing the problem here. PHP has much more elegant ways to deal with huge XML files. I described one in a blog entry:
      http://cweiske.de/tagebuch/Importing%20huge%20XML%20files%20using%20PHP5%20-%20efficiently%20and%20conveniently.htm

    7. Vincent Roman:

      Christian –

      Nice post and an elegant solution. Alas in my real world scenario and based on restrictions, I couldn’t use it.

      Yours is one way of many, and thanks so much for sharing :)

      Best, Vincent -

    8. Francisco:

      Hello,

      I have the following code below and would like to know how to write a loop that reads several XML files in a directory and saves them to the database one by one.

      session_start();
      include('inc/head.php');
      include('banco.php');

      $xml = array();
      $xml_elem = null;
      $contador = 1;

      function startElement($parser, $nome, $attrs) {
          global $xml, $xml_elem;
          $xml_elem = $nome;
      }

      function endElement($parser, $nome) {
          global $xml_elem;
          if ($nome == 'LIVRO') { echo 'contador = ' . $contador++; }
          $xml_elem = null;
      }

      function dadosXML($parser, $texto) {
          global $xml, $xml_elem, $contador;
          if ($xml_elem == 'ID' || $xml_elem == 'TITULO' || $xml_elem == 'AUTOR' || $xml_elem == 'DATA') {
              $xml[$contador][$xml_elem] = $texto;
              if ($xml_elem == 'DATA') { $contador++; }
          }
      }

      $parser = xml_parser_create();
      xml_set_element_handler($parser, "startElement", "endElement");
      xml_set_character_data_handler($parser, "dadosXML");
      $file = fopen('livros.xml', 'r');
      while ($dados = fread($file, 4096)) {
          xml_parse($parser, $dados);
          echo fgets($file) . 'XML read test.';
      }
      xml_parser_free($parser);
      $pLinha = '';

      if ($open) {
          inicializaProcesso($open);
          $ID = ""; $TITULO = ""; $AUTOR = ""; $DATA = "";
          foreach ($xml as $contXml) {
              echo 'ID = ' . $contXml['ID'];
              echo 'TITULO = ' . $contXml['TITULO'];
              echo 'AUTOR = ' . $contXml['AUTOR'];
              echo 'DATA = ' . $contXml['DATA'];
              $sql = "INSERT INTO FVS_lib(id,titulo,autor,data)
              VALUES (" . $contXml['ID'] . ",'" . $contXml['TITULO'] . "','" . $contXml['AUTOR'] . "','" . $contXml['DATA'] . "');";
              echo 'passed through here' . $contXml['ID'];
              $query = ifx_prepare($sql, $open);
              $result = ifx_do($query);
              if (!$result) {
                  echo "No data was saved";
                  exit;
              } else { echo "Data saved successfully"; }

          }
          finalizaProcesso($open);
          echo "DB CONNECTION SUCCESSFUL";
      } else { echo 'NOT CONNECTED'; }

      foreach ($xml as $contXml) {
          echo $pLinha . "" .
              $contXml['ID'] . "\n" .
              $contXml['TITULO'] . "" .
              $contXml['AUTOR'] . "" .
              $contXml['DATA'] . "";
          'DATA' . "";
      }

      function stringSql() {
          global $xml;
          foreach ($xml as $contXml) {
              $sql = "INSERT INTO FVS_lib(id,titulo,autor,data)
              VALUES (" . $contXml['ID'] . "," . $contXml['TITULO'] . "," . $contXml['AUTOR'] . "," . $contXml['DATA'] . ");";
              //echo 'passed through here' . $contXml['ID'];
              $query = ifx_prepare($sql, $conexao);
              $result = ifx_do($query);

          }
          return $sql;
      }

      function insertCont($conexao) {
          $id = 12; $titulo = "rteterter"; $autor = "yereryre"; $data = "reyeewre";
          $query = ifx_prepare("SELECT * FROM FVS_lib", $conexao);
          $result = ifx_do($query);
          while ($arr1 = ifx_fetch_row($query)) {
              $id = $arr1['id'];
              $titulo = $arr1['titulo'];
              $autor = $arr1['autor'];
              $data = $arr1['data'];
              echo "alert('Saved successfully!');";
          }
          if ($id == 0) {
              $exist .= "No record found.";
              echo $exist;
          }
          $printResult = imprimeInf($id, $titulo, $autor, $data);
          echo $printResult . "";
          echo $mensagem . '-----------------';

      }

      function imprimeInf($id, $titulo, $autor, $data) {
          $string .= "ID : $id ";
          $string .= "Title : $titulo";
          $string .= "Author : $autor";
          $string .= "Date : $data";
          return $string;

      }
