 StackOverflow data is available for free. To play with the XML data dump, you must insert it into MySQL database. This report shows you how.
StackOverflow data is available for free. To play with the XML data dump, you must insert it into MySQL database. This report shows you how.
This article is not for beginners. To follow this article, you should have the StackOverflow XML data dump. You should know MySQL server, SQL, Python, XML and Linux command line. You should be able to solve problems with these tools. IIRC: This report is written from memory.
Download and Uncompress to get posts.xml
Downloaded “Stack Exchange Data Dump – Sept 2011” with torrent.
Uncompressed StackOverflow
$ 7z e stackoverflow.com.7z.001
And selected “A(u)to rename all” when asked. Moved the files to a separate dir.
Convert XML to (my)SQL
Converted posts.xml to SQL with this excellent and short SAX parser. Nickname “no” claims it’s his “first python program ever“, which I doubt.
http://paste.org/8946
Here is the updated version:
xml2sql.py
#!/usr/bin/env python
# http://paste.org/8946 "no" 2009
# Updated by Tero Karvinen http://TeroKarvinen.com
import xml.sax.handler
import xml.sax
import sys
class SOHandler(xml.sax.handler.ContentHandler):
        def __init__(self):
                self.errParse = 0
        def startElement(self, name, attributes):
                if name != "row":
                        self.table = name;
                        self.outFile = open(name+".sql","w")
                        self.errfile = open(name+".err","w")
                else:
                        skip = 0
                        currentRow = u"insert into "+self.table+"("
                        for attr in attributes.keys():
                                currentRow += str(attr) + ","
                        currentRow = currentRow[:-1]
                        currentRow += u") values ("
                        for attr in attributes.keys():
                                try:
                                        currentRow += u'"{0}", \
'.format(attributes[attr].replace('\\','\\\\').replace('"', '\\"').replace("'", "\\'"))
                                except UnicodeEncodeError:
                                        self.errParse += 1;
                                        skip = 1;
                                        self.errfile.write(currentRow)
                        if skip != 1:
                                currentRow = currentRow[:-1]
                                currentRow += u");"
                                #print len(attributes.keys())
                                self.outFile.write(currentRow.encode("utf-8"))
                                self.outFile.write("")
                                self.outFile.flush()
                                print currentRow.encode("utf-8");
        def characters(self, data):
                pass
        def endElement(self, name):
                pass
if len(sys.argv) < 2:
        print "Give me an xml file argument!"
        sys.exit(1)
parser = xml.sax.make_parser()
handler = SOHandler()
parser.setContentHandler(handler)
Had to replace backslashes “\” with double backslashes “\\”. Added shebang “#!/usr/bin/env python” and made runnable ‘chmod u+x xml2sql.py’.
$ ./xml2sql.py posts.xml
This created posts.sql. Running the command took a while.
Could also use time, nice, ionice and pv and pipe output to /dev/null.
Create a New Database and User
Changed MySQL root password to one I can remember and logged in as database root.
$ sudo dpkg-reconfigure mysql-server-5.1
$ mysql -u root -p
Created a new user, a database with the same name and granted all priviledges.
mysql> create database sotero; mysql> grant all on sotero.* to sotero@localhost identified by "NoThisIsNotMyRealPassword_u7ohkA"
Tested
$ mysql -u sotero -p mysql> use sotero; Database changed mysql> exit;
Automated login
$ nano /home/tero/.my.cnf
[client] user = sotero database = sotero password= NoThisIsNotMyRealPassword_u7ohkA
Now ‘mysql’ logged to database without questions.
Inserting SQL into Live Database
First, we have to create the tables (mysql.sql from soddi), use the correct database (before-posts.sql). Then we can insert the data (posts.sql).
Here are the helper files
$ cat before-posts.sql
use sotero;
mysql.sql is from soddi, with minor modifications. (soddi/trunk/soddi/Loaders/Scripts/mysql.sql)
$ hg clone https://bitbucket.org/bitpusher/soddi
$ cp ../../soddi/trunk/soddi/Loaders/Scripts/mysql.sql .
$  perl -p -i -e ‘s/DUMMY/sotero/g’ mysql.sql
$  perl -p -i -e ‘s/VoteTypes/votetypes/g’ mysql.sql
$  perl -p -i -e ‘s/PostTypes/posttypes/g’ mysql.sql
Then I run the actual insertion command
$ cat mysql.sql before-posts.sql posts.sql |mysql
This took a long time. I’m still waiting, and ‘SELECT COUNT(id) FROM posts;’ says over five million. But we can keep playing with the data while we wait.
Playing with the data
mysql> SELECT id, title, score FROM posts LIMIT 3; +----+-------------------------------------------------------------------------------------+-------+ | id | title | score | +----+-------------------------------------------------------------------------------------+-------+ | 4 | When setting a form's opacity should I use a decimal or double? | 83 | | 6 | Why doesn't the percentage width child in absolutely positioned parent work in IE7? | 20 | | 7 | NULL | 84 | +----+-------------------------------------------------------------------------------------+-------+ 3 rows in set (0.00 sec) mysql> desc posts; +-----------------------+--------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-----------------------+--------------+------+-----+---------+-------+ | Id | int(11) | NO | PRI | NULL | | | AcceptedAnswerId | int(11) | YES | | NULL | | | AnswerCount | int(11) | YES | | NULL | | | Body | longtext | NO | | NULL | | | ClosedDate | datetime | YES | | NULL | | | CommentCount | int(11) | YES | | NULL | | | CommunityOwnedDate | datetime | YES | | NULL | | | CreationDate | datetime | NO | | NULL | | | FavoriteCount | int(11) | YES | | NULL | | | LastActivityDate | datetime | NO | | NULL | | | LastEditDate | datetime | YES | | NULL | | | LastEditorDisplayName | varchar(40) | YES | | NULL | | | LastEditorUserId | int(11) | YES | | NULL | | | OwnerUserId | int(11) | YES | | NULL | | | ParentId | int(11) | YES | | NULL | | | PostTypeId | int(11) | NO | | NULL | | | Score | int(11) | NO | | NULL | | | Tags | varchar(150) | YES | | NULL | | | Title | varchar(250) | YES | | NULL | | | ViewCount | int(11) | NO | | NULL | | +-----------------------+--------------+------+-----+---------+-------+ 20 rows in set (0.00 sec)
Enjoy Your Data!
Even though the insert is still running, we can start honing our queries. It’s much nicer to practice SQL with real data than inserting ten records by hand.
Hi
Thanks a lot for the tutorial. I tried to follow it, but when I execute the last command, I get the following error:
ERROR 1064 (42000) at line 191: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘)’ at line 1
Do you have any idea?