Data crunching: 150 mailing list text files to UTF8 MySQL database
$30-50 USD
Completed
Posted almost 19 years ago
Paid on delivery
Goal: A PHP script that converts about 150 text files, containing 10 years of posts from 2 newsgroups, into a single MySQL table in UTF-8 for use with a search engine. All files to convert will be provided at the start of the project.

Details: I'm trying to create a single MySQL table from 10 years of newsgroup postings. The postings are spread over 150 text files, each containing one month of posts. There are a few complications, such as mixed encodings and inconsistent post formats, as described in the deliverables.
## Deliverables
A PHP script that converts all .txt files in the same directory into a single MySQL table dump file with the following fields:

(1) sequential ID of post
(2) mailing list ID
(3) name of author
(4) email address of author
(5) date/time of post
(6) actual encoding of original post
(7) title of post
(8) full text of post
(9) full text of post with quoted text removed (for searching)

Issues:

(a) Because several mailing list systems were used, the separator between posts and the format of each post's headers differ from file to file. There are perhaps 5 such formats in total. As an example, some of the files needing conversion are here: [login to view URL]

(b) The posts are mostly in the SJIS encoding. However, several are in EUC or ISO 2022-JP. The _actual_ encoding of each post needs to be checked, and the post needs to be converted to UTF-8 before being stored in the database. This may be the trickiest part of the project, so make sure that you are comfortable with multi-byte Japanese encodings. For example, if you open the files found at the above website in a Web browser, some will only render properly when SJIS is selected as the encoding, and others only when ISO 2022-JP is selected. The actual encoding of each post needs to be determined and stored as field (6).

(c) All email addresses need to be obscured. For example, "someguy[at][login to view URL]" would need to be changed to "someguy[at]g...". This applies to the email address field (4) as well as both full-text fields (8) and (9). Note that [at] has been used here in place of the at sign, due to the RAC site restrictions.

(d) The dates and times of all posts need to be unified into the format used by MySQL, for sorting. This is stored in field (5).

(e) The full text field without quoted portions (9) is the same as the original text, but with all lines beginning with ">" removed, and all lines following a line with "----- Original Message -----" removed. You will need to be creative in finding a good way to remove these portions, but 95% is acceptable.
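For issue (b), one possible approach is sketched below. It assumes the mbstring extension is available; the function name `convert_post_to_utf8` and the candidate order are illustrative assumptions, not part of the brief. ISO-2022-JP is checked first through its escape sequences because it is 7-bit and unambiguous. Telling EUC-JP from SJIS is heuristic and can misfire on very short posts, so results should be spot-checked.

```php
<?php
// Hypothetical helper: guess the Japanese encoding of one raw post and
// return both the detected encoding (field 6) and the UTF-8 text.
function convert_post_to_utf8($raw)
{
    // ISO-2022-JP is 7-bit and switches character sets with ESC sequences
    // (ESC $ @, ESC $ B, ESC ( B, ESC ( J), so test for those first.
    if (preg_match('/\x1B\$[@B]|\x1B\([BJ]/', $raw)) {
        $encoding = 'ISO-2022-JP';
    } else {
        // Strict detection over the remaining candidates. The order is an
        // assumption: EUC-JP is listed before SJIS because EUC-JP bytes
        // are frequently also valid SJIS.
        $encoding = mb_detect_encoding(
            $raw,
            array('ASCII', 'UTF-8', 'EUC-JP', 'SJIS'),
            true
        );
        if ($encoding === false) {
            // Assumption: fall back to SJIS, the majority encoding.
            $encoding = 'SJIS';
        }
    }

    return array(
        'encoding' => $encoding,
        'text'     => mb_convert_encoding($raw, 'UTF-8', $encoding),
    );
}
```

The returned `encoding` value would go into field (6) and `text` would feed the remaining processing steps.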
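The obscuring rule in issue (c) could be implemented with a single regular expression, as in the minimal sketch below. The function name `obscure_emails` and the simplified address pattern are assumptions (the pattern is not a full RFC 5322 matcher), and `example.com` is only a placeholder domain.

```php
<?php
// Hypothetical helper: keep the local part and the first character of the
// domain, and replace the rest of the domain with "...".
function obscure_emails($text)
{
    // Simplified pattern (an assumption): local part, at sign, then a
    // domain made of letters, digits, dots, and hyphens.
    $pattern = '/([A-Za-z0-9._%+\-]+)@([A-Za-z0-9])[A-Za-z0-9.\-]*/';

    return preg_replace($pattern, '$1@$2...', $text);
}

// Usage: the same function can be applied to field (4) and to the
// full-text fields (8) and (9).
echo obscure_emails('Contact someguy@example.com for details'), "\n";
```

Because the domain pattern accepts dots, a sentence-ending period directly after an address is absorbed into the obscured part; that seemed an acceptable trade-off for a sketch.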
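For issue (d), PHP's `strtotime()` already understands most mail-style date headers, so a first pass might look like the sketch below. The function name `normalize_post_date` is hypothetical, and setting the default time zone to `Asia/Tokyo` is an assumption: it decides how dates without an explicit offset are read and in which zone the result is written. Header formats that `strtotime()` cannot parse would need their own handling.

```php
<?php
// Assumption: posts without an explicit UTC offset are in Japan time,
// and stored values should also be in Japan time.
date_default_timezone_set('Asia/Tokyo');

// Hypothetical helper: convert a date header into MySQL DATETIME format
// (YYYY-MM-DD HH:MM:SS), or return null when the header cannot be parsed.
function normalize_post_date($header)
{
    $timestamp = strtotime(trim($header));
    if ($timestamp === false) {
        return null; // leave unparseable dates for manual review
    }

    return date('Y-m-d H:i:s', $timestamp);
}

// Usage with an RFC 2822 style header value.
echo normalize_post_date('Tue, 04 Mar 2003 21:15:30 +0900'), "\n";
```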
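The two rules in issue (e) translate fairly directly into a line filter. The sketch below is one possible starting point, with a hypothetical function name `strip_quoted_text` and two labeled assumptions: the "----- Original Message -----" line itself is dropped along with everything after it, and ">" lines with leading whitespace also count as quotes.

```php
<?php
// Hypothetical helper: build field (9) from the UTF-8 text of field (8).
function strip_quoted_text($text)
{
    $kept = array();

    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        // Everything from the forwarded-message marker onward is dropped
        // (assumption: the marker line itself is removed too).
        if (strpos($line, '----- Original Message -----') !== false) {
            break;
        }
        // Skip quoted lines (assumption: leading whitespace is allowed).
        if (preg_match('/^\s*>/', $line)) {
            continue;
        }
        $kept[] = $line;
    }

    return implode("\n", $kept);
}
```

Other quoting styles that appear in the archives would need extra rules on top of this.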
## Platform
PHP 5