On Sat, Apr 2, 2011 at 12:47 AM, Ted Flethuseo <flethuseo / gmail.com> wrote:
> Hi everyone,
>
> I need to build 3 relational tables from an xml text. In these tables, I
> need to keep track of words that have the <emph> and <bold> tags in them
> along with the word mentioned and its count in the <p> tag. This is
> easier to illustrate with an example:
>
> I need to take this text:
>
> <p> My name is <strong>Ted</strong>, and I like <emph>coffee</emph>.
> <strong>Ted</strong> does not like tea. </p>
> <p> I have a brother who likes <emph>tea</emph> but does not like
> <emph>coffee</emph> </p>
>
> To 3 normalized tables like this:
>
> ...p_table...
> p_id    desc
> 1       My name is....
> 2       I have a ....
>
>
> ...p_to_emph_table...
> p_id    e_id    count
> 1       2       1
> 2       1       1
> 2       2       1
>
>
> ...emph_table...
> e_id    emph_word
> 1       Tea
> 2       Coffee
>
> I am not sure what would be the best approach to parse this xml with
> ruby or what tool
> could help me do this efficiently?

What I'd do is parse the XML (use Nokogiri, for example) and select all
the p elements. For each p element, insert it into p_table if it is not
already present and get its id. Then look at every emph inside that p
element, and for each of them:
- If the word is already in emph_table, get its id
- Otherwise, insert it into emph_table and get the new id

With that id, insert a row into p_to_emph_table for that (p id, word id)
pair with a count of 1, or increment the count if the row already exists.
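
The steps above could be sketched roughly like this. It is only a minimal
sketch under a few assumptions of mine: it uses REXML (which ships with
Ruby) instead of Nokogiri so that it runs without installing a gem, it
keeps the three tables in plain in-memory hashes rather than a real
database, it wraps your sample in a made-up <doc> root because XML needs a
single root element, and full_text is a helper name I invented. The ids
are handed out in order of first appearance, so coffee gets 1 and tea gets
2, which is the reverse of your example.

```ruby
require 'rexml/document'

# Sample input from the question, wrapped in a hypothetical <doc> root
# because an XML document needs exactly one root element.
xml = <<~XML
  <doc>
  <p> My name is <strong>Ted</strong>, and I like <emph>coffee</emph>.
  <strong>Ted</strong> does not like tea. </p>
  <p> I have a brother who likes <emph>tea</emph> but does not like
  <emph>coffee</emph> </p>
  </doc>
XML

# Hypothetical helper: returns the text of an element together with the
# text of all its descendants (comments and other node types are skipped).
def full_text(element)
  element.children.map { |node|
    case node
    when REXML::Element then full_text(node)
    when REXML::Text then node.value
    else ''
    end
  }.join
end

# In-memory stand-ins for the three tables (an assumption, not a real DB).
p_table    = {}            # p_id => paragraph text
emph_table = {}            # emph word => e_id
p_to_emph  = Hash.new(0)   # [p_id, e_id] => count

doc = REXML::Document.new(xml)
doc.elements.each('//p') do |p_el|
  # Insert the paragraph and get its id (whitespace is collapsed).
  p_id = p_table.size + 1
  p_table[p_id] = full_text(p_el).split.join(' ')

  p_el.elements.each('.//emph') do |emph_el|
    word = full_text(emph_el).strip.downcase
    # Reuse the existing id, or assign the next free one for a new word.
    e_id = (emph_table[word] ||= emph_table.size + 1)
    # Insert-or-update: a missing row starts at 0 and is incremented.
    p_to_emph[[p_id, e_id]] += 1
  end
end

p_table.each    { |id, text| puts "#{id}\t#{text}" }
emph_table.each { |word, id| puts "#{id}\t#{word}" }
p_to_emph.each  { |(p_id, e_id), count| puts "#{p_id}\t#{e_id}\t#{count}" }
```

With a real database, each hash lookup becomes a SELECT followed by an
INSERT or UPDATE, but the flow stays the same. This sketch inserts every p
element as a new row; add a lookup on the text first if the same paragraph
can show up more than once.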

This is a straightforward approach that should work. Give it a try (ask
about anything that blocks you) and let us know how it goes.

Jesus.