The text is never extracted from any particular format. If you train
only on HTML files and then analyze HTML files, the HTML tags are
treated just like normal words. They simply don't show up in the
results, because they occur about equally often in the training texts
and in the analyzed text.
The algorithm is very simple and makes absolutely no assumptions about
the input.
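
To illustrate why the tags drop out, here is a minimal sketch. It
assumes the results are ranked by how much more frequent a token is in
the analyzed text than in the training text. The function names, the
whitespace tokenizer, and the add-one smoothing are all assumptions made
for this example, not the actual implementation.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: split on whitespace only, so markup such
    # as "<p>" stays a token just like any ordinary word.
    return text.lower().split()


def distinctive_tokens(training_text, analyzed_text, top_n=3):
    # Rank tokens by the ratio of their relative frequency in the
    # analyzed text to their relative frequency in the training text.
    train = Counter(tokenize(training_text))
    target = Counter(tokenize(analyzed_text))
    train_total = sum(train.values())
    target_total = sum(target.values())
    vocab_size = len(set(train) | set(target))

    scores = {}
    for token, count in target.items():
        # Add-one smoothing (an assumption) avoids division by zero
        # for tokens that never occur in the training text.
        p_target = (count + 1) / (target_total + vocab_size)
        p_train = (train[token] + 1) / (train_total + vocab_size)
        scores[token] = p_target / p_train

    # Highest ratio first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


training = "<p> the cat sat </p> <p> the dog ran </p>"
analyzed = "<p> the parser crashed </p> <p> the parser hung </p>"
print(distinctive_tokens(training, analyzed))
```

In this toy input the tags and the word "the" occur equally often on
both sides, so their ratio is 1 and they rank below the words that are
specific to the analyzed text.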