Re: [tor-talk] Rewriting a text to avoid stylometry

Thus spake NoName (antispam06@xxxxxxx):

> On 17.05.2013 23:26, Andrew Lewman wrote:
> >On Fri, 17 May 2013 21:54:40 +0200
> >NoName <antispam06@xxxxxxx> wrote:
> >
> >>I have been reading lately about the ability to fingerprint a user
> >>based on the particularities of writing. Each person has a
> >>preference for certain words, makes certain spelling mistakes, and
> >>so on. And the more text the better for the machine to identify the
> >>writer.
> >
> >See https://psal.cs.drexel.edu/index.php/JStylo-Anonymouth and
> >https://events.ccc.de/congress/2011/Fahrplan/events/4781.de.html
> Thank you! But if you have more, post it.

The above research is a great starting point (and comes with some open
source tools you can try out, albeit they are a bit slow), but this is a
very hard problem because language provides many, many ways for style
variations to differentiate people. Audio is of course even worse.

Conversely, while stylometry attacks are scary in theory, in
practice they tend to fall apart when thrown against "suspect lists"
even as large as tor-talk (I believe the current state-of-the-art is
O(100) suspects). This is a reflection of the difficulty in identifying
population-bisecting features that actually work in the general sense
without introducing false positives due to the natural tendency for
people to share and imitate elements of writing style. At least when it
comes to written text, the adversary really needs to start with a short
list of suspects using prior knowledge.

In both cases, what we really need are solid metrics to rank the
contribution of features to classification accuracy, so we can choose
the language features to obfuscate first.
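
As a rough illustration of what such a ranking might look like, here is
a minimal Python sketch. It is not taken from the tools linked above:
the function-word list, the helper names (features, rank_features) and
the scoring rule (variance between authors divided by variance within
each author, a Fisher-style ratio) are all assumptions chosen for
brevity. Real stylometry systems use far richer feature sets and proper
classifiers.

```python
from collections import Counter
from statistics import mean, pvariance

# Hypothetical feature set: a handful of function words (an assumption,
# not the feature set of any particular stylometry tool).
FUNCTION_WORDS = ["the", "of", "and", "but", "which", "that", "albeit"]


def features(text):
    """Relative frequency of each function word in one text sample."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]


def rank_features(samples_by_author):
    """Rank features by between-author variance over within-author variance.

    samples_by_author maps an author label to a list of text samples.
    A high score means the feature separates authors well while staying
    stable within each author, so it is a candidate to obfuscate first.
    """
    scores = {}
    for i, word in enumerate(FUNCTION_WORDS):
        per_author = [
            [features(text)[i] for text in texts]
            for texts in samples_by_author.values()
        ]
        between = pvariance([mean(vals) for vals in per_author])
        within = mean(pvariance(vals) for vals in per_author)
        # Small constant avoids division by zero for perfectly stable features.
        scores[word] = between / (within + 1e-9)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    samples = {
        "alice": ["the cat sat albeit slowly", "the dog ran albeit quickly"],
        "bob": ["the cat sat very slowly", "the dog ran very quickly"],
    }
    for word, score in rank_features(samples):
        print(word, score)
```

In this toy data only "albeit" differs between the two authors, so it
ranks first. On real text the scores would be much noisier, and the
ranking only means something with many samples per author.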

Ad-hoc techniques as simple as making a conscious effort to "sound" like
someone else have also been shown to be effective without requiring much
practice, but it can also be difficult to break certain key stylistic

Mike Perry

