Interesting indeed. I'll just walk through my thought process in looking at
this; take it as you will.

I'd go with a bracketed approach, noting likely times for WORK and HOME.
With those brackets in place, you can partition devices based on their
distinctive patterns (User A leaves home, or rather stops using Device A,
around 8 AM. Device B is only ever active from 9 to 5. Device A reactivates
at around 6 PM.)
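
To make the bracketing idea concrete, here is a rough sketch. Everything in
it is an assumption on my part: the method name, the 9-to-5 bracket, the 80%
threshold, and the idea that events arrive as [device_id, Time] pairs in
local time. It also ignores weekends, which you'd want to handle separately.

```ruby
# Label each device :work, :home, or :mixed by the share of its events that
# fall inside a hypothetical work-hours bracket.
# events: array of [device_id, Time] pairs (local time assumed).
def bracket_devices(events, work_hours: (9...17), threshold: 0.8)
  counts = Hash.new { |h, k| h[k] = { work: 0, home: 0 } }

  events.each do |device_id, time|
    bucket = work_hours.cover?(time.hour) ? :work : :home
    counts[device_id][bucket] += 1
  end

  counts.transform_values do |c|
    total = c[:work] + c[:home]
    if c[:work].fdiv(total) >= threshold
      :work
    elsif c[:home].fdiv(total) >= threshold
      :home
    else
      :mixed # not clearly one or the other; needs more evidence
    end
  end
end
```

Devices that come out :mixed are not useless; they are just the ones where
the bracket alone tells you little and the discrepancies have to do the work.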

The way to go about it is to build a base of known information, then look
for mappable discrepancies to limit the dataset to something more doable.

By discrepancies, I mean something such as User A getting sick and staying
at home all day watching Netflix and thumbing through pages. If you can
find a window of devices that have a correlating time shift, you can limit
the result set. If User A is sick all day and Device A is hot all day, you
may notice that Device B is cold the entire day. That's worth noting.
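
A sketch of that sick-day check, under some loud assumptions: the method
name and thresholds are made up, activity has already been rolled up into
hours per device per day, and the input only covers ordinary workdays (so a
work device being cold on a Saturday doesn't count as a signal).

```ruby
# Find work devices that were cold on the days a given home device was
# unusually hot.
# daily_hours: { device_id => { "YYYY-MM-DD" => active hours that day } }
# Returns { work_device_id => number of matching days }.
def offsetting_devices(daily_hours, home_device, work_devices, hot: 8, cold: 0)
  hot_days = daily_hours.fetch(home_device, {}).select { |_, h| h >= hot }.keys

  work_devices.each_with_object({}) do |dev, score|
    days = daily_hours.fetch(dev, {})
    # A missing day is treated as zero activity.
    matches = hot_days.count { |d| days.fetch(d, 0) <= cold }
    score[dev] = matches unless matches.zero?
  end
end
```

One matching day is weak evidence on its own; the count only starts to mean
something when the same pair keeps lining up across several discrepancies.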

Another thing that's a gold mine is if the user ever works from home. If
you notice Device B go hot in the same IP range as Device A, you have
another discrepancy that can be mapped. Chances are high that someone in
that area owns both devices, and your window shrinks even more.
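
Roughly, the IP-range check could look like the following. I'm assuming
IPv4 addresses and treating a /24 as "the same range", which is a guess; the
method name is mine too. IPAddr is in Ruby's standard library.

```ruby
require "ipaddr"

# Group devices by network and keep the networks where more than one
# device has been seen.
# sightings: array of [device_id, ip_string] pairs.
# Returns { network_string => [device_ids] }.
def shared_networks(sightings, prefix: 24)
  by_net = Hash.new { |h, k| h[k] = [] }

  sightings.each do |device_id, ip|
    net = IPAddr.new(ip).mask(prefix).to_s
    by_net[net] << device_id unless by_net[net].include?(device_id)
  end

  by_net.select { |_, devices| devices.size > 1 }
end
```

A shared office or coffee-shop network will also show up here, so this is
best used to shrink the candidate set, not to declare a match.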

Let's throw in Device C, User A's mobile device. If you happen to add that
to the equation, you can connect Device A and Device B via user by the
location and movement of Device C, which can be further strengthened again
by discrepancies in behavior. If you notice all 3 devices in the same
location, you have about as close to a bingo as you may ever get. The more
devices that hit a discrepancy at the same time, the better shot you have.
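
One way to count "same place, same time" hits without comparing every
device against every other one is to bucket sightings first and only pair
up devices that land in the same bucket. This is only a sketch: the method
name, the one-hour bucket, and the idea of an opaque location key (an IP
range, a DMA, whatever you have) are all my assumptions.

```ruby
# Count how often each pair of devices shows up at the same location within
# the same time bucket.
# sightings: array of [device_id, location_key, Time].
# Returns { [device_a, device_b] => count } with each pair sorted.
def co_location_counts(sightings, bucket_seconds: 3600)
  buckets = Hash.new { |h, k| h[k] = [] }

  sightings.each do |device_id, location, time|
    key = [location, time.to_i / bucket_seconds]
    buckets[key] << device_id unless buckets[key].include?(device_id)
  end

  counts = Hash.new(0)
  buckets.each_value do |devices|
    devices.sort.combination(2) { |pair| counts[pair] += 1 }
  end
  counts
end
```

Pairs with high counts are the ones worth a closer look with the other
heuristics; very crowded buckets would need capping or down-weighting.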

In areas like San Francisco, you can also take into account the possibility
of commutes, and map that. Devices that go cold in Oakland and go hot again
after more than the 8-hour window may be an indication of a longer commute,
which lets you narrow the window further.

*TL;DR:* Noticeable patterns and windows give you a good base percentage to
work with, but deviations from the norm are where you get your weight in
gold.

These are of course the musings of a non-data-scientist, so take them with
a grain of salt.

On Sat, Oct 4, 2014 at 3:52 PM, Tom Copeland <tom / thomasleecopeland.com>
wrote:

> I have no useful suggestions... but thanks for "The Ruby Way"!
>
> On Oct 3, 2014, at 5:38 PM, Hal Fulton <rubyhacker / gmail.com> wrote:
>
> > I haven't posted here lately... I hope many of you still remember
> > my name...   :)
> >
> > I am working in my day job on a very interesting and challenging problem
> > (yes, mostly in Ruby).
> >
> > Since I have known many Rubyists who were creative and imaginative, I
> > thought I would seek opinions here.
> >
> > If you are familiar with the term "cross-device matching," that is what
> > this is all about.
> >
> > If you're not familiar -- here is a rough synopsis of the classic
> > problem.
> >
> > Ad networks (and such) use cookies and pixels and whatever techniques
> > they can in order to better target their advertising.
> >
> > There are strict privacy constraints, of course. No one is supposed to
> > store information like, "This is Dr. Chandra from Urbana, Illinois" --
> > but it's perfectly OK to store information like "this is user 123, who
> > searched for a new car today, and is the same guy who bought a toaster
> > last week."
> >
> > The big problem is that "user 123" on a laptop may be user 456 on a
> > tablet and user 789 on a phone. Being able to match or associate these
> > users with a good level of probability is sort of a Holy Grail in the
> > industry.
> >
> > Of course, if you're Facebook or Google or something, you can do
> > "deterministic" matching with a very high degree of certainty.
> > Otherwise, you have to take the "probabilistic" approach, as I am here.
> >
> > So I am making some progress here, but I am really reaching out for new
> > and interesting ideas.
> >
> > In essence, I am examining a data stream of millions of anonymized users
> > and trying to group them together based on pure data analysis. We have
> > quite a bit of information including URL clicked, IP address, user
> > agent, time of day, DMA, device type, and so on.
> >
> > For an app-related event, we can get the Apple IDFA or the Android ID. We
> > *cannot* find those IDs for a browser-related event, even on the phone.
> > We can access (our) cookies if there are any (browser but not app), etc.
> > etc.
> >
> > If I had near-infinite storage and processing power, I would build a
> > matrix of several quadrillion entries and update it over time, finding
> > essentially a probability vector for each user with respect to every
> > other user. Then I could apply some heuristics and weight them
> > appropriately.
> >
> > However, to do this in "reasonable" time with limited RAM and disk is
> > another problem entirely.
> >
> > I'm having acceptable success so far, but I am definitely interested in
> > hearing others' thoughts on this.
> >
> > Thanks,
> > Hal Fulton