A very interesting problem

I haven’t posted here lately… I hope many of you still remember
my name… :)

I am working in my day job on a very interesting and challenging problem
(yes, mostly in Ruby).

Since I have known many Rubyists who were creative and imaginative, I
thought I would seek opinions here.

If you are familiar with the term “cross-device matching,” that is what
this is all about.

If you’re not familiar – here is a rough synopsis of the classic
problem.

Ad networks (and such) use cookies and pixels and whatever techniques
they can in order to better target their advertising.

There are strict privacy constraints, of course. No one is supposed to
store information like, “This is Dr. Chandra from Urbana, Illinois” – but
it’s perfectly OK to store information like “this is user 123, who
searched for a new car today, and is the same guy who bought a toaster
last week.”

The big problem is that “user 123” on a laptop may be user 456 on a
tablet and user 789 on a phone. Being able to match or associate these
users with a good level of probability is sort of a Holy Grail in the
industry.

Of course, if you’re Facebook or Google or something, you can do
“deterministic” matching with a very high degree of certainty. Otherwise,
you have to take the “probabilistic” approach, as I am here.

So I am making some progress here, but I am really reaching out for new
and interesting ideas.

In essence, I am examining a data stream of millions of anonymized users
and trying to group them together based on pure data analysis. We have
quite a bit of information including URL clicked, IP address, user agent,
time of day, DMA, device type, and so on.

For an app-related event, we can get the Apple IDFA or the Android ID.
We cannot find those IDs for a browser-related event, even on the phone.
We can access (our) cookies if there are any (browser but not app),
etc. etc.

If I had near-infinite storage and processing power, I would build a
matrix of several quadrillion entries and update it over time, finding
essentially a probability vector for each user with respect to every
other user. Then I could apply some heuristics and weight them
appropriately.

However, to do this in “reasonable” time with limited RAM and disk is
another problem entirely.
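
One common way to avoid the full user-by-user matrix is to score only
pairs that share some "blocking" key. Here is a minimal sketch of that
idea, assuming the key is the IP address. The `Event` struct, the
`shared_ip_counts` method, and the `max_users_per_ip` cutoff are all
hypothetical names made up for illustration, not anything from the real
pipeline.

```ruby
require "set"

# Hypothetical event shape: an anonymized user id plus the IP it was seen on.
Event = Struct.new(:user_id, :ip)

# Count, for each pair of users, how many distinct IPs they share.
# Only pairs that co-occur on at least one IP are ever stored, so the
# structure stays sparse instead of growing with the square of the users.
def shared_ip_counts(events, max_users_per_ip: 50)
  users_by_ip = Hash.new { |hash, key| hash[key] = Set.new }
  events.each { |event| users_by_ip[event.ip] << event.user_id }

  counts = Hash.new(0)
  users_by_ip.each_value do |users|
    # Skip very busy IPs (offices, carrier NAT): they would pair
    # everyone with everyone and carry little signal. Assumed cutoff.
    next if users.size > max_users_per_ip

    users.to_a.sort.combination(2) { |a, b| counts[[a, b]] += 1 }
  end
  counts
end

events = [
  Event.new(123, "10.0.0.1"), Event.new(456, "10.0.0.1"),
  Event.new(123, "10.0.0.2"), Event.new(456, "10.0.0.2"),
  Event.new(789, "10.0.0.3")
]
pair_counts = shared_ip_counts(events)
```

The resulting hash is keyed by sorted user-id pairs, so it could serve as
the sparse stand-in for the matrix, with other heuristics adding weight
to the same keys.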

I’m having acceptable success so far, but I am definitely interested in
hearing others’ thoughts on this.

Thanks,
Hal F.

I’d be surprised if anybody would share such valuable knowledge for
free. All I can add is: dust off the stochastic methods and mathematical
analysis from your university years, and rethink the way “anonymized”
data are collected and stored.

I’ve worked not for an ad company but for a telecommunications
provider, and I was quite disgusted by how much data was collected and
how it was misused. Formally it was filtered/mangled to obey the law,
but the primary data were kept in storage nevertheless and used for
“higher” interests.

On Fri, Oct 3, 2014 at 11:38 PM, Hal F. wrote:

I haven’t posted here lately… I hope many of you still remember
my name… :)

Yes, of course. Welcome back!

I am working in my day job on a very interesting and challenging problem
(yes, mostly in Ruby).

Since I have known many Rubyists who were creative and imaginative, I
thought I would seek opinions here.

If you are familiar with the term “cross-device matching,” that is what this
is all about.

The problem is indeed interesting and I am tempted to spend a few
brain CPU cycles on it. However, I do not really like the
application…

Kind regards

robert

I have no useful suggestions but thanks for The Ruby Way!

Hello Hal, it’s good to see you here again.

Like Robert, I have some worries about the use.
At the same time, I’ve often wished I could preserve a web session across
devices, so…

Using Chrome, you have the option to sync info across all browser
instances. Not sure if this is accessible to anyone but Google?
Or how about a service that a person signs up for that lets you track
their info?

I read an article a while ago where they were able to “fingerprint”
specific users based on the info browsers broadcast as users surf the
web. Not sure how consistent this info is for a particular person across
devices, but maybe worth looking into.

cheers,
Chris

On Sat, Oct 4, 2014 at 6:04 PM, Robert K.

Hal F. wrote:

So I am making some progress here, but I am really reaching out for new and
interesting ideas.

Do you have any work-in-progress code/ideas to share with us?

In essence, I am examining a data stream of millions of anonymized users and
trying to group them together based on pure data analysis. We have quite a bit
of information including URL clicked, IP address, user agent, time of day,
DMA, device type, and so on.

And can you publish the anonymized data so we can help analyze/test
against it, too?

Unlike some others here, I do not have any reservations about the
applications of this[1].

What I do care about is being able to keep public and make use of
any effort I might put into this myself. No NDAs or the like.

[1] Probably most users of my work use it to do things I disagree
with, and use it for things I would never touch myself;
but I refuse to discriminate against fields of endeavor.

It just so happens that Stanford just started an on-line course which
might be helpful in thinking about problems like this.

Note that the text for the course can be downloaded via a link on that
page, apparently even if you aren’t signed up for the course.


Interesting indeed. I’ll just flow through my thought process in looking
at this; take it as you will.

I’d go with a bracketed approach, noting likely times for WORK and HOME.
When that’s a given, you can partition them based on unique patterns
(User A leaves home, or rather stops using Device A, around 8AM. Device B
only ever stays active from 9 to 5. Device A reactivates at around 6.)
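
As a rough sketch of that bracketing, assuming each device has already
been reduced to the set of hours (0..23) in which it was seen, you could
score how much two devices overlap and how often one picks up shortly
after the other goes quiet. The method names `hour_overlap` and
`handoffs`, and the two-hour `max_gap`, are made up for illustration.

```ruby
require "set"

# Fraction of active hours two devices have in common (Jaccard index).
def hour_overlap(a, b)
  union = a | b
  return 0.0 if union.empty?

  (a & b).size.to_f / union.size
end

# Number of times `from` goes quiet and `to` becomes active within
# `max_gap` hours afterwards. Hours wrap around midnight.
def handoffs(from, to, max_gap: 2)
  from.count do |hour|
    # Only the last hour of a run of activity counts as "going quiet".
    next false if from.include?((hour + 1) % 24)

    (1..max_gap).any? { |gap| to.include?((hour + gap) % 24) }
  end
end

device_a = Set[6, 7, 18, 19, 20, 21] # home device: mornings and evenings
device_b = Set.new(9..16)            # work device: active from 9 to 5

overlap   = hour_overlap(device_a, device_b)
a_then_b  = handoffs(device_a, device_b)
b_then_a  = handoffs(device_b, device_a)
```

Low overlap plus handoffs in both directions would be one (weak) hint
that the two devices belong to the same person; plenty of unrelated
devices will look like this too, so it only makes sense as one weight
among several.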

The way to go about it is to build a base of known information, then
look for mappable discrepancies to limit the dataset to something more
doable.

By discrepancies, I mean something such as User A getting sick and
staying at home all day watching Netflix and thumbing through pages. If
you can find a window of devices that have a correlating time shift, you
can limit the result set. If User A is sick all day and Device A is hot
all day, you may notice that Device B is cold the entire day. That’s
worth noting.
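
One way to hunt for such discrepancies, as a sketch only: compare each
device's daily event count against its own history and flag the days
where one device is unusually hot while the other is unusually cold. The
method names, the sample counts, and the 1.5 threshold below are all
assumptions for illustration.

```ruby
# Mean and (population) standard deviation of a list of numbers.
def mean_and_sd(values)
  mean = values.sum.to_f / values.size
  variance = values.sum { |v| (v - mean)**2 } / values.size
  [mean, Math.sqrt(variance)]
end

# Z-score of each day's count against the device's own history.
def z_scores(daily_counts)
  mean, sd = mean_and_sd(daily_counts)
  return daily_counts.map { 0.0 } if sd.zero?

  daily_counts.map { |count| (count - mean) / sd }
end

# Indices of days where `hot` is unusually active while `cold` is
# unusually quiet, both measured against their own baselines.
def opposite_anomaly_days(hot, cold, threshold: 1.5)
  hot_z = z_scores(hot)
  cold_z = z_scores(cold)
  (0...hot_z.size).select do |day|
    hot_z[day] >= threshold && cold_z[day] <= -threshold
  end
end

# Made-up daily event counts for one week; day 4 is the "sick day".
device_a = [10, 12, 11, 9, 60, 10, 11] # home device
device_b = [40, 42, 38, 41, 0, 39, 40] # work device

sick_days = opposite_anomaly_days(device_a, device_b)
```

With real data the baseline would probably need to be per weekday, since
weekends shift both devices at once for almost everyone.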

Another thing that’s a gold mine is if the user ever works from home. If
you notice Device B go hot in the same IP range as Device A, you have
another discrepancy that can be mapped. Chances are high that someone in
that area owns both devices, and your window shrinks even more.

Let’s throw in Device C, User A’s mobile device. If you happen to add
that to the equation, you can connect Device A and Device B via user by
the location and movement of Device C, which can be further strengthened
again by discrepancies in behavior. If you notice all 3 devices in the
same location, you have about as close to a bingo as you may ever get.
The more devices that hit a discrepancy at the same time, the better
shot you have.
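
Connecting A and B through C is a transitive grouping, which a
union-find structure handles cheaply. A minimal sketch, assuming some
earlier step has already decided which pairs are strong enough to link;
the `DeviceGroups` class and the device names are hypothetical.

```ruby
# Minimal union-find over device ids: linking A-C and B-C puts A, B and C
# in the same group without ever comparing A and B directly.
class DeviceGroups
  def initialize
    @parent = {}
  end

  # Root of the group containing `device`, with path compression.
  def find(device)
    @parent[device] ||= device
    @parent[device] = find(@parent[device]) unless @parent[device] == device
    @parent[device]
  end

  # Merge the groups containing `a` and `b`.
  def link(a, b)
    root_a = find(a)
    root_b = find(b)
    @parent[root_a] = root_b unless root_a == root_b
  end

  # All groups seen so far, as arrays of device ids.
  def groups
    @parent.keys.group_by { |device| find(device) }.values
  end
end

grouping = DeviceGroups.new
grouping.link(:device_a, :device_c) # phone seen at home with the laptop
grouping.link(:device_b, :device_c) # phone seen at work with the desktop
grouping.link(:device_x, :device_y) # an unrelated pair

clusters = grouping.groups.map(&:sort).sort
```

The catch is that one bad link merges two people for good, so in
practice you would likely want a high bar before calling `link`.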

In areas like San Francisco, you can also take into account the
possibility of commutes, and map that. Devices that go cold in Oakland
and go hot again elsewhere over the 8-hour window may be an indication
of a longer commute, allowing a narrowing of the window.

TL;DR: Noticeable patterns and windows give you a good portion of a base
percentage to work with, but deviations from the normal are where you
get your weight in gold.

These are, of course, musings from a non-data-scientist, so take them
with a grain of salt.

On Sat, Oct 4, 2014 at 3:52 PM, Tom C.

That does look verrrry interesting, thank you!

Hal

On Mon, Oct 6, 2014 at 12:28 PM, Rick DeNatale

A friend of mine who works for the NSA told me that they have an
elegant solution to that problem. But they do not share their code, not
even with other government agencies.