BorlandTalk.com Forum Index BorlandTalk.com
Borland discussion newsgroups
 
Archives   FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Tracking a lot of numbers to avoid repetition
Goto page 1, 2, 3  Next
 
Post new topic   Reply to topic    BorlandTalk.com Forum Index -> C++ Builder (Non-Technical)
View previous topic :: View next topic  
Author Message
Andrue Cope [TeamB]
Guest





PostPosted: Mon Mar 13, 2006 12:03 pm    Post subject: Tracking a lot of numbers to avoid repetition Reply with quote



I have a minor potential problem and I'd be interested in suggestion
for possible solutions.

Suppose you have a process that is logging nodes that are identified by
a unique 64 bit numeric ID. You know nothing more about the range
values other than they are 64 bit numbers. The order in which you are
notified of them, the spread etc. are complete unknowns. The number of
nodes is a complete unknown.

Your task is to detect and prevent duplicate nodes being processed. At
present we are using an std::set<QWORD> but given enough nodes this
will run out of memory and sadly this is not a 'once in a lifetime'
possibility.

Through good design principals (<g>) this is all being managed by a
wrapper class so I can fairly easily change the internals to whatever I
want.

I'm thinking of changing to a std::vector<QWORD> and using
std::equal_range. At some point I can swap the vector out to disk and
use our own file iterator class. The downside of this is that there is
a performance issue. Other times we've used this strategy we can rely
on the items being presorted so we can optimse for appending the entry
to the sorted file. We can't this time and although the insertion will
work it is never good to try and insert data into the middle of a file.

One solution would be to use an intermediate index file so that the
actual data doesn't have to be rearranged but in this case I'd gain
nothing because the index file would be a bunch of QWORDs anyway.

So to give a more concrete example:If I said I was going to throw a
couple of hundred million 64 bit random numbers at you how would you
detect duplicates? Oh and performance does matter :)

--
Andrue Cope [TeamB]
[Bicester, Uk]
http://info.borland.com/newsgroups/guide.html
Back to top
Des O'Toole
Guest





PostPosted: Mon Mar 13, 2006 1:03 pm    Post subject: Re: Tracking a lot of numbers to avoid repetition Reply with quote



"Andrue Cope [TeamB]" <no.spam (AT) not (DOT) a.valid.address> wrote in message
news:44155958$1 (AT) newsgroups (DOT) borland.com...
Quote:
I have a minor potential problem and I'd be interested in suggestion
for possible solutions.

Suppose you have a process that is logging nodes that are identified by
a unique 64 bit numeric ID. You know nothing more about the range
values other than they are 64 bit numbers. The order in which you are
notified of them, the spread etc. are complete unknowns. The number of
nodes is a complete unknown.

Your task is to detect and prevent duplicate nodes being processed. At
present we are using an std::set<QWORD> but given enough nodes this
will run out of memory and sadly this is not a 'once in a lifetime'
possibility.

Through good design principals (<g>) this is all being managed by a
wrapper class so I can fairly easily change the internals to whatever I
want.

I'm thinking of changing to a std::vector<QWORD> and using
std::equal_range. At some point I can swap the vector out to disk and
use our own file iterator class. The downside of this is that there is
a performance issue. Other times we've used this strategy we can rely
on the items being presorted so we can optimse for appending the entry
to the sorted file. We can't this time and although the insertion will
work it is never good to try and insert data into the middle of a file.

One solution would be to use an intermediate index file so that the
actual data doesn't have to be rearranged but in this case I'd gain
nothing because the index file would be a bunch of QWORDs anyway.

So to give a more concrete example:If I said I was going to throw a
couple of hundred million 64 bit random numbers at you how would you
detect duplicates? Oh and performance does matter Smile

Hi Andrue,

If you're prepared to create your own solution instead of just relying on
the STL then you could split the 64 bit range into 'zones' of 2^x and have a
flag to denote if a value exists in this zone/range. If the flag is false
then the number is not a duplicate, if true, then you will need to check
against all the entries in that zone.

As an example, lets say you are prepared to have 4GB lookup table
(containing 32,000,000,000 bit fields). The incoming number is shifted to
divide by 2^24 to find its entry in the bit field. As you are only talking
about 100's of millions of random entries the chances are that the vast
majority of entries will be processed by the bitfield method and be very
quick to process.

For fairly random data, this will probably be quite a quick way to work.It's
quickest mode will be in establishing that a new number is not a duplicate.

HTH

Des
Back to top
Leo Siefert
Guest





PostPosted: Mon Mar 13, 2006 2:03 pm    Post subject: Re: Tracking a lot of numbers to avoid repetition Reply with quote



Andrue Cope [TeamB] wrote:

Quote:
Your task is to detect and prevent duplicate nodes being processed. At
present we are using an std::set<QWORD> but given enough nodes this
will run out of memory and sadly this is not a 'once in a lifetime'
possibility.

You could try inserting them into an sql table with a unique field and
let it complain about duplicates.

- Leo
Back to top
Guest






PostPosted: Mon Mar 13, 2006 2:03 pm    Post subject: Re: Tracking a lot of numbers to avoid repetition Reply with quote

"Andrue Cope [TeamB]" <no.spam (AT) not (DOT) a.valid.address> writes:

Quote:
So to give a more concrete example:If I said I was going to throw a
couple of hundred million 64 bit random numbers at you how would you
detect duplicates? Oh and performance does matter Smile

Use a hash table.
Of course, the hash table design needs to avoid clustering on your
dataset, since otherwise performance will drop.

For an account of one implementation:
http://www.irisa.fr/texmex/publications/versionElect/2005/CVDB2005_Naturel.pdf.


You may also want to read Monge's Matching algorithms in a duplicate
detection system. Although this is mainly about data record
duplicates, it does offer some ideas.
Back to top
Jonathan Benedicto
Guest





PostPosted: Mon Mar 13, 2006 2:03 pm    Post subject: Re: Tracking a lot of numbers to avoid repetition Reply with quote

Andrue Cope [TeamB] wrote:
Quote:
So to give a more concrete example:If I said I was going to throw a
couple of hundred million 64 bit random numbers at you how would you
detect duplicates?

Maybe this? :-D

for( QWORD i = 0; i < list.length; ++i )
{
for( QWORD j = i + 1; j < list.length; ++j )
{
if( list[ i ] == list[ j ] )
throw Exception( "duplicate" );
}
}

Jonathan
Back to top
Andrue Cope [TeamB]
Guest





PostPosted: Mon Mar 13, 2006 2:03 pm    Post subject: Re: Tracking a lot of numbers to avoid repetition Reply with quote

Des O'Toole wrote:

Quote:
If you're prepared to create your own solution instead of just
relying on the STL then you could split the 64 bit range into 'zones'
of 2^x and have a flag to denote if a value exists in this
zone/range. If the flag is false then the number is not a duplicate,
if true, then you will need to check against all the entries in that
zone.

That's an interesting idea that I hadn't thought of. It might not help
avoid the memory consumption issue (that's going to depend on the
spread) but could be useful for performance. Most of the time this is
going to be used there 'probably will' be an element of clumping.

--
Andrue Cope [TeamB]
[Bicester, Uk]
http://info.borland.com/newsgroups/guide.html
Back to top
Guest






PostPosted: Mon Mar 13, 2006 2:03 pm    Post subject: Re: Tracking a lot of numbers to avoid repetition Reply with quote

The most space effective methods for this use Bloom filters ( a hash
table based probablistic data structure ) to solve this problem with
an arbitarily low false positive rate. The false positive rate can be
set to 0.1% for example, and thus you can avoid the expensive lookup
in 99.9% of cases

For a case using 64 bit integer IDs, see:

Duplicate Detection in Click Streams
by Ahmed Metwally,Divyakant Agrawal and Amr El Abbadi
http://www.cs.ucsb.edu/research/tech_reports/reports/2004-23.pdf

For more on Bloom filters see:

Bloom filter
http://en.wikipedia.org/wiki/Bloom_filter
Network Applications of Bloom Filters: A Survey
http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey.pdf
Using Bloom Filters
http://www.perl.com/pub/a/2004/04/08/bloom_filters.html
A Little Bloom Filter Theory (and a Bag of filter Tricks)
http://www.cap-lore.com/code/BloomTheory.html
Bloom Filters - the math
http://www.cs.wisc.edu/~cao/papers/summary-cache/node8.html
Compressed Bloom Filters
http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf.pdf

--

Seek simplicity and mistrust it.
Alfred Whitehead

A witty saying proves nothing.
Voltaire
Back to top
Duane Hebert
Guest





PostPosted: Mon Mar 13, 2006 3:03 pm    Post subject: Re: Tracking a lot of numbers to avoid repetition Reply with quote

"Andrue Cope [TeamB]" <no.spam (AT) not (DOT) a.valid.address> wrote in message
news:44155958$1 (AT) newsgroups (DOT) borland.com...

Quote:
So to give a more concrete example:If I said I was going to throw a
couple of hundred million 64 bit random numbers at you how would you
detect duplicates? Oh and performance does matter Smile


It seems that you're concerned about the overhead to
detect duplicates. I'm not sure what the complexity of
std::set is to detect duplicates. Are you searching the
set first or testing failure of insert?

You've probably benchmarked this but my initial thought
would be a map since find is O(log n).
Back to top
Andrue Cope [TeamB]
Guest





PostPosted: Mon Mar 13, 2006 4:03 pm    Post subject: Re: Tracking a lot of numbers to avoid repetition Reply with quote

<rambam (AT) bigpond (DOT) net.au> wrote:

Quote:
The most space effective methods for this use Bloom filters ( a hash
table based probablistic data structure ) to solve this problem with
an arbitarily low false positive rate.

Very interesting, thanks. I'm a little concerned about the false
positive though so will need to look further. The problem is that each
node could contain the metadata for over a dozen files so we can't
afford to reject any of them.

--
Andrue Cope [TeamB]
[Bicester, Uk]
http://info.borland.com/newsgroups/guide.html
Back to top
Andrue Cope [TeamB]
Guest





PostPosted: Mon Mar 13, 2006 4:03 pm    Post subject: Re: Tracking a lot of numbers to avoid repetition Reply with quote

Duane Hebert wrote:

Quote:
It seems that you're concerned about the overhead to
detect duplicates. I'm not sure what the complexity of
std::set is to detect duplicates.

Um. I'll put my hand up to that one. I wrote the class when I was first
learning the STL and I chose std::set because I didn't want to linear
search a vector. Of course in the intervening years I learnt about
std::equal_range and friends :)

Performance is an issue but ultimately memory footprint is what
matters. Although we do need this to be reasonably fast what we
absolutely have to avoid is running out of memory or falsely detecting
duplicates. Basically this is phase one of reindexing a hierarchical
database. During phase one we are finding all the nodes and we just
want to be sure we don't get stuck in a loop during the next phase.

--
Andrue Cope [TeamB]
[Bicester, Uk]
http://info.borland.com/newsgroups/guide.html
Back to top
Jonathan Benedicto
Guest





PostPosted: Mon Mar 13, 2006 4:03 pm    Post subject: Re: Tracking a lot of numbers to avoid repetition Reply with quote

Andrue Cope [TeamB] wrote:
Quote:
So to give a more concrete example:If I said I was going to throw a
couple of hundred million 64 bit random numbers at you how would you
detect duplicates? Oh and performance does matter Smile

What about creating a temp file with CreateFile, setting its size to the max
of QWORDs you'll get, mapping it and filling it with 0.

Then, to check whether the QWORD is already stored, check the map ptr for a
value greater than 0. If not greater, set it.

Jonathan
Back to top
Andrue Cope [TeamB]
Guest





PostPosted: Mon Mar 13, 2006 4:03 pm    Post subject: Re: Tracking a lot of numbers to avoid repetition Reply with quote

Leo Siefert wrote:

Quote:
You could try inserting them into an sql table with a unique field and
let it complain about duplicates.

Unfortunately that would mean impementing a SQL database on a
workstation that doesn't currently have or need one. Plus I have my
doubts (I'm open to correction) on the performance aspects of
communicating with a SQL database as compared to code integrated into
the application.

What I perhaps didn't make clear in my first post is that this is
required as part of a construction process (ie;getting ready for the
big show rather than the big show itself). When there is a huge number
of nodes users will accept that it takes longer than five minutes to
get ready to process them but anything more than half an hour and they
will become concerned.

--
Andrue Cope [TeamB]
[Bicester, Uk]
http://info.borland.com/newsgroups/guide.html
Back to top
Andrue Cope [TeamB]
Guest





PostPosted: Mon Mar 13, 2006 4:03 pm    Post subject: Re: Tracking a lot of numbers to avoid repetition Reply with quote

Gillmer J. Derge [TeamB] wrote:

Quote:
I think he's concerned with size, since each value in a set would
require 3 to 5 times (depending on whether we're dealing with 32 or
64 bit pointers) as much memory as it would in a vector.

Primarily yes. We're happy to trade a little performance away just to
be assured that we can at least do the job. The nature of the work is
such that "we can get most of it" or "there might be a few drop outs"
is not good enough. If a disk has 200 million objects on it then that's
how many we have to find and process.

Quote:
Ultimately any homebuilt solution you're thinking about is
essentially just a crappy database that's good enough to do the job
but no better.

That depends. It could also be a hand tuned, optimised database. A
couple of years ago we looked at using SQL for our backend database of
objects. We found that it just didn't have the performance we needed
for file system access. Interestingly Microsoft appear to have come to
same conclusion with WinFS :)

--
Andrue Cope [TeamB]
[Bicester, Uk]
http://info.borland.com/newsgroups/guide.html
Back to top
vavan
Guest





PostPosted: Mon Mar 13, 2006 4:03 pm    Post subject: Re: Tracking a lot of numbers to avoid repetition Reply with quote

On Mon, 13 Mar 2006 14:51:49 +0000, "Andrue Cope [TeamB]"
<no.spam (AT) not (DOT) a.valid.address> wrote:

Quote:
Unfortunately that would mean impementing a SQL database on a
workstation that doesn't currently have or need one. Plus I have my
doubts (I'm open to correction) on the performance aspects of
communicating with a SQL database as compared to code integrated into
the application.

you forgot that there are db-engines available free from sql-layer
overhead. and sometimes you can even integrate them right in your
original application

--
Vladimir Ulchenko aka vavan
Back to top
David Dean
Guest





PostPosted: Mon Mar 13, 2006 4:03 pm    Post subject: Re: Tracking a lot of numbers to avoid repetition Reply with quote

In article <44155958$1 (AT) newsgroups (DOT) borland.com>,
"Andrue Cope [TeamB]" <no.spam (AT) not (DOT) a.valid.address> wrote:

Quote:
So to give a more concrete example:If I said I was going to throw a
couple of hundred million 64 bit random numbers at you how would you
detect duplicates? Oh and performance does matter Smile

A binary tree is my first inclination. I imagine this would get
memory intensive, so I might load subtrees in from files on every nth
node.

--
-David

Nihil curo de ista tua stulta superstitione.
Back to top
Display posts from previous:   
Post new topic   Reply to topic    BorlandTalk.com Forum Index -> C++ Builder (Non-Technical) All times are GMT
Goto page 1, 2, 3  Next
Page 1 of 3

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2006 phpBB Group
SEO toolkit © 2004-2006 webmedic.