| View previous topic :: View next topic |
| Author |
Message |
Andrue Cope [TeamB] Guest
|
Posted: Mon Mar 13, 2006 12:03 pm Post subject: Tracking a lot of numbers to avoid repetition |
|
|
I have a minor potential problem and I'd be interested in suggestion
for possible solutions.
Suppose you have a process that is logging nodes that are identified by
a unique 64 bit numeric ID. You know nothing more about the range
values other than they are 64 bit numbers. The order in which you are
notified of them, the spread etc. are complete unknowns. The number of
nodes is a complete unknown.
Your task is to detect and prevent duplicate nodes being processed. At
present we are using an std::set<QWORD> but given enough nodes this
will run out of memory and sadly this is not a 'once in a lifetime'
possibility.
Through good design principals (<g>) this is all being managed by a
wrapper class so I can fairly easily change the internals to whatever I
want.
I'm thinking of changing to a std::vector<QWORD> and using
std::equal_range. At some point I can swap the vector out to disk and
use our own file iterator class. The downside of this is that there is
a performance issue. Other times we've used this strategy we can rely
on the items being presorted so we can optimse for appending the entry
to the sorted file. We can't this time and although the insertion will
work it is never good to try and insert data into the middle of a file.
One solution would be to use an intermediate index file so that the
actual data doesn't have to be rearranged but in this case I'd gain
nothing because the index file would be a bunch of QWORDs anyway.
So to give a more concrete example:If I said I was going to throw a
couple of hundred million 64 bit random numbers at you how would you
detect duplicates? Oh and performance does matter :)
--
Andrue Cope [TeamB]
[Bicester, Uk]
http://info.borland.com/newsgroups/guide.html |
|
| Back to top |
|
 |
Des O'Toole Guest
|
Posted: Mon Mar 13, 2006 1:03 pm Post subject: Re: Tracking a lot of numbers to avoid repetition |
|
|
"Andrue Cope [TeamB]" <no.spam (AT) not (DOT) a.valid.address> wrote in message
news:44155958$1 (AT) newsgroups (DOT) borland.com...
| Quote: | I have a minor potential problem and I'd be interested in suggestion
for possible solutions.
Suppose you have a process that is logging nodes that are identified by
a unique 64 bit numeric ID. You know nothing more about the range
values other than they are 64 bit numbers. The order in which you are
notified of them, the spread etc. are complete unknowns. The number of
nodes is a complete unknown.
Your task is to detect and prevent duplicate nodes being processed. At
present we are using an std::set<QWORD> but given enough nodes this
will run out of memory and sadly this is not a 'once in a lifetime'
possibility.
Through good design principals (<g>) this is all being managed by a
wrapper class so I can fairly easily change the internals to whatever I
want.
I'm thinking of changing to a std::vector<QWORD> and using
std::equal_range. At some point I can swap the vector out to disk and
use our own file iterator class. The downside of this is that there is
a performance issue. Other times we've used this strategy we can rely
on the items being presorted so we can optimse for appending the entry
to the sorted file. We can't this time and although the insertion will
work it is never good to try and insert data into the middle of a file.
One solution would be to use an intermediate index file so that the
actual data doesn't have to be rearranged but in this case I'd gain
nothing because the index file would be a bunch of QWORDs anyway.
So to give a more concrete example:If I said I was going to throw a
couple of hundred million 64 bit random numbers at you how would you
detect duplicates? Oh and performance does matter
|
Hi Andrue,
If you're prepared to create your own solution instead of just relying on
the STL then you could split the 64 bit range into 'zones' of 2^x and have a
flag to denote if a value exists in this zone/range. If the flag is false
then the number is not a duplicate, if true, then you will need to check
against all the entries in that zone.
As an example, lets say you are prepared to have 4GB lookup table
(containing 32,000,000,000 bit fields). The incoming number is shifted to
divide by 2^24 to find its entry in the bit field. As you are only talking
about 100's of millions of random entries the chances are that the vast
majority of entries will be processed by the bitfield method and be very
quick to process.
For fairly random data, this will probably be quite a quick way to work.It's
quickest mode will be in establishing that a new number is not a duplicate.
HTH
Des |
|
| Back to top |
|
 |
Leo Siefert Guest
|
Posted: Mon Mar 13, 2006 2:03 pm Post subject: Re: Tracking a lot of numbers to avoid repetition |
|
|
Andrue Cope [TeamB] wrote:
| Quote: | Your task is to detect and prevent duplicate nodes being processed. At
present we are using an std::set<QWORD> but given enough nodes this
will run out of memory and sadly this is not a 'once in a lifetime'
possibility.
|
You could try inserting them into an sql table with a unique field and
let it complain about duplicates.
- Leo |
|
| Back to top |
|
 |
Guest
|
Posted: Mon Mar 13, 2006 2:03 pm Post subject: Re: Tracking a lot of numbers to avoid repetition |
|
|
"Andrue Cope [TeamB]" <no.spam (AT) not (DOT) a.valid.address> writes:
| Quote: | So to give a more concrete example:If I said I was going to throw a
couple of hundred million 64 bit random numbers at you how would you
detect duplicates? Oh and performance does matter
|
Use a hash table.
Of course, the hash table design needs to avoid clustering on your
dataset, since otherwise performance will drop.
For an account of one implementation:
http://www.irisa.fr/texmex/publications/versionElect/2005/CVDB2005_Naturel.pdf.
You may also want to read Monge's Matching algorithms in a duplicate
detection system. Although this is mainly about data record
duplicates, it does offer some ideas. |
|
| Back to top |
|
 |
Jonathan Benedicto Guest
|
Posted: Mon Mar 13, 2006 2:03 pm Post subject: Re: Tracking a lot of numbers to avoid repetition |
|
|
Andrue Cope [TeamB] wrote:
| Quote: | So to give a more concrete example:If I said I was going to throw a
couple of hundred million 64 bit random numbers at you how would you
detect duplicates?
|
Maybe this? :-D
for( QWORD i = 0; i < list.length; ++i )
{
for( QWORD j = i + 1; j < list.length; ++j )
{
if( list[ i ] == list[ j ] )
throw Exception( "duplicate" );
}
}
Jonathan |
|
| Back to top |
|
 |
Andrue Cope [TeamB] Guest
|
Posted: Mon Mar 13, 2006 2:03 pm Post subject: Re: Tracking a lot of numbers to avoid repetition |
|
|
Des O'Toole wrote:
| Quote: | If you're prepared to create your own solution instead of just
relying on the STL then you could split the 64 bit range into 'zones'
of 2^x and have a flag to denote if a value exists in this
zone/range. If the flag is false then the number is not a duplicate,
if true, then you will need to check against all the entries in that
zone.
|
That's an interesting idea that I hadn't thought of. It might not help
avoid the memory consumption issue (that's going to depend on the
spread) but could be useful for performance. Most of the time this is
going to be used there 'probably will' be an element of clumping.
--
Andrue Cope [TeamB]
[Bicester, Uk]
http://info.borland.com/newsgroups/guide.html |
|
| Back to top |
|
 |
Guest
|
|
| Back to top |
|
 |
Duane Hebert Guest
|
Posted: Mon Mar 13, 2006 3:03 pm Post subject: Re: Tracking a lot of numbers to avoid repetition |
|
|
"Andrue Cope [TeamB]" <no.spam (AT) not (DOT) a.valid.address> wrote in message
news:44155958$1 (AT) newsgroups (DOT) borland.com...
| Quote: | So to give a more concrete example:If I said I was going to throw a
couple of hundred million 64 bit random numbers at you how would you
detect duplicates? Oh and performance does matter
|
It seems that you're concerned about the overhead to
detect duplicates. I'm not sure what the complexity of
std::set is to detect duplicates. Are you searching the
set first or testing failure of insert?
You've probably benchmarked this but my initial thought
would be a map since find is O(log n). |
|
| Back to top |
|
 |
Andrue Cope [TeamB] Guest
|
Posted: Mon Mar 13, 2006 4:03 pm Post subject: Re: Tracking a lot of numbers to avoid repetition |
|
|
<rambam (AT) bigpond (DOT) net.au> wrote:
| Quote: | The most space effective methods for this use Bloom filters ( a hash
table based probablistic data structure ) to solve this problem with
an arbitarily low false positive rate.
|
Very interesting, thanks. I'm a little concerned about the false
positive though so will need to look further. The problem is that each
node could contain the metadata for over a dozen files so we can't
afford to reject any of them.
--
Andrue Cope [TeamB]
[Bicester, Uk]
http://info.borland.com/newsgroups/guide.html |
|
| Back to top |
|
 |
Andrue Cope [TeamB] Guest
|
Posted: Mon Mar 13, 2006 4:03 pm Post subject: Re: Tracking a lot of numbers to avoid repetition |
|
|
Duane Hebert wrote:
| Quote: | It seems that you're concerned about the overhead to
detect duplicates. I'm not sure what the complexity of
std::set is to detect duplicates.
|
Um. I'll put my hand up to that one. I wrote the class when I was first
learning the STL and I chose std::set because I didn't want to linear
search a vector. Of course in the intervening years I learnt about
std::equal_range and friends :)
Performance is an issue but ultimately memory footprint is what
matters. Although we do need this to be reasonably fast what we
absolutely have to avoid is running out of memory or falsely detecting
duplicates. Basically this is phase one of reindexing a hierarchical
database. During phase one we are finding all the nodes and we just
want to be sure we don't get stuck in a loop during the next phase.
--
Andrue Cope [TeamB]
[Bicester, Uk]
http://info.borland.com/newsgroups/guide.html |
|
| Back to top |
|
 |
Jonathan Benedicto Guest
|
Posted: Mon Mar 13, 2006 4:03 pm Post subject: Re: Tracking a lot of numbers to avoid repetition |
|
|
Andrue Cope [TeamB] wrote:
| Quote: | So to give a more concrete example:If I said I was going to throw a
couple of hundred million 64 bit random numbers at you how would you
detect duplicates? Oh and performance does matter
|
What about creating a temp file with CreateFile, setting its size to the max
of QWORDs you'll get, mapping it and filling it with 0.
Then, to check whether the QWORD is already stored, check the map ptr for a
value greater than 0. If not greater, set it.
Jonathan |
|
| Back to top |
|
 |
Andrue Cope [TeamB] Guest
|
Posted: Mon Mar 13, 2006 4:03 pm Post subject: Re: Tracking a lot of numbers to avoid repetition |
|
|
Leo Siefert wrote:
| Quote: | You could try inserting them into an sql table with a unique field and
let it complain about duplicates.
|
Unfortunately that would mean impementing a SQL database on a
workstation that doesn't currently have or need one. Plus I have my
doubts (I'm open to correction) on the performance aspects of
communicating with a SQL database as compared to code integrated into
the application.
What I perhaps didn't make clear in my first post is that this is
required as part of a construction process (ie;getting ready for the
big show rather than the big show itself). When there is a huge number
of nodes users will accept that it takes longer than five minutes to
get ready to process them but anything more than half an hour and they
will become concerned.
--
Andrue Cope [TeamB]
[Bicester, Uk]
http://info.borland.com/newsgroups/guide.html |
|
| Back to top |
|
 |
Andrue Cope [TeamB] Guest
|
Posted: Mon Mar 13, 2006 4:03 pm Post subject: Re: Tracking a lot of numbers to avoid repetition |
|
|
Gillmer J. Derge [TeamB] wrote:
| Quote: | I think he's concerned with size, since each value in a set would
require 3 to 5 times (depending on whether we're dealing with 32 or
64 bit pointers) as much memory as it would in a vector.
|
Primarily yes. We're happy to trade a little performance away just to
be assured that we can at least do the job. The nature of the work is
such that "we can get most of it" or "there might be a few drop outs"
is not good enough. If a disk has 200 million objects on it then that's
how many we have to find and process.
| Quote: | Ultimately any homebuilt solution you're thinking about is
essentially just a crappy database that's good enough to do the job
but no better.
|
That depends. It could also be a hand tuned, optimised database. A
couple of years ago we looked at using SQL for our backend database of
objects. We found that it just didn't have the performance we needed
for file system access. Interestingly Microsoft appear to have come to
same conclusion with WinFS :)
--
Andrue Cope [TeamB]
[Bicester, Uk]
http://info.borland.com/newsgroups/guide.html |
|
| Back to top |
|
 |
vavan Guest
|
Posted: Mon Mar 13, 2006 4:03 pm Post subject: Re: Tracking a lot of numbers to avoid repetition |
|
|
On Mon, 13 Mar 2006 14:51:49 +0000, "Andrue Cope [TeamB]"
<no.spam (AT) not (DOT) a.valid.address> wrote:
| Quote: | Unfortunately that would mean impementing a SQL database on a
workstation that doesn't currently have or need one. Plus I have my
doubts (I'm open to correction) on the performance aspects of
communicating with a SQL database as compared to code integrated into
the application.
|
you forgot that there are db-engines available free from sql-layer
overhead. and sometimes you can even integrate them right in your
original application
--
Vladimir Ulchenko aka vavan |
|
| Back to top |
|
 |
David Dean Guest
|
Posted: Mon Mar 13, 2006 4:03 pm Post subject: Re: Tracking a lot of numbers to avoid repetition |
|
|
In article <44155958$1 (AT) newsgroups (DOT) borland.com>,
"Andrue Cope [TeamB]" <no.spam (AT) not (DOT) a.valid.address> wrote:
| Quote: | So to give a more concrete example:If I said I was going to throw a
couple of hundred million 64 bit random numbers at you how would you
detect duplicates? Oh and performance does matter
|
A binary tree is my first inclination. I imagine this would get
memory intensive, so I might load subtrees in from files on every nth
node.
--
-David
Nihil curo de ista tua stulta superstitione. |
|
| Back to top |
|
 |
|