MurmurHash problem

e$B$J$+$@$G$9!#e(B

At Tue, 11 Mar 2008 10:20:27 +0900 (JST),
[email protected] wrote in [ruby-cvs:22983]:

Log:
* string.c (hash): replaced by MurmurHash described in
http://murmurhash.googlepages.com/.

e$B$3$l$G$Oe(Baligne$B$5$l$F$$$J$$e(Bworde$B%“%/%;%9$r5v$5$J$$%7%9%F%`$G$OF0$-e(B
e$B$^$;$s!#e(Bmurmurhashaligned.cppe$B$H$$$&$N$b$”$k$h$&$G$9$,!"$3$l$^$?e(B

  • inte$B$,e(B32bite$B$G$J$$$H$J$i$J$$e(B
  • little endiane$B$G$J$$$H$J$i$J$$e(B

e$B$H$$$&@)8B$,$"$j$=$&$G$9!#e(B

inte$B$,e(B64bite$B$N%7%9%F%`$G$be(B32bite$B@0?t$,$“$kJ]>Z$C$F$”$k$s$G$7$?$C$1!#e(B

e$B$^$D$b$He(B e$B$f$-$R$m$G$9e(B

In message “Re: [ruby-dev:34020] MurmurHash problem”
on Tue, 11 Mar 2008 17:32:57 +0900, Nobuyoshi N.
[email protected] writes:

|e$B$3$l$G$Oe(Baligne$B$5$l$F$$$J$$e(Bworde$B%“%/%;%9$r5v$5$J$$%7%9%F%`$G$OF0$-e(B
|e$B$^$;$s!#e(Bmurmurhashaligned.cppe$B$H$$$&$N$b$”$k$h$&$G$9$,!“$3$l$^$?e(B
|
|* inte$B$,e(B32bite$B$G$J$$$H$J$i$J$$e(B
|* little endiane$B$G$J$$$H$J$i$J$$e(B
|
|e$B$H$$$&@)8B$,$”$j$=$&$G$9!#e(B

e$B$=$&$+$“!#;DG0$@$J$”!#$^$?e(BJenkinse$B%O%C%7%e$KLa$9$+$J$"!#e(B

|inte$B$,e(B64bite$B$N%7%9%F%`$G$be(B32bite$B@0?t$,$“$kJ]>Z$C$F$”$k$s$G$7$?$C$1!#e(B

e$BJ]>Z$O$J$$$G$7$g$&$M!#e(B

e$B@.@%$G$9!#e(B

Yukihiro M. wrote:

|
|e$B$H$$$&@)8B$,$"$j$=$&$G$9!#e(B

e$B$=$&$+$"!#;DG0$@$J$"!#$^$?e(BJenkinse$B%O%C%7%e$KLa$9$+$J$"!#e(B

|inte$B$,e(B64bite$B$N%7%9%F%`$G$be(B32bite$B@0?t$,$"$kJ]>Z$C$F$"$k$s$G$7$?$C$1!#e(B

e$BJ]>Z$O$J$$$G$7$g$&$M!#e(B

C99 e$B$Ne(B stdint.h e$B$G$Oe(B uintN_t (N e$B$Oe(B 8,16,32,64)
e$B$,I,?$K$J$C$F$$$^$9$+$i!"e(B
C99 e$BBP1~$N%3%s%Q%$%i$J$i$PB8:$7$^$9!#e(B
e$B$J$N$G!"e(B32bit e$B$N%7%9%F%$G$O8@$&$^$G$b$J$/!"e(B 64bit e$BBP1~$N%7%9%F%e(B (=e$B$3$l$+$i$N%7%9%F%`e(B)
e$B$G$b$^$:3N<B$Ke(B 32bit e$B@0?t$,e(B
e$BB8:
$9$k$H8@$C$F$h$$$+$H;W$$$^$9!#e(B

e$B0J2<$N$h$&$J$b$N$re(B defines.h e$B$K$*$$$?$&$($Ge(B uint32_t
e$B$r;H$&$h$&$K$9$l$P$h$$$+$H!#e(B

#if HAVE_STDINT_H

include <stdint.h>

#else

if SIZEOF_INT == 4

ifndef int32_t

typedef int int32_t;

define int32_t int

endif

ifndef int32_t

typedef unsigned int uint32_t

endif

endif

#else

error ---->> int32_t is not defined. <<----

#endif

e$B%(%s%G%#%"%s$O>o$K%P%$%H$4$H$K<h$C$F$$$/$H$+$G$9$+$M$’!#e(B
e$B:G?7$N%3%s%Q%$%i$@$H$=$N$X$s$A$c$s$HNI$-$K$O$+$i$C$F$/$l$=$&$J5$$,$9$k$N$G$9$,!"e(B
e$B>l9gJ,$1$7$?J}$,$$$$$N$+$J$!!#e(B

e$B87L)$K$d$m$&$H$9$k$H<B9T;~J,4t$,I,MW$=$&$J$N$G$9$,!&!&!&!#e(B

e$B$H$3$m$G!"e(BRuby 1.9
e$B$C$F$I$NDxEY$^$GA0Ds$K$7$F$$$$$N$G$9$C$1!#e(B

  • ANSI C e$B$G$"$ke(B
  • char e$B$Oe(B8bit e$B$Ge(B CHAR_BIT e$B$Oe(B 8
  • char e$B$,e(B singed e$BA0Ds$OIT2De(B
  • int, long, pointer e$B$Oe(B32bit e$B0J>e$Ne(B 2 e$B$NN_>he(B
  • int == long e$BA0DsIT2De(B (LP64e$BEye(B (e$BB?$/$Ne(B UNIX e$B7Oe(B))
  • long == pointer e$BA0DsIT2De(B (LLP64e$BEye(B (Win64))
  • long long e$BA0DsIT2De(B ?
  • C99 e$BA0Ds$OIT2De(B
  • e$B%(%s%G%#%"%s$Oe(B big/little e$B$r9MN8e(B (_BIG_ENDIAN
    e$BEy$rA0Ds$K$7$F$O$J$i$J$$e(B)

e$B$"$H$Oe(B int e$B$He(B long e$B$N;H$$J,$1$G$7$g$&$+!#$J$s$H$J$/!"e(B

  • int e$B$O$?$@$N@0?t!#e(B
  • long e$B$OJ8;zNsD9$dJ8;z0LCVEy!"%]%$%s%?$H$N1i;;$,9T$o$l$&$k$b$N!#e(B
    e$B$+$J$!$H;W$C$F$$$k$N$G$9$,!"e(BLP64 e$B$@$He(Bint e$B$K$*$$$Fe(B
    64bit e$B2=$N287C$,<u$1$i$l$:!"e(B
    LLP64 e$B$@$He(B int == long != *void e$B$J$N$Ge(B long
    e$B$rJ,$1$k0U5A$,$J$$$J$!$H$+!#e(B
    e$B8e<T$Oe(B intptr_t e$B$K$7$F$b$h$5$=$&$J5$$b$7$^$9!#e(B

成瀬です。

というわけで、int32_t, uint32_t, intptr_t, uintptr_t を定義しつつ、
hash() をアライメント考慮させるパッチです。

U.Nakamura wrote:

| あとは int と long の使い分けでしょうか。なんとなく、
| * int はただの整数。
| * long は文字列長や文字位置等、ポインタとの演算が行われうるもの。
| かなぁと思っているのですが、LP64 だとint において 64bit 化の恩恵が受けられず、
| LLP64 だと int == long != *void なので long を分ける意義がないなぁとか。
| 後者は intptr_t にしてもよさそうな気もします。

longを禁止にすると私が喜ぶような気がします。
が、世界中の拡張ライブラリ作者が泣くんじゃないですかね。

禁止は「ぎゃっ」どころのさわぎじゃなさそうですが、
Ruby 本体からの追放と非推奨化はできそうですよね。

e$B$3$s$K$A$O!"$J$+$`$ie(B(e$B$&e(B)e$B$G$9!#e(B

In message “[ruby-dev:34024] Re: MurmurHash problem”
on Mar.12,2008 03:34:41, [email protected] wrote:
| * long long e$BA0DsIT2De(B ?

e$BIT2D$G$9!#8=>u$G$b$=$l$r8!=P$7$F=hM}$rJ,4t$7$F$k$O$:$G$9$1$Ie(B
e$B$M!#e(B

| e$B$"$H$Oe(B int e$B$He(B long e$B$N;H$$J,$1$G$7$g$&$+!#$J$s$H$J$/!“e(B
| * int e$B$O$?$@$N@0?t!#e(B
| * long e$B$OJ8;zNsD9$dJ8;z0LCVEy!”%]%$%s%?$H$N1i;;$,9T$o$l$&$k$b$N!#e(B
| e$B$+$J$!$H;W$C$F$$$k$N$G$9$,!"e(BLP64 e$B$@$He(Bint e$B$K$*$$$Fe(B 64bit e$B2=$N287C$,<u$1$i$l$:!"e(B
| LLP64 e$B$@$He(B int == long != *void e$B$J$N$Ge(B long e$B$rJ,$1$k0U5A$,$J$$$J$!$H$+!#e(B
| e$B8e<T$Oe(B intptr_t e$B$K$7$F$b$h$5$=$&$J5$$b$7$^$9!#e(B

longe$B$r6X;_$K$9$k$H;d$,4n$V$h$&$J5$$,$7$^$9!#e(B
e$B$,!"@$3&Cf$N3HD%%i%$%V%i%j:n<T$,5c$/$s$8$c$J$$$G$9$+$M!#e(B

e$B$=$l$G$O!#e(B

e$B$J$+$@$G$9!#e(B

At Wed, 12 Mar 2008 19:50:18 +0900,
NARUSE, Yui wrote in [ruby-dev:34026]:

e$B$H$$$&$o$1$G!"e(Bint32_t, uint32_t, intptr_t, uintptr_t e$B$rDj5A$7$D$D!“e(B
hash() e$B$r%”%i%$%a%s%H9MN8$5$;$k%Q%C%A$G$9!#e(B

inttypes.he$B$r%A%'%C%/$7$F$J$$$h$&$J5$$,$7$^$9!#e(B

e$B$"$He(BBig
endiane$B$K$J$k$H$d$dCY$/$J$j$=$&$J$N$G!"0l1~:n$C$F$*$$$?$Ne(B
e$B$b=P$7$F$_$^$9!#e(B

Index: configure.in

— configure.in (revision 15758)
+++ configure.in (working copy)
@@ -583,5 +583,5 @@ AC_CHECK_HEADERS(stdlib.h string.h unist
syscall.h pwd.h grp.h a.out.h utime.h memory.h direct.h
sys/resource.h
sys/mkdev.h sys/utime.h xti.h netinet/in_systm.h float.h ieeefp.h
pthread.h \

  • ucontext.h intrinsics.h langinfo.h locale.h)
    
  • ucontext.h intrinsics.h langinfo.h locale.h stdint.h)
    

dnl Check additional types.
@@ -621,4 +621,16 @@ AC_CHECK_TYPES(struct timespec)
AC_CHECK_TYPE(fd_mask, [AC_DEFINE(HAVE_RB_FD_INIT, 1)])

+test ${rb_cv_type_uint32_t+set} && ac_cv_type_uint32_t=yes
+AC_CHECK_TYPE(uint32_t)
+if test ${ac_cv_type_uint32_t} != yes; then

  • AC_CACHE_CHECK([unsigned 32bit int],
  • rb_cv_type_uint32_t,
  • [for type in short int long; do
  •  AC_COMPILE_IFELSE(AC_LANG_BOOL_COMPILE_TRY([], [sizeof($type) == 
    

4]),

  • [rb_cv_type_uint32_t=$type; break], [])
  • done])
  • AC_DEFINE(uint32_t, $rb_cv_type_uint32_t)
    +fi

AC_CACHE_CHECK(for stack end address, rb_cv_stack_end_address,
[rb_cv_stack_end_address=no
Index: string.c

— string.c (revision 15758)
+++ string.c (working copy)
@@ -26,4 +26,8 @@
#endif

+#if HAVE_STDINT_H
+#include <stdint.h>
+#endif
+
VALUE rb_cString;
VALUE rb_cSymbol;
@@ -1686,4 +1690,11 @@ rb_str_concat(VALUE str1, VALUE str2)
}

+#if defined i386 || defined _M_IX86 || defined __ia64
+#define UNALIGNED_WORD_ACCESS 1
+#endif
+#ifndef UNALIGNED_WORD_ACCESS
+#define UNALIGNED_WORD_ACCESS 0
+#endif
+
/* MurmurHash described in http://murmurhash.googlepages.com/ */
unsigned int
@@ -1695,14 +1706,93 @@ hash(const unsigned char * data, int len
h += 0xdeadbeef;

  • while(len >= 4) {
  • h += *(unsigned int *)data;
  • h *= m;
  • h ^= h >> r;
  • if (len >= 4) {
    +#if !UNALIGNED_WORD_ACCESS
  • int align = (VALUE)data & 3;
  • if (align) {
  •  uint32_t t = 0, d = 0;
    
  •  int sl, sr, pack;
    
  •  switch (align) {
    

+#ifdef WORDS_BIGENDIAN

  •    case 1: t |= data[2];
    
  •    case 2: t |= data[1] << 8;
    
  •    case 3: t |= data[0] << 16;
    

+#else

  •    case 1: t |= data[2] << 16;
    
  •    case 2: t |= data[1] << 8;
    
  •    case 3: t |= data[0];
    

+#endif

  •  }
    
  •  t <<= (8 * align);
    
  •  data += 4-align;
    
  •  len -= 4-align;
    
  •  sl = 8 * (4-align);
    
  •  sr = 8 * align;
    
  •  while (len >= 4) {
    
  • d = *(uint32_t *)data;
    +#ifdef WORDS_BIGENDIAN
  • t = (t << sr) | (d >> sl);
    +#else
  • t = (t >> sr) | (d << sl);
    +#endif
  • h += t;
  • h *= m;
  • h ^= h >> r;
  • t = d;
  • data += 4;
  • len -= 4;
  •  }
    
  •  pack = len < align ? len : align;
    
  •  d = 0;
    
  •  switch (pack) {
    

+#ifdef WORDS_BIGENDIAN

  •    case 3: d |= data[2];
    
  •    case 2: d |= data[1] << 8;
    
  •    case 1: d |= data[0] << 16;
    
  •    case 0:
    
  • h += (t << sr) | (d >> sl);
    +#else
  •    case 3: d |= data[2] << 16;
    
  •    case 2: d |= data[1] << 8;
    
  •    case 1: d |= data[0];
    
  •    case 0:
    
  • h += (t >> sr) | (d << sl);
    +#endif
  • h *= m;
  • h ^= h >> r;
  •  }
    
  • data += 4;
  • len -= 4;
  •  data += pack;
    
  •  len -= pack;
    
  • }

  • else
    +#endif

  • {

  •  do {
    
  • h += *(uint32_t *)data;

  • h *= m;

  • h ^= h >> r;

  • data += 4;

  • len -= 4;

  •  } while (len >= 4);
    
  • }
    }

    switch(len) {
    +#ifdef WORDS_BIGENDIAN

  •  case 3:
    
  • h += data[2];

  •  case 2:
    
  • h += data[1] << 8;

  •  case 1:
    
  • h += data[0] << 16;
    +#else
    case 3:
    h += data[2] << 16;
    @@ -1711,4 +1801,5 @@ hash(const unsigned char * data, int len
    case 1:
    h += data[0];
    +#endif
    h *= m;
    h ^= h >> r;

pegacorn e$B$G$9!#e(B

From: “NARUSE, Yui” [email protected]
Date: Wed, 12 Mar 2008 03:34:41 +0900
Message-Id: [email protected]
/~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

|inte$B$,e(B64bite$B$N%7%9%F%`$G$be(B32bite$B@0?t$,$“$kJ]>Z$C$F$”$k$s$G$7$?$C$1!#e(B

e$BJ]>Z$O$J$$$G$7$g$&$M!#e(B

C99 e$B$Ne(B stdint.h e$B$G$Oe(B uintN_t (N e$B$Oe(B 8,16,32,64) e$B$,I,?$K$J$C$F$$$^$9$+$i!"e(B
C99 e$BBP1~$N%3%s%Q%$%i$J$i$PB8:_$7$^$9!#e(B

e$B5,3J=q$OFI$s$G$$$J$$$N$G$9$,!"e(B
uintN_t e$B$OG$0U$J$N$G$O$J$$$G$7$g$&$+!#e(B

e$B%W%m%0%i%_%s%08@8le(B C e$B$N?75!G=e(B:

ƒvƒƒOƒ‰ƒ~ƒ“ƒOŒ¾Œê C ‚̐V‹@”\>
e$B87L)$JI}$r;}$D@0?t7?$K4X$9$kDj5Ae(B
intN_t e$BId9fIU$-@0?t7?$Ne(B typedef e$BL>e(B
uintN_t e$BId9f$J$7@0?t7?$Ne(B typedef e$BL>e(B
e$B87L)$K$=$NI}$r;}$D@0?t7?!#$3$l$i$N7?$OG$0U$G$"$k$,!“e(B
e$B$b$7<BAu$,e(B 8, 16, 32, 64 e$B%S%C%H$N@0?t7?$rDs6!$7$F$$$k$J$ie(B
e$BAjEv$9$ke(B typedef e$BL>$rDj5A$9$kI,MW$,$”$k!#e(B
<<

e$B@.@%$G$9!#e(B

pegacorn wrote:

C99 e$B$Ne(B stdint.h e$B$G$Oe(B uintN_t (N e$B$Oe(B 8,16,32,64) e$B$,I,?$K$J$C$F$$$^$9$+$i!“e(B
e$B87L)$K$=$NI}$r;}$D@0?t7?!#$3$l$i$N7?$OG$0U$G$”$k$,!“e(B
e$B$b$7<BAu$,e(B 8, 16, 32, 64 e$B%S%C%H$N@0?t7?$rDs6!$7$F$$$k$J$ie(B
e$BAjEv$9$ke(B typedef e$BL>$rDj5A$9$kI,MW$,$”$k!#e(B

e$B$$!“e(BC99 e$B$@$H$?$7$+$K;}$C$F$$$l$P!”$K$J$C$F$$$^$9$M!#e(B

e$B$^$!!"e(BPOSIX e$B$NJ}$G$Oe(B 32 e$B$OI,?$K$J$C$F$$$^$9$7!“e(B
http://www.opengroup.org/onlinepubs/009695399/basedefs/stdint.h.html
Win64 e$B$O$=$b$=$be(B int e$B$,e(B 32bit
e$B$J$N$G!”$3$N$I$A$i$+$H$J$k$H!"e(B
e$B$=$l$J$j$K!VJ]>c!W$5$l$F$$$k$H8@$C$F$$$$$H$O;W$&$N$G$9$,!#e(B

e$B$^$D$b$He(B e$B$f$-$R$m$G$9e(B

In message “Re: [ruby-dev:34027] Re: MurmurHash problem”
on Wed, 12 Mar 2008 23:04:38 +0900, Nobuyoshi N.
[email protected] writes:

|At Wed, 12 Mar 2008 19:50:18 +0900,
|NARUSE, Yui wrote in [ruby-dev:34026]:
|> e$B$H$$$&$o$1$G!“e(Bint32_t, uint32_t, intptr_t, uintptr_t e$B$rDj5A$7$D$D!“e(B
|> hash() e$B$r%”%i%$%a%s%H9MN8$5$;$k%Q%C%A$G$9!#e(B
|
|inttypes.he$B$r%A%'%C%/$7$F$J$$$h$&$J5$$,$7$^$9!#e(B
|
|e$B$”$He(BBig endiane$B$K$J$k$H$d$dCY$/$J$j$=$&$J$N$G!"0l1~:n$C$F$*$$$?$Ne(B
|e$B$b=P$7$F$_$^$9!#e(B

e$B$A$c$s$HF0$/$h$&$J$i%3%_%C%H$7$F$b$i$&$H$"$j$,$?$$$N$G$9$,!#e(B