NASTY bug

Dear devs,

I am trying to find a nasty bug in

lib/raw/context/session/cookie.rb

This file implements a cookie-based session store, i.e. the session data
is serialized to/from a cookie.
For security we store both the serialized session data and an encrypted
version of it (called the digest).

When deserializing, we check the raw data against the digest to find out
if the user has tampered with the data.

This scheme works about 90% of the time. But sometimes (seemingly at random)
the digest check fails (i.e. crypt(data) != digest)
for no apparent reason.
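
For readers who want something concrete to compare against, here is a minimal sketch of this kind of signed-cookie scheme. It is not the code in cookie.rb: the names encode_session, decode_session and secure_compare, the SECRET constant, the "." separator, and the choice of JSON plus HMAC-SHA256 are all assumptions made for illustration. The payload is encoded with URL-safe Base64 so that the signed string contains no "+", "/" or newline characters, which some cookie handling can alter in transit; nothing in this thread confirms that this is the cause of the failures described here.

```ruby
require "openssl"
require "base64"
require "json"

# Hypothetical secret; a real application would load this from configuration.
SECRET = "change-me"

# Compare two strings without stopping at the first differing byte.
def secure_compare(a, b)
  return false unless a.bytesize == b.bytesize
  diff = 0
  a.bytes.zip(b.bytes) { |x, y| diff |= x ^ y }
  diff.zero?
end

# Compute the hex HMAC-SHA256 of the encoded payload.
def sign(payload, secret)
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new("SHA256"), secret, payload)
end

# Serialize the session and return "payload.digest".
def encode_session(session, secret = SECRET)
  payload = Base64.urlsafe_encode64(JSON.generate(session))
  "#{payload}.#{sign(payload, secret)}"
end

# Return the session hash, or nil if the digest does not match.
def decode_session(cookie, secret = SECRET)
  payload, _, digest = cookie.to_s.rpartition(".")
  return nil if payload.empty? || digest.empty?
  return nil unless secure_compare(sign(payload, secret), digest)
  JSON.parse(Base64.urlsafe_decode64(payload))
end
```

The digest is computed over the encoded string exactly as it is stored in the cookie, so the check does not depend on re-serializing the session to the same bytes. When a check fails, logging the received payload next to the one that was sent shows whether the bytes changed between the two.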

I would really like to ask everyone on this list with some free time to
have a look at the code and help me track down
this nasty bug.

thanks in advance,
-g.

On Nov 9, 3:54 am, “George M.” [email protected]
wrote:

When deserializing, we check the raw data against the digest to find out if
the user has tampered with the data.

This scheme works about 90% of the time. But sometimes (seemingly at random)
the digest check fails (i.e. crypt(data) != digest)
for no apparent reason.

I would really like to ask everyone on this list with some free time to have
a look at the code and help me track down
this nasty bug.

Are you busting the 4K size limit?

T.

Are you busting the 4K size limit?

No, this is not the problem… I have a separate check for that…

It is the digest integrity test that fails.

-g.

On Nov 9, 2007 7:54 PM, George M.
[email protected] wrote:

When deserializing, we check the raw data against the digest to find out if
the user has tampered with the data.

This scheme works about 90% of the time. But sometimes (seemingly at random)
the digest check fails (i.e. crypt(data) != digest)
for no apparent reason.

I don’t use Nitro, so I am only replying because your context could involve
simultaneous disk and network activity, in which case your experience might
mirror mine, and it took me months to work out what it was…
I had file copies randomly fail cmp/diff checks.
I reproduce some details below.
If I were you I’d jump straight to the kernel boot parameters, place
the disks and network under heavy load, and look for lost ticks in
/var/log/messages.

Apparent symptom:

  • Files copied to the PVFS2 area might fail a diff or cmp check
    (see thread below).
  • Typically this occurs when:
    a) large files are copied and
    b) several clients are copying/reading to the PVFS2 area.
  • no errors were reported in /var/log/messages (but you might see
    reports about lost ticks or cpu frequency changes)

Real symptom:

  • The disks are being placed under load when the network connection
    is also under some load.

Related reports:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=55223
AMD64 X2 lost ticks on PM timer — Linux Kernel

How I diagnosed:

  • kernel boot parameters:
    report_lost_ticks apic=debug mce=bootlog showopts
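
To go through a large log for the lost-tick reports mentioned above, a small filter along these lines may help. It is only a sketch: the helper name lost_tick_lines and the pattern are assumptions, and the exact wording of the kernel messages differs between kernel versions, so the pattern may need adjusting.

```ruby
# Assumed pattern: matches lines mentioning lost or losing ticks.
# Kernel wording varies by version, so adjust as needed.
LOST_TICK_PATTERN = /lost.*tick|losing.*tick/i

# Return only the lines that look like lost-tick reports.
def lost_tick_lines(lines)
  lines.select { |line| line =~ LOST_TICK_PATTERN }
end

# Example use (reading the log usually requires root):
#   puts lost_tick_lines(File.foreach("/var/log/messages"))
```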

Conjectured Workaround

This workaround allowed me to download, compile and install a new kernel. These
boot parameters may or may not remedy the inconsistent file copy
results…

  • Add kernel boot parameter (severe and gave me boot up problems)
    noapic
  • Or, less severe, and worked for me, add:
    no_timer_check

Solution:

  • Upgrade to kernel 2.6.21 or more recent (I’m using 2.6.21.5).
    No kernel parameters need to be passed, e.g. you can drop no_timer_check.

System:

  • 3 sata drives arranged as 3 stripe LVM, formatted with xfs
    (openSUSE10.2 defaults)
  • This may be specific to the nVidia ck804 chipset and/or the AMD
    64bit processors (?)

HTH?
Mark