Let's Stop Saying “Statistically Significant” [内科のメモ帳]

Let's Stop Saying “Statistically Significant”

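In 2016, the American Statistical Association (ASA) published a statement in The American Statistician warning against the misuse of statistical significance and P values.¹⁾
- The same issue also carried many commentaries on the topic.
In March 2019, the journal published a special issue, “Statistical inference in the 21st century: a world beyond P < 0.05”.²⁾
- In that issue the editors introduce the collection of papers with the admonition “Don’t say ‘statistically significant’”.³⁾
- A separate article with dozens of signatories⁴⁾ likewise calls on authors and journal editors to stop using these terms.
That same month, Nature published a Comment making the same case.⁵⁾
- The Comment notes that more than 800 scientists signed in support of its draft within one week.
- “We agree, and call for the entire concept of statistical significance to be abandoned.”
In 2022, authors of that Comment published a paper proposing the use of P-value functions.⁶⁾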
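
A P-value function reports the P value for every hypothesized effect size rather than for the null value alone. The following is a minimal sketch of that idea, not the procedure of the cited paper: it assumes a normal approximation, and the estimate `0.18`, the standard error `0.11`, and the helper names are hypothetical choices made for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the approximation


def p_value(estimate, se, hypothesized):
    """Two-sided P value for one hypothesized effect (normal approximation)."""
    z = (estimate - hypothesized) / se
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def p_value_function(estimate, se, grid):
    """Evaluate the P value at every hypothesized effect in `grid`."""
    return [(h, p_value(estimate, se, h)) for h in grid]


# Hypothetical study result: an effect estimate and its standard error.
estimate, se = 0.18, 0.11

# Hypothesized effects from -0.2 to 0.5 in steps of 0.1.
grid = [round(-0.2 + 0.1 * i, 1) for i in range(8)]
curve = p_value_function(estimate, se, grid)

for hypothesized, p in curve:
    print(f"hypothesized effect {hypothesized:+.1f}: p = {p:.3f}")
```

Reading the whole curve shows how compatibility with the data falls off gradually on both sides of the point estimate, instead of reducing the result to a single P value against zero.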

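Points Emphasized in the 2019 Nature Comment

The key proposal here is to move away from “dichotomania”, the habit of forcing results into two bins.
“We are not calling for a ban on P values.” The authors are not asking for P values to be prohibited.
What they ask for instead is an end to the conventional, dichotomous use of P values to decide whether a result refutes or supports a scientific hypothesis: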

"Rather, and in line with many others over the decades, we are calling for a stop to the use of P values in the conventional, dichotomous way — to decide whether a result refutes or supports a scientific hypothesis."

Stop Categorizing Results as Significant or Not

“bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different”
- Sorting results into “statistically significant” and “statistically non-significant” bins leads people to regard the two bins as different in kind.
“the false belief that crossing the threshold of statistical significance is enough to show that a result is ‘real’ has led scientists and journal editors to privilege such results, thereby distorting the literature.”
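- The mistaken belief that crossing the significance threshold is enough to show a result is “real” has led scientists and journal editors to favor “statistically significant” results, and much of the literature has been distorted as a consequence.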
“the rigid focus on statistical significance encourages researchers to choose data and methods that yield statistical significance for some desired (or simply publishable) result, or that yield statistical non-significance for an undesired result, such as potential side effects of drugs — thereby invalidating conclusions.”
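- Fixation on statistical significance pushes researchers to choose data and methods that yield significance for desired (or readily publishable) results and non-significance for unwelcome ones, such as drug side effects. Conclusions reached this way are invalid.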
“we are not advocating a ban on P values, confidence intervals or other statistical measures — only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors.”
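- The authors do not propose banning P values, confidence intervals, or other statistical measures. They object only to treating them categorically, which covers the significant/non-significant dichotomy and also categorization based on other measures such as Bayes factors.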
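
To make the problem with binning concrete, here is a small sketch with two invented studies that report the same point estimate with different precision. All numbers are hypothetical and a normal approximation is assumed. One study lands in the “significant” bin and the other does not, yet a direct comparison finds no difference between them at all.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(estimate, se, null=0.0):
    """Two-sided P value against `null` under a normal approximation."""
    z = (estimate - null) / se
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


# Hypothetical studies: (point estimate, standard error).
study_a = (0.18, 0.05)  # larger study, more precise
study_b = (0.18, 0.11)  # smaller study, less precise

p_a = two_sided_p(*study_a)
p_b = two_sided_p(*study_b)

# Compare the studies directly instead of comparing their bins.
difference = study_a[0] - study_b[0]
se_difference = sqrt(study_a[1] ** 2 + study_b[1] ** 2)
p_difference = two_sided_p(difference, se_difference)

print(f"study A: p = {p_a:.4f}")
print(f"study B: p = {p_b:.4f}")
print(f"A versus B: difference = {difference:.2f}, p = {p_difference:.2f}")
```

Labelling study A “significant” and study B “non-significant” would suggest a conflict that the estimates themselves do not show.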

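Let's Stop Saying “Confidence Interval”

The following passage is extremely important and worth rereading several times.
- In short: accept uncertainty. One practical step is to rename confidence intervals as “compatibility intervals” (rendered in Japanese as 両立性区間 or 適合性区間) and to interpret them without overconfidence. Authors should spell out the practical implications of every value in the interval, especially the point estimate and the limits, because all of those values are reasonably compatible with the data under the assumptions used. Singling out one value in the interval, such as the null, as what has been “shown” therefore makes no sense.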

We must learn to embrace uncertainty. One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence. Specifically, we recommend that authors describe the practical implications of all values inside the interval, especially the observed effect (or point estimate) and the limits. In doing so, they should remember that all the values between the interval’s limits are reasonably compatible with the data, given the statistical assumptions used to compute the interval. Therefore, singling out one particular value (such as the null value) in the interval as ‘shown’ makes no sense.

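Compatibility Intervals

“Compatible” carries the senses of “not contradictory”, “fitting”, and “able to coexist”.
The term “compatibility interval” captures the point that every estimate inside the interval is consistent with the observed data under the specified statistical model, meaning that it would not be rejected at significance level α.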
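
The “not rejected at level α” reading can be checked numerically by inverting the test: scan hypothesized values, keep those whose P value exceeds α, and compare the range that survives with the usual closed-form interval. This is a sketch under a normal approximation; the estimate, standard error, grid bounds, and function names are all hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(estimate, se, hypothesized):
    """Two-sided P value for a hypothesized effect (normal approximation)."""
    z = (estimate - hypothesized) / se
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def compatible_values(estimate, se, alpha=0.05, lo=-1.0, hi=1.0, step=0.001):
    """Grid values that a level-alpha test would NOT reject."""
    n_steps = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n_steps + 1)]
    return [h for h in grid if two_sided_p(estimate, se, h) > alpha]


estimate, se, alpha, step = 0.18, 0.11, 0.05, 0.001  # hypothetical inputs

kept = compatible_values(estimate, se, alpha=alpha, step=step)

# Closed-form 95% interval for comparison: estimate +/- z * se.
z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
closed_form = (estimate - z_crit * se, estimate + z_crit * se)

print(f"test inversion: {min(kept):.3f} to {max(kept):.3f}")
print(f"closed form:    {closed_form[0]:.3f} to {closed_form[1]:.3f}")
```

The two ranges agree up to the grid spacing, which is the sense in which the interval is the set of values compatible with the data under the model.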

Four Points for Interpreting CIs

Excerpted from the Nature Comment⁷⁾

First, just because the interval gives the values most compatible with the data, given the assumptions, it doesn’t mean values outside it are incompatible; they are just less compatible. In fact, values just outside the interval do not differ substantively from those just inside the interval. It is thus wrong to claim that an interval shows all possible values.
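- The interval holds the values most compatible with the data under the assumptions, but values outside it are only less compatible, not incompatible. Values just outside the limits hardly differ from values just inside, so an interval cannot be said to show all possible values.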
Second, not all values inside are equally compatible with the data, given the assumptions. The point estimate is the most compatible, and values near it are more compatible than those near the limits. This is why we urge authors to discuss the point estimate, even when they have a large P value or a wide interval, as well as discussing the limits of that interval.
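- Values inside the interval are not equally compatible with the data. The point estimate is the most compatible, and values near it are more compatible than values near the limits. Authors are therefore urged to discuss the point estimate as well as the limits, even when the P value is large or the interval is wide.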
Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention. It is based on the false idea that there is a 95% chance that the computed interval itself contains the true value, coupled with the vague feeling that this is a basis for a confident decision. A different level can be justified, depending on the application.
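- Like the 0.05 threshold, the default 95% level is an arbitrary convention. It rests on the false idea that the computed interval has a 95% chance of containing the true value, together with a vague sense that this justifies a confident decision. A different level may be justified depending on the application.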
Last, and most important of all, be humble: compatibility assessments hinge on the correctness of the statistical assumptions used to compute the interval. In practice, these assumptions are at best subject to considerable uncertainty. Make these assumptions as clear as possible and test the ones you can, for example by plotting your data and by fitting alternative models, and then reporting all results.
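- Above all, be humble. Compatibility assessments depend on the correctness of the statistical assumptions behind the interval, and in practice those assumptions are at best quite uncertain. State them as clearly as possible, check the ones you can (for example by plotting the data and fitting alternative models), and report all the results.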
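
The first three points can be seen in a few lines of arithmetic. The sketch below assumes a normal approximation and uses an invented estimate and standard error. It compares P values at the point estimate and just inside and just outside the 95% upper limit, then prints intervals at several levels to show that 95% is only one choice among many.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(estimate, se, hypothesized):
    """Two-sided P value for a hypothesized effect (normal approximation)."""
    z = (estimate - hypothesized) / se
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def interval(estimate, se, level):
    """Symmetric interval at the given level (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(0.5 + level / 2.0)
    return (estimate - z_crit * se, estimate + z_crit * se)


estimate, se = 0.18, 0.11  # hypothetical inputs
lower, upper = interval(estimate, se, 0.95)

# Points 1 and 2: compatibility changes smoothly across the limit.
p_at_estimate = two_sided_p(estimate, se, estimate)
p_inside = two_sided_p(estimate, se, upper - 0.01)
p_outside = two_sided_p(estimate, se, upper + 0.01)
print(f"at the point estimate:   p = {p_at_estimate:.3f}")
print(f"just inside the limit:   p = {p_inside:.3f}")
print(f"just outside the limit:  p = {p_outside:.3f}")

# Point 3: the level is a convention, and other levels are equally computable.
levels = (0.50, 0.80, 0.95, 0.99)
intervals = [interval(estimate, se, level) for level in levels]
for level, (lo, hi) in zip(levels, intervals):
    print(f"{level:.0%} interval: {lo:+.3f} to {hi:+.3f}")
```

The fourth point is not something code can settle: every number above is only as trustworthy as the normal approximation and the model behind the standard error.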

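Criticism of Abandoning Claims of Significance

The most common objection to retiring the term “statistically significant” is that it is needed for yes-or-no decisions. The authors answer that, for the choices often required in regulatory, policy, and business settings, decisions based on costs, benefits, and the likelihoods of all potential consequences always beat decisions based on statistical significance alone: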

The objection we hear most against retiring statistical significance is that it is needed to make yes-or-no decisions. But for the choices often required in regulatory, policy and business environments, decisions based on the costs, benefits and likelihoods of all potential consequences always beat those made based solely on statistical significance.
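
As a rough illustration of what a decision based on costs, benefits, and likelihoods can look like, the sketch below compares the expected net benefit of two actions across a handful of scenarios. Every scenario, probability, and payoff here is invented for the example, and how such probabilities would be obtained in a real application is outside the scope of this page.

```python
# Hypothetical scenarios:
# name -> (probability, net benefit if we act, net benefit if we do not act)
scenarios = {
    "harmful": (0.10, -50.0, 0.0),
    "no effect": (0.30, -10.0, 0.0),
    "moderate benefit": (0.45, 40.0, 0.0),
    "large benefit": (0.15, 120.0, 0.0),
}


def expected_value(table, column):
    """Probability-weighted average of one payoff column (1 = act, 2 = do not act)."""
    return sum(row[0] * row[column] for row in table.values())


# The probabilities must describe a complete set of outcomes.
total_probability = sum(row[0] for row in scenarios.values())
assert abs(total_probability - 1.0) < 1e-9

ev_act = expected_value(scenarios, 1)
ev_wait = expected_value(scenarios, 2)
decision = "act" if ev_act > ev_wait else "do not act"

print(f"expected net benefit if we act:        {ev_act:.1f}")
print(f"expected net benefit if we do not act: {ev_wait:.1f}")
print(f"decision: {decision}")
```

The decision here still comes out as yes or no, but it is driven by the size and likelihood of each consequence rather than by which side of a threshold a P value falls on.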

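Teaching Hypothesis Testing Remains Important

Modern Epidemiology makes the point that the statistical theory and methods for confidence intervals parallel those for hypothesis testing, that testing theory is one of the foundations for understanding estimation in data analysis, and that learning it is therefore useful.⁸⁾
Rejoicing or despairing over whether a P value falls above or below the alpha level is beside the point, but learning the reasoning behind testing remains important.⁹⁾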

¹⁾

ASA Statement on Statistical Significance and P-Values (DOI, Japanese translation)

²⁾

The American Statistician, Volume 73, Issue sup1 (2019) Link

³⁾

Ronald L. Wasserstein, Allen L. Schirm & Nicole A. Lazar (2019) Moving to a World Beyond “p<0.05”, The American Statistician, 73:sup1, 1-19 DOI

⁴⁾

Stuart H. Hurlbert, Richard A. Levine & Jessica Utts (2019) Coup de Grâce for a Tough Old Bull: “Statistically Significant” Expires, The American Statistician, 73:sup1, 352-357 DOI

⁵⁾ , ⁷⁾

Nature 567, 305-307 (2019) DOI

⁶⁾

Amrhein, V., & Greenland, S. (2022). Discuss practical importance of results based on interval estimates and p-value functions, not only on point estimates and null p-values. Journal of Information Technology, 37(3), 316–320. DOI

⁸⁾

Modern Epidemiology [Rothman et al., 2008]

⁹⁾

佐藤俊哉, ASA声明と疫学研究におけるP値 [The ASA Statement and P Values in Epidemiologic Research], 計量生物学 (Japanese Journal of Biometrics), 2017-2018, Vol. 38, No. 2, pp. 109-115 J-STAGE
